Self-Attention with Relative Position Representations (2018)
August 19, 2023

The authors of Self-Attention with Relative Position Representations presented a way of injecting relative position representations into the self-attention mechanism of the Transformer. In contrast to recurrent and convolutional neural networks, the Transformer does not explicitly model position information in its structure. The original positional encoding employs sine and cosine functions of different frequencies, and these encodings are added to the input embeddings at the bottoms of the encoder and decoder stacks. The authors of the Transformer hypothesized that sinusoidal positional encodings would help the model generalize to sequence lengths unseen during training. The relative position representations share this hypothesis: in contrast to absolute position representations, they are invariant to the total sequence length.
The self-attention mechanism maps an input sequence \(x=(x_1 , \dots , x_n)\) of \(n\) elements, where \(x_i\in\mathbb{R}^{d_x}\), to a new sequence \(z=(z_1,\dots , z_n)\) of \(n\) elements, where \(z_i\in\mathbb{R}^{d_z}\). Each output is computed as a weighted sum of linearly transformed input elements:
$$ \begin{align} z_i&=\sum^n_{j=1}\alpha_{ij}(x_jW^V)\\ \alpha_{ij}&=\frac{\exp e_{ij}}{\sum^n_{k=1}\exp e_{ik}}\\ e_{ij}&=\frac{(x_iW^Q)(x_jW^K)^T}{\sqrt{d_z}} \end{align} $$
where the projections are parameter matrices \(W^Q, W^K, W^V \in \mathbb{R}^{d_x\times d_z}\).
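As a concrete reference, here is a minimal sketch of equations (1)-(3) for a single attention head in PyTorch. The function name `self_attention`, the batch-free shapes, and the random initialization are illustrative assumptions, not the authors' reference implementation.

```python
import torch
import torch.nn.functional as F

def self_attention(x, W_Q, W_K, W_V):
    """Single-head self-attention per equations (1)-(3). Shapes are assumptions:
    x: (n, d_x); W_Q, W_K, W_V: (d_x, d_z)."""
    d_z = W_Q.shape[1]
    q = x @ W_Q                          # (n, d_z) queries
    k = x @ W_K                          # (n, d_z) keys
    v = x @ W_V                          # (n, d_z) values
    e = q @ k.T / d_z ** 0.5             # (n, n) scaled dot-product logits, eq. (3)
    alpha = F.softmax(e, dim=-1)         # (n, n) attention weights, eq. (2)
    z = alpha @ v                        # (n, d_z) weighted sum of values, eq. (1)
    return z

# Toy usage with hypothetical dimensions
n, d_x, d_z = 5, 16, 8
x = torch.randn(n, d_x)
W_Q, W_K, W_V = (torch.randn(d_x, d_z) for _ in range(3))
z = self_attention(x, W_Q, W_K, W_V)     # -> shape (5, 8)
```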
They extended self-attention to consider the pairwise relationships between input elements without additional linear transformations, modifying equations (1) and (3):
$$ \begin{align} z_i&=\sum^n_{j=1}\alpha_{ij}(x_jW^V+w^V_{\text{clip}(j-i,k)})\\ e_{ij}&=\frac{(x_iW^Q)(x_jW^K+w^K_{\text{clip}(j-i, k)})^T}{\sqrt{d_z}} \end{align} $$
They employed the function \(\text{clip}(x, k)\) to clip the maximum relative position to an absolute value of \(k\):
$$ \text{clip}(x, k)=\max(-k, \min(k, x)) $$
They hypothesized that precise relative position information is not useful beyond a certain distance \(k\). The parameters of the relative position representations are \(w^K_i,w^V_i\in \mathbb{R}^{d_a}\) (with \(d_a=d_z\)), collected as \(w^K=(w^K_{-k},\dots ,w^K_{k})\) and \(w^V=(w_{-k}^V,\dots ,w_k^V)\), so each of \(w^K\) and \(w^V\) holds \(2k+1\) learned vectors.
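Below is a corresponding sketch of equations (4) and (5), again for a single head in PyTorch: the relative positions \(j-i\) are clipped to \([-k, k]\) and used to index the learned embedding tables \(w^K\) and \(w^V\). The helper name `relative_self_attention`, the variable names `max_rel`, `a_K`, `a_V`, and the unbatched shapes are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn.functional as F

def relative_self_attention(x, W_Q, W_K, W_V, w_K, w_V, max_rel):
    """Single-head relation-aware self-attention per equations (4)-(5).
    x: (n, d_x); W_*: (d_x, d_z); w_K, w_V: (2*max_rel + 1, d_z)."""
    n = x.shape[0]
    d_z = W_Q.shape[1]
    q, k, v = x @ W_Q, x @ W_K, x @ W_V              # each (n, d_z)

    # clip(j - i, max_rel) for every pair (i, j), shifted to a non-negative
    # index into the (2*max_rel + 1)-row embedding tables.
    pos = torch.arange(n)
    rel = torch.clamp(pos[None, :] - pos[:, None], -max_rel, max_rel) + max_rel
    a_K = w_K[rel]                                   # (n, n, d_z) relative keys
    a_V = w_V[rel]                                   # (n, n, d_z) relative values

    # e_ij = (x_i W^Q)(x_j W^K + w^K_{clip(j-i,k)})^T / sqrt(d_z), eq. (5)
    e = (q @ k.T + torch.einsum('id,ijd->ij', q, a_K)) / d_z ** 0.5
    alpha = F.softmax(e, dim=-1)

    # z_i = sum_j alpha_ij (x_j W^V + w^V_{clip(j-i,k)}), eq. (4)
    z = alpha @ v + torch.einsum('ij,ijd->id', alpha, a_V)
    return z

# Toy usage with hypothetical dimensions and clipping distance k = 2
n, d_x, d_z, k_clip = 5, 16, 8, 2
x = torch.randn(n, d_x)
W_Q, W_K, W_V = (torch.randn(d_x, d_z) for _ in range(3))
w_K = torch.randn(2 * k_clip + 1, d_z)
w_V = torch.randn(2 * k_clip + 1, d_z)
z = relative_self_attention(x, W_Q, W_K, W_V, w_K, w_V, k_clip)  # -> (5, 8)
```

Note that the only new parameters relative to the sketch of equations (1)-(3) are the two \((2k+1)\times d_z\) tables, which depend on the clipping distance \(k\) rather than on the sequence length.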