Link to original paper: https://arxiv.org/abs/1706.03762

Why and what is the Transformer architecture

Before Transformers, RNNs such as LSTMs and GRUs were dominant in sequence-modeling tasks. However, RNNs process sequences sequentially, which limits parallelization. RNNs and CNNs also struggle with longer sequences: RNNs suffer from vanishing gradients and lose information from earlier time steps, while CNNs need large kernel sizes (or many stacked convolutional layers) to capture long-range dependencies between distant positions in the data. At the time this paper was published, self-attention was gaining traction because it is effective at capturing dependencies within sequences, but it was usually employed alongside recurrent architectures.

So the Transformer stands out as the first model relying entirely on self-attention to compute representations of both its input and output, without using sequence-aligned RNNs or convolution.

Why Self-Attention?


  1. Computational Efficiency
    1. Sequential operations: A self-attention layer executes a constant ($O(1)$) number of sequential operations, whereas a recurrent layer requires a number of sequential operations linear in the sequence length ($O(n)$).
    2. Complexity per Layer:
      1. In a self-attention layer, each element attends to every other element, resulting in $n^2$ pairwise interactions. Each interaction is a dot product over vectors of dimension $d$ (the dimension of the representations), giving $O(n^2 \cdot d)$ complexity per layer.
      2. In a recurrent layer, the main cost is the matrix multiplication between the input and the hidden state using a $d \times d$ weight matrix ($O(d^2)$ per step). This is executed once for each element of the sequence, giving $O(n \cdot d^2)$ complexity per layer.
      3. As long as the sequence length $n$ is smaller than the representation dimension $d$, self-attention layers are faster than recurrent layers, which is usually the case for sentence representations (see the back-of-the-envelope comparison after this list).
  2. Parallelisation: Because of the constant number of sequential operations, self-attention can be highly parallelised. Recurrent layers are inherently sequential in their processing → limits parallelisation.
  3. Path Length between Dependencies: Learning long-range dependencies is crucial in many tasks. Self-attention shortens the paths that forward and backward signals have to traverse in the network (i.e. the distance between supervision and input), which facilitates learning long-range dependencies. Recurrent and convolutional layers generally have longer dependency paths.
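
To make the complexity comparison concrete, here is a quick back-of-the-envelope sketch in Python. The values $n = 50$ and $d = 512$ are illustrative assumptions ($d = 512$ matches the base model's $d_{model}$; $n$ is a typical sentence length), not numbers taken from the paper.

```python
# Back-of-the-envelope operation counts for one layer (constant factors ignored).
# n = sequence length, d = model dimension; both values are illustrative assumptions.
n, d = 50, 512

self_attention_ops = n * n * d   # O(n^2 * d): every pair of positions interacts
recurrent_ops      = n * d * d   # O(n * d^2): one d x d matmul per time step

print(f"self-attention: {self_attention_ops:,} ops")   # 1,280,000
print(f"recurrent:      {recurrent_ops:,} ops")        # 13,107,200
print(f"ratio (recurrent / self-attention): {recurrent_ops / self_attention_ops:.1f}x")  # 10.2x
```

With these assumed values the self-attention layer does roughly 10x fewer operations; the advantage flips only once $n$ grows past $d$.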

Overview of the Transformer architecture


Step-by-step Explanation of How Transformers Work

Let’s pretend we are training the Transformer to translate English sentences into French. We feed a training set of English-French sentence pairs into the model.


Step 1 (Word Vectorisation): convert the input (English sentence) and output (French sentence) sequences into vectors of dimension $d_{model}$ using a word-embedding layer. The same weight matrix is shared between the input-token embedding, the output-token embedding, and the pre-softmax linear transformation in the decoder. This lets the model learn one shared set of parameters for both token embedding and the final token-prediction step in the decoder.
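
As a rough sketch of this weight sharing (not the paper's actual code), a PyTorch-style module could reuse the embedding matrix for the pre-softmax projection. The class name `TiedEmbedding`, `vocab_size=32000`, and the example tensors are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TiedEmbedding(nn.Module):
    """Minimal sketch: a token embedding whose weight matrix is reused
    as the pre-softmax linear projection in the decoder (weight tying)."""
    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.d_model = d_model
        self.embed = nn.Embedding(vocab_size, d_model)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # Embeddings are scaled by sqrt(d_model), as described in the paper.
        return self.embed(token_ids) * self.d_model ** 0.5

    def project_to_vocab(self, decoder_output: torch.Tensor) -> torch.Tensor:
        # Reuse the same weight matrix for the pre-softmax projection.
        return decoder_output @ self.embed.weight.t()

# Usage sketch (vocab_size and token ids are made up for illustration)
tied = TiedEmbedding(vocab_size=32000, d_model=512)
tokens = torch.tensor([[1, 5, 42]])        # (batch=1, seq_len=3)
x = tied(tokens)                           # (1, 3, 512)
logits = tied.project_to_vocab(x)          # (1, 3, 32000)
```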


Step 2 (Positional Encoding): because the Transformer is not sequential like an RNN, it has to inject positional information directly into the embeddings by adding a positional encoding (PE). Sinusoidal functions (sine and cosine of different frequencies) transform the position indices into continuous vectors, giving the model a representation of token order.
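
The paper defines the encoding as $PE_{(pos, 2i)} = \sin(pos / 10000^{2i/d_{model}})$ and $PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i/d_{model}})$. Below is a minimal NumPy sketch of that formula; the `seq_len` and `d_model` values in the usage line are illustrative.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
       PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))"""
    positions = np.arange(seq_len)[:, np.newaxis]            # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[np.newaxis, :]           # (1, d_model/2), i.e. 2i
    angles = positions / np.power(10000, dims / d_model)     # (seq_len, d_model/2)

    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions
    pe[:, 1::2] = np.cos(angles)   # odd dimensions
    return pe

# Usage sketch: the encoding is simply added to the embedded tokens
pe = sinusoidal_positional_encoding(seq_len=10, d_model=512)   # (10, 512)
# embedded_tokens = embedded_tokens + pe
```

Because each dimension uses a different frequency, every position gets a unique pattern, and relative offsets correspond to linear functions of the encoding, which is one reason the paper gives for choosing sinusoids.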