Link to original paper: https://arxiv.org/abs/1706.03762
Before Transformers, RNNs like LSTMs and GRUs were dominant in sequence-modeling tasks. However, RNNs process sequences sequentially, token by token, which limits parallelization. RNNs and CNNs also struggle with longer sequences: RNNs suffer from vanishing gradients and lose information from earlier time steps, while CNNs require large kernel sizes (or many stacked layers) to capture long-range dependencies between distant positions in the data. At the time this paper was published, self-attention was gaining traction because it is effective at capturing dependencies within sequences, but it was usually employed alongside recurrent architectures.
The Transformer therefore stands out as the first model to rely entirely on self-attention to compute representations of both its input and output, without using sequence-aligned RNNs or convolutions.
Let’s pretend we are training the Transformer model to translate English sentences to French sentences, and feed a training set of English–French sentence pairs into the model.
Step 1 (Word Vectorisation): convert the input (English) and output (French) sequences into vectors of dimension $d_{model}$ using a word-embedding layer. The same weight matrix is shared between the input-token embedding, the output-token embedding, and the pre-softmax linear transformation in the decoder. This gives a single learned set of parameters for both the token embeddings and the final token-prediction step in the decoder.
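As a rough sketch of this weight sharing (the vocabulary size here is a hypothetical placeholder, not from the paper, and this is not the authors' actual code), a single matrix can serve both as the embedding table and as the pre-softmax projection; the paper additionally scales the embeddings by $\sqrt{d_{model}}$:

```python
import numpy as np

# Illustrative sketch of a shared embedding / pre-softmax weight matrix.
# vocab_size is a made-up value; d_model = 512 matches the base model in the paper.
vocab_size, d_model = 10000, 512
rng = np.random.default_rng(0)

# One shared weight matrix W of shape (vocab_size, d_model).
W = rng.normal(scale=d_model ** -0.5, size=(vocab_size, d_model))

def embed(token_ids):
    """Look up token embeddings; the paper scales them by sqrt(d_model)."""
    return W[token_ids] * np.sqrt(d_model)

def pre_softmax_logits(decoder_output):
    """Project decoder outputs back to vocabulary logits using the same matrix W."""
    return decoder_output @ W.T  # shape: (..., vocab_size)

# Usage: embed a (batch, seq_len) array of token ids.
x = embed(np.array([[5, 42, 7]]))   # -> (1, 3, d_model)
logits = pre_softmax_logits(x)      # -> (1, 3, vocab_size)
```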
Step 2 (Positional Encoding): because the Transformer does not process tokens sequentially like an RNN, it has to inject positional information directly into the embeddings by adding positional encodings (PE). Sinusoidal functions (sine and cosine) are used to map the position indices to vectors, which represents positional information in a continuous manner.
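The paper defines the encoding as $PE_{(pos, 2i)} = \sin(pos / 10000^{2i/d_{model}})$ and $PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i/d_{model}})$. A minimal sketch of how this matrix can be built (continuing the hypothetical setup from the embedding example above):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Build the (seq_len, d_model) sinusoidal positional-encoding matrix from the paper."""
    positions = np.arange(seq_len)[:, np.newaxis]      # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[np.newaxis, :]     # (1, d_model/2), the even indices 2i
    angles = positions / np.power(10000.0, dims / d_model)

    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions use cosine
    return pe

# Usage: the encoding is added element-wise to the (scaled) token embeddings.
pe = sinusoidal_positional_encoding(seq_len=3, d_model=512)
# x_with_positions = x + pe        # x from the embedding sketch above
```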