The attention mechanism has become widespread in neural NLP modeling, but where did it come from?
Artificial neural networks (ANNs) have seen significant uptake across natural language processing, including machine translation. However, the popular encoder-decoder style recurrent neural networks (RNNs) that are often used compress the input, along with the relevant context, into a single fixed-size vector representation in order to predict the output. Combined with an RNN's bias towards context close to the output, this makes RNNs poorly suited to long sentences. To address these issues, Bahdanau, Cho, and Bengio proposed augmenting RNNs with an attention mechanism in their 2015 paper “Neural Machine Translation by Jointly Learning to Align and Translate” (link), focusing specifically on the task of machine translation. The extension is added to the decoder part of the RNN so that, for each target word (i.e. each word being predicted in the target language), the decoder uses a distinct context vector alongside its usual input (the previous target words). In other words, each target word is predicted from the input and a context vector comprising the relevant parts of the source sentence.
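To make this concrete, below is a minimal sketch of one decoder step that consumes a per-position context vector. The tanh recurrence, weight matrices, and dimensions are illustrative assumptions, not the paper's exact parameterisation.

```python
import numpy as np

# Illustrative sizes: decoder hidden size 4, embedding size 3, target vocab size 10.
H, E, V = 4, 3, 10
rng = np.random.default_rng(0)
W_s = rng.normal(size=(H, H))      # previous decoder state -> new state
W_y = rng.normal(size=(H, E))      # previous target word embedding -> new state
W_c = rng.normal(size=(H, 2 * H))  # per-step context vector -> new state
W_out = rng.normal(size=(V, H))    # new state -> vocabulary logits

def decoder_step(s_prev, y_prev_emb, c_i):
    """One decoder step: the next state (and hence the predicted word)
    depends on a context vector c_i computed afresh for this target
    position, rather than on a single fixed encoder summary."""
    s_i = np.tanh(W_s @ s_prev + W_y @ y_prev_emb + W_c @ c_i)
    logits = W_out @ s_i
    probs = np.exp(logits - logits.max())
    return s_i, probs / probs.sum()
```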
The annotations for the source sentence are produced by a bidirectional RNN (BiRNN). A BiRNN is essentially a combination of two RNNs, a forward RNN and a backward RNN. The forward RNN reads the input in the forward direction (left-to-right for English) and computes a sequence of forward hidden states (FHS). The backward RNN reads the input in reverse and computes a sequence of backward hidden states (BHS). Then, for each word sᵢ in the source sentence, its annotation hᵢ is obtained by concatenating the forward hidden state and the backward hidden state at position i. Thus, hᵢ can be thought of as an embedding of sᵢ in the context of the source sentence: it summarizes both the words that precede and the words that follow sᵢ, with emphasis on the words closest to sᵢ, since an RNN tends to represent recent inputs more strongly.
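As a rough illustration, here is a simple tanh-RNN version of the bidirectional encoder. The GRU-style gating used in the paper is omitted, and the two directions share input weights purely for brevity; all names and sizes are assumptions for the sketch.

```python
import numpy as np

def rnn_pass(embeddings, W_in, W_h):
    """Run a simple tanh RNN over the word embeddings and return
    the hidden state at every position."""
    h = np.zeros(W_h.shape[0])
    states = []
    for x in embeddings:
        h = np.tanh(W_in @ x + W_h @ h)
        states.append(h)
    return states

def annotations(embeddings, W_in, W_h_fwd, W_h_bwd):
    """Annotation h_i = [forward hidden state at i ; backward hidden state at i]."""
    fwd = rnn_pass(embeddings, W_in, W_h_fwd)              # reads left-to-right
    bwd = rnn_pass(embeddings[::-1], W_in, W_h_bwd)[::-1]  # reads right-to-left, re-aligned
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]

# Example: 5 source words, embedding size 3, hidden size 4 -> each annotation has length 8.
rng = np.random.default_rng(0)
embs = [rng.normal(size=3) for _ in range(5)]
W_in = rng.normal(size=(4, 3))
W_f, W_b = rng.normal(size=(4, 4)), rng.normal(size=(4, 4))
hs = annotations(embs, W_in, W_f, W_b)
```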
The alignment model put forward by Bahdanau et al. is a small feedforward network that scores how relevant each annotation hⱼ is to predicting the current target word. These scores are normalized into weights (via a softmax), and the context vector for that target word is the weighted sum of all the annotations, i.e. the expected annotation under those weights. Thus, the alignment model gives the overall model attention: it can selectively ignore or focus on different parts of the context, or source sentence, for each prediction.
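The sketch below follows the additive (Bahdanau-style) form of this scoring, where each annotation is scored against the previous decoder state by a one-hidden-layer feedforward network; the variable names and dimensions are illustrative assumptions rather than the paper's exact setup.

```python
import numpy as np

def additive_attention(s_prev, annotations, W_a, U_a, v_a):
    """Score each annotation against the previous decoder state with a
    small feedforward network, softmax the scores into weights, and
    return the context vector as the weighted sum of annotations
    (the 'expected annotation')."""
    scores = np.array([v_a @ np.tanh(W_a @ s_prev + U_a @ h) for h in annotations])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    context = sum(w * h for w, h in zip(weights, annotations))
    return context, weights

# Example: decoder hidden size 4, annotation size 8 (forward + backward), 5 source words.
rng = np.random.default_rng(0)
s_prev = rng.normal(size=4)
hs = [rng.normal(size=8) for _ in range(5)]
W_a, U_a, v_a = rng.normal(size=(4, 4)), rng.normal(size=(4, 8)), rng.normal(size=4)
context, weights = additive_attention(s_prev, hs, W_a, U_a, v_a)
```

The returned weights show, for this target position, how strongly each source-word annotation contributes to the context vector, which is exactly the "pay attention or ignore" behaviour described above.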