Applying the attention principle to more tasks meant that the attention mechanism had to be modified.
The attention mechanism proposed by Bahdanau, Cho, and Bengio, while showing significant improvement over a standard recurrent neural network for machine translation tasks, struggles with more complex tasks like document classification or other tasks that depend on complex or indirect relationships (e.g. temporal reasoning). For such tasks, more complex attention mechanisms are needed.
What follows are brief descriptions of four variations on the attention mechanism that have been developed: multi-dimensional, hierarchical, self, and memory-based. These variations are all generalizable; task-specific mechanisms are not discussed. They are presented in no particular order.
Instead of using a scalar to represent the attention score of a term, as the basic form of attention does, multi-dimensional attention computes a vector-valued attention score for each term in the input. It does this by gluing together multiple scalar values. The resulting vector captures the interactions between terms across multiple representation spaces, since each scalar value is an interaction in a single dimension. An example of multi-dimensional attention in action is provided by Wang, Cao, Melo, and Liu (link):
“Fizzy drinks and meat cause heart disease and diabetes.”
In the given example, there is a relationship between “drinks” and “diabetes”. However, because of the distance between the two tokens, more traditional neural network approaches are unable to capture this relationship. More advanced methods, such as LSTM models and dependency-tree-based RNN designs, can be used, but they still require dependency parsing or the training of multiple models. These approaches can therefore be resource and/or compute intensive, limiting the environments in which they can be used. Using a multi-dimensional attention mechanism, we first capture the relevance of each word in the input to every other word. We then enrich the output of that attention mechanism with other contextual information (e.g. relevant n-gram information obtained from a corpus). The resulting contextual vectors have more relational information that we can then use. This means our system can handle tasks that require a more complex understanding of the input without needing additional contextual structures like an ontology.
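The idea of gluing several scalar scores into one vector-valued score per term can be sketched in a few lines of plain Python. This is a minimal illustration, not the method of Wang et al.: the function name `multi_dim_attention`, the use of one query vector per representation space, and dot-product scoring are all assumptions made for the example.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def multi_dim_attention(tokens, queries):
    """tokens: n embedding vectors; queries: K query vectors, one per
    representation space (hypothetical stand-ins for learned parameters).
    Returns (weights, contexts), where weights[i] is the K-dimensional
    attention score of token i and contexts[k] is the context vector
    produced by dimension k."""
    n, k_dims, width = len(tokens), len(queries), len(tokens[0])
    # One ordinary scalar attention distribution per dimension.
    per_dim = [softmax([dot(q, t) for t in tokens]) for q in queries]
    # Glue the scalars together: each token now has a vector of scores.
    weights = [[per_dim[k][i] for k in range(k_dims)] for i in range(n)]
    # Each dimension yields its own weighted sum of the token vectors.
    contexts = [
        [sum(per_dim[k][i] * tokens[i][j] for i in range(n)) for j in range(width)]
        for k in range(k_dims)
    ]
    return weights, contexts
```

In this sketch every token ends up with K scores instead of one, so a token can be highly relevant in one representation space and barely relevant in another.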
A shortfall of the attention mechanisms discussed thus far is that they do not work well for inputs larger than sentences, like documents. This is where hierarchical attention comes in. Take the following example document:
How do I get rid of all the old web searches I have on my web browser? I want to clean up my web browser. Go to tools -> options, then click “delete history” and “clean up temporary internet files”.
The goal is to classify the above example into the appropriate document category (e.g. Computer and Internet). Classification is done by first looking for terms that provide clues about the category. For example, the terms “web searches” and “web browser” are commonplace in computer-related documents. Additionally, the phrase “How do I get rid of” can clue us in as well, because non-professionals unfamiliar with a piece of computer software may look for a “how to” guide. So in order to accurately classify the document, we only need to pay attention to the relevant parts of it. Furthermore, the relevant terms increase the classification value of the sentences they are in. So relevant terms feed into relevant sentences, which can then feed into relevant paragraphs, and so on. We see a nested or hierarchical structure.
To model this hierarchical structure, we use multiple encoders and attention layers at different levels (e.g. a word-level encoder and attention layer, a sentence-level encoder and attention layer, etc.) to ascertain the relevant parts of the document, with each layer utilizing the information provided by the previous layers. In the example of a bottom-up hierarchical approach (link), we figure out which words in a sentence are relevant to the meaning of the sentence. We then figure out which sentences are relevant to the meaning of the paragraph. We repeat this as needed until we can classify the document.
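The bottom-up flow described above can be sketched as two stacked attention-pooling steps. This is a simplified illustration under stated assumptions: the encoders are omitted, words are assumed to arrive as pre-computed embedding vectors, and `word_context` and `sentence_context` are hypothetical stand-ins for the learned context vectors a real model would train.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(vectors, context):
    # Score each vector against the context, then return the weighted sum.
    weights = softmax([dot(v, context) for v in vectors])
    width = len(vectors[0])
    pooled = [sum(w * v[j] for w, v in zip(weights, vectors)) for j in range(width)]
    return pooled, weights

def hierarchical_attention(document, word_context, sentence_context):
    """document: list of sentences, each a list of word vectors.
    Returns (doc_vector, sentence_weights, word_weights)."""
    sentence_vectors, word_weights = [], []
    for sentence in document:
        # Word level: which words matter for this sentence?
        vec, weights = attend(sentence, word_context)
        sentence_vectors.append(vec)
        word_weights.append(weights)
    # Sentence level: which sentences matter for the document?
    doc_vector, sentence_weights = attend(sentence_vectors, sentence_context)
    return doc_vector, sentence_weights, word_weights
```

The resulting `doc_vector` would then be passed to a classifier. Adding a paragraph level would mean one more call to `attend` over groups of sentence vectors.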
Unlike the basic attention mechanism, which looks at the input to determine relevance, self attention looks “inward” at its memory. For example, given the token t in the input sentence s, when processing t, the basic attention mechanism looks at s to find tokens that are relevant to t. Self attention instead looks at the memory of the system to determine the relevant tokens. This is like how we, when processing a given stimulus, search our memory for relevant information (i.e. memories). When we find a relevant memory, we pull up not only that memory to inform the processing of the stimulus, but also other memories that informed that memory. Thus, we have indirect relevance. Since these indirectly relevant memories may be less relevant to the current stimulus than the directly relevant ones, we may put less weight on them.
Self attention was applied to the task of machine reading by Jianpeng Cheng, Li Dong, and Mirella Lapata in their paper “Long Short-Term Memory Networks for Machine Reading” (link). This gave the machine reader the ability to perform shallow structure reasoning over the input streams as it incrementally read them, simulating human behavior.
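Attending over memory while reading incrementally can be sketched as follows. This is a heavily simplified illustration and not the architecture from Cheng et al.: there is no LSTM, scoring uses a plain dot product, and averaging the current token with the memory summary is an assumption made for the example. The function name `read_incrementally` is hypothetical.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def read_incrementally(token_vectors):
    """Read tokens one at a time, attending over what has been stored so far.
    Returns (memory, trace), where trace[i] holds the attention weights the
    i-th token placed on the memory slots written before it."""
    memory, trace = [], []
    for vec in token_vectors:
        if memory:
            # Look "inward": score the current token against every slot.
            weights = softmax([dot(vec, slot) for slot in memory])
            summary = [
                sum(w * slot[j] for w, slot in zip(weights, memory))
                for j in range(len(vec))
            ]
        else:
            weights, summary = [], [0.0] * len(vec)
        # The stored state mixes the token with what it recalled, so later
        # tokens that attend to this slot are indirectly informed by earlier ones.
        state = [(x + s) / 2 for x, s in zip(vec, summary)]
        memory.append(state)
        trace.append(weights)
    return memory, trace
```

Because each stored state already blends in earlier memories, attending to one slot also brings along a diluted share of the slots that informed it, which mirrors the indirect relevance described above.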
Memory-based attention is, in a manner of speaking, related to self attention in that they both operate on memory. As put forward by Kumar, Irsoy, Ondruska, Iyyer, Bradbury, Gulrajani, Zhong, Paulus, and Socher in their paper “Ask Me Anything: Dynamic Memory Networks for Natural Language Processing” (link), memory-based attention was applied to the task of question answering, adding temporal reasoning. Take the following example put forward by Kumar et al.:
I: Jane went to the hallway.
I: Mary walked to the bathroom.
I: Sandra went to the garden.
I: Daniel went back to the garden.
I: Sandra took the milk there.
Q: Where is the milk?
“Milk” and “garden” are not directly linked in a single input sentence. However, they are indirectly linked by “Sandra”. Therefore, by looking at the inputs that contain “Sandra”, we can reason where “milk” is.
The architecture put forward by Kumar et al. for question answering consists of four modules: the input, question, episodic memory, and answer modules. The iterative episodic memory module is where the magic happens. Starting with the question, the episodic memory iterates over the inputs (I) and identifies the relevant input sentences. When relevant input sentences are identified, the system identifies additional information that can be “reasoned” from those sentences for relevance matching. The input sentences are then scanned again for relevant information, this time using the additional information identified in the previous pass. This process is repeated as needed until an answer is arrived at.
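The repeated scanning can be illustrated with a toy sketch on the example above. It is not the Dynamic Memory Network itself: the real model uses learned sentence representations and gating, whereas this sketch assumes simple word overlap as the relevance score. The stop-word list, the function names, and the fixed number of passes are all hypothetical, and the answer module is left out.

```python
import math

FACTS = [
    "Jane went to the hallway.",
    "Mary walked to the bathroom.",
    "Sandra went to the garden.",
    "Daniel went back to the garden.",
    "Sandra took the milk there.",
]
QUESTION = "Where is the milk?"
STOP_WORDS = {"the", "to", "is", "where", "there"}  # hypothetical, for this toy only

def content_words(text):
    return set(text.lower().strip(".?").split()) - STOP_WORDS

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def episodic_passes(facts, question, passes=2):
    """Return the facts attended to on each pass, in order."""
    memory = content_words(question)  # start from the question
    visited = []
    for _ in range(passes):
        # Score each fact by overlap with memory; mask facts already used.
        scores = [
            float("-inf") if i in visited else len(memory & content_words(fact))
            for i, fact in enumerate(facts)
        ]
        weights = softmax(scores)
        best = max(range(len(facts)), key=lambda i: weights[i])
        visited.append(best)
        # Enrich memory with what the attended fact contributes.
        memory |= content_words(facts[best])
    return [facts[i] for i in visited]
```

With the five inputs and two passes, the sketch attends first to “Sandra took the milk there.” and then, because “Sandra” is now in memory, to “Sandra went to the garden.”, which is the indirect link the example relies on.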