Transformers: An Unexpected NLP Revolution

Chuong Ngo, Technical Consultant

When you need to go faster but RNNs cannot keep up, whom do you call? Transformers, of course.

Google’s Bidirectional Encoder Representations from Transformers (BERT) has taken the NLP world by storm. It has broken numerous records, and there seems to be no stopping it. At its core, BERT uses the Transformer. The Transformer is a paradigm shift in NLP modeling. Introduced in December of 2017 by Vaswani et al. (link), it upends what we thought we knew about NLP models, jettisoning traditional beliefs about NLP model design in favor of a new architecture. I have wanted to talk about the Transformer for some time now. However, I first needed to spend some time reviewing foundational knowledge and traditional architectures (i.e., my previous posts on artificial intelligence). With that review finished, let’s talk about this new, exciting architecture.

No Recurrence, No Problem

The traditional kings of NLP modeling have been flavors of the RNN (e.g., the LSTM). Sentences are sequential data structures: the word at time t depends on some of the words preceding it for context about its meaning, linguistic form, and more. With its recurrent design, the RNN seems like a natural fit for text. RNNs encode history into their hidden state. A timestep's hidden state is passed on to the following timestep, where it feeds into that timestep's hidden state calculation. In other words, hₜ depends on hₜ₋₁, so one cannot calculate hₜ without already having computed hₜ₋₁. To determine hₜ₋₁, one needs hₜ₋₂, and the dependency chain continues until you reach h₁. So, RNNs cannot use parallelization across timesteps to go faster. Long sequences also become a problem, because memory constraints limit how much can be processed at once.
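To make that dependency chain concrete, here is a minimal NumPy sketch of a vanilla RNN forward pass (the names rnn_forward, W_x, and W_h are my own for illustration, not from any paper). The loop body at step t cannot start until step t−1 has finished, which is exactly why the computation cannot be parallelized across timesteps.

```python
import numpy as np

def rnn_forward(inputs, W_x, W_h, b):
    """Run a vanilla RNN over a sequence, one timestep at a time.

    inputs: (seq_len, input_dim)
    W_x:    (input_dim, hidden_dim) input-to-hidden weights
    W_h:    (hidden_dim, hidden_dim) hidden-to-hidden weights
    b:      (hidden_dim,) bias
    """
    hidden_dim = W_h.shape[0]
    h = np.zeros(hidden_dim)          # h_0: the initial hidden state
    hidden_states = []
    for x_t in inputs:                # inherently sequential loop:
        # h_t cannot be computed until h_{t-1} is available
        h = np.tanh(x_t @ W_x + h @ W_h + b)
        hidden_states.append(h)
    return np.stack(hidden_states)    # (seq_len, hidden_dim)
```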

The recurrent nature of RNNs makes them a natural fit for sequential data.

However, now you feel the need… the need for speed. Transformers forgo recurrence entirely and instead rely on self-attention. As a result, sequential computation is unnecessary. That, in turn, makes parallelization possible, which equates to a faster model architecture.

The Attention of it All

As the paper’s title states, attention is all you need. Self-attention is the Transformer’s secret sauce. Vaswani et al. called their flavor of attention “Scaled Dot-Product Attention”. As discussed in a previous post (link), self-attention is when the model looks at its own input (e.g., a sentence) to determine which parts are relevant to which other parts. For example, given the input sentence “a quick brown fox”, the words “quick” and “brown” are pertinent to “fox”. “Thomas thought the appointment was tomorrow, so he slept in” is another example: here, “Thomas” is relevant to “he”. The second example showcases a long-range dependency. Unlike the recurrence used by RNNs, attention connects any two positions directly, so the range of a dependency is not limited by the distance between the words.

Transformers forgo recurrence entirely.

The formula for scaled dot product attention is:

Attention(Q, K, V) = softmax[(QKᵀ) / √dₖ]V

The attention function maps a query and a set of key-value pairs to a weighted sum of the values. The weight for each value is the relevance of the value’s corresponding key to the query. The inputs to the function are the query (Q), key (K), and value (V) matrices. Each matrix packs together word embeddings (in the full Transformer, learned linear projections of those embeddings). Q has the queries (i.e., the current words) packed together in a matrix. K contains the keys (i.e., all the words in the sentence), and V holds the values (also all the words in the sentence), packed the same way as Q. The attention function takes Q and the transpose of K (i.e., Kᵀ) and calculates their matrix product, which is the dot product of every query with every key. The resulting matrix holds a raw relevance score for every query-key pair, i.e., how much each word influences itself and the other words.
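As a toy illustration of that packing, here is a small NumPy sketch using the “a quick brown fox” example. The embedding values are invented purely for illustration, and Q, K, and V are simply set to the packed embeddings (the full Transformer learns separate projections for each).

```python
import numpy as np

# Toy 3-dimensional "embeddings" for the sentence "a quick brown fox".
# The numbers are made up purely for illustration.
embeddings = np.array([
    [1.0, 0.0, 1.0],   # "a"
    [0.0, 2.0, 0.0],   # "quick"
    [1.0, 1.0, 0.0],   # "brown"
    [0.0, 1.0, 1.0],   # "fox"
])

# In this simplified view, Q, K, and V all pack the same embeddings.
Q = K = V = embeddings

# QK^T: one raw relevance score for every (query word, key word) pair.
scores = Q @ K.T
print(scores.shape)    # (4, 4)
```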

Self-attention is the Transformer’s secret sauce.

That matrix product can produce some large values, and large values push the softmax into regions where gradients are vanishingly small, which is problematic for the neural network parts of the system. So, we scale the matrix by the inverse square root of the query/key dimension dₖ (i.e., we multiply by 1/√dₖ). We then squash the scaled scores with a softmax, so each row becomes a set of weights between 0 and 1 that sum to 1. After all that, we have an “attention weight” matrix that we apply to V. The result is a matrix of encodings with the attention information baked in.
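Putting the pieces together, here is a minimal sketch of the full scaled dot-product attention computation (scaled_dot_product_attention and softmax are my own helper names), reusing the toy Q, K, and V from the previous snippet.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row-wise max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]                    # dimensionality of the keys/queries
    scores = Q @ K.T / np.sqrt(d_k)      # scaled raw relevance scores
    weights = softmax(scores, axis=-1)   # each row: weights in [0, 1] summing to 1
    return weights @ V                   # weighted sum of the values

# output = scaled_dot_product_attention(Q, K, V)  # encodings with attention baked in
```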

But There’s More

We have not talked about the architecture of the Transformer yet, but doing that in this post would make it quite lengthy. We have set a solid foundation here for understanding Transformers, so why not save the architecture discussion for next time?


Banner image credit to phonlamaiphoto