We have added attention and gates to our neural networks. Now, let’s revolutionize machine learning with memory.
Alex Graves et al. (link) introduced Neural Turing Machines (NTM) to the world in 2014. NTMs are instances of Memory Augmented Neural Networks. Put another way, NTMs are a class of RNNs extended with external working memory to decouple computation and memory. Conceptually, this is not too dissimilar from a computer where the computation unit (i.e., CPU) connects to the memory unit (i.e., RAM) via intermediate circuitry (i.e., northbridge and the relevant buses).
Memory is crucial to the operation of a brain or computer. For example, assume you are processing the sentence “Mary spoke to John”. From that sentence, we can deduce that “Mary” is the subject, “John” is the object, and “spoke to” is the transitive verb. To accurately process that sentence, we need to remember things (e.g., “Mary”). In other words, we need a “working memory” that we can use as scratch paper to write down, keep track of, and apply rules to rapidly created variables. These variables are short-lived pieces of data that need to be stored for future manipulation or referencing. In other words, you need to remember them for the task at hand, but not after that.
Traditional RNNs and their variants (e.g., LSTM, GRU, MGU) can remember things by encoding the data into their states. That allows them to deal with variable-length data sequences, giving them temporal reasoning abilities. However, that process means that the remembered data affect future tasks to which the data are irrelevant. In other words, the memory is non-volatile. The disadvantages are readily apparent. Imagine a computer system where every program run on it behaves differently depending on the programs run previously. That would be a nightmare for users. So, where should temporary data be stored? In a volatile, working memory (e.g., RAM), of course. That is where the external memory unit of the NTM comes into play.
To those familiar with Turing machines and computers, the high-level architecture of NTMs will be mundane. At the heart of the NTM is a neural network controller that interacts with the outside world and the external memory unit. In other words, the controller (e.g., an LSTM) takes in “user” input, conducts I/O operations on the NTM’s memory via read and write heads, and outputs the result. The memory unit is just an N × M matrix, where M is the length of the data vectors (i.e., embeddings) to be stored, and N is the number of memory locations. So far, not too different from a rudimentary computer. Unlike a computer, however, every NTM component is differentiable (i.e., trainable by gradient descent).
A computer references data in memory and memory locations using addresses. NTMs use “blurry” read and write operations that interact with memory elements to varying degrees. A focusing (i.e., attention) mechanism controls the degree of blurriness for a given I/O operation. So, focusing determines the blast radius of that operation. Focus is accomplished via the read and write heads emitting normalized weightings over the memory locations for each I/O operation.
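As a concrete (and simplified) sketch of such a weighting, here is one way to produce a normalized weighting over memory locations with a softmax. Note that this is a stand-in: the actual NTM addressing mechanism combines content-based and location-based focusing, which this post defers to a later discussion. The function name `normalized_weighting` and the toy scores are illustrative, not from the paper.

```python
import numpy as np

def normalized_weighting(scores):
    """Softmax: map arbitrary per-location scores to a weighting
    whose elements lie in [0, 1] and sum to 1."""
    exp = np.exp(scores - scores.max())  # shift by max for numerical stability
    return exp / exp.sum()

# Four memory locations; a higher score means sharper focus on that location
w = normalized_weighting(np.array([1.0, 3.0, 0.5, 2.0]))
```

The sharper the weighting (i.e., the closer it gets to a one-hot vector), the more the blurry operation approaches a conventional, discrete memory address.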
The process of reading from memory is pretty straightforward. Given the N × M memory matrix previously defined, we read the desired data at time t. The read head emits a vector of weightings over the N memory locations (wₜ) that obeys the following constraint:
∑ᵢ wₜ(i) = 1, 0 ≤ wₜ(i) ≤ 1, ∀i
In other words, the weightings are normalized. With wₜ in hand, reading from memory is as simple as:
rₜ ← ∑ᵢ wₜ(i)Mₜ(i)
So, rₜ (i.e., the read data vector) is simply the weighted sum of the memory rows, with the weights supplied by the read head.
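The read operation can be sketched in a few lines of NumPy. The toy 4 × 3 memory matrix and the hand-picked weighting below are illustrative assumptions, not values from the paper:

```python
import numpy as np

N, M = 4, 3  # 4 memory locations, data vectors of length 3
memory = np.arange(N * M, dtype=float).reshape(N, M)  # toy memory matrix
w = np.array([0.1, 0.6, 0.2, 0.1])  # read weighting over the N locations

# r_t = sum_i w_t(i) * M_t(i): a weighted sum of the memory rows
r = w @ memory  # → array([3.9, 4.9, 5.9])
```

Because the weighting is normalized, the read vector is a convex combination of the memory rows: mostly row 1 here, with a little of the others blended in.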
The writing process is more complicated, consisting of two steps: erase and add. These two steps are very intuitive. Before writing something new into given memory locations, what exists at those memory locations must be forgotten. In other words, new data overwrites old data.
Writing data to memory at time t starts with erasing what already exists. The write head emits a vector of weightings over the memory locations (wₜ), just as the read head does during reading. Additionally, the write head emits an erase vector (eₜ), whose elements lie in (0, 1). So the operation is as follows:
Iₜ(i) ← Mₜ₋₁(i)[1-wₜ(i)eₜ]
So the erase vector is scaled by the location’s weighting and then subtracted from a vector of ones. The result details how much of each element of that memory row to retain. The previous memory state Mₜ₋₁ is multiplied element-wise by this retention vector, yielding an intermediate memory state Iₜ.
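The erase step can be sketched as follows. The all-ones memory, the one-hot write weighting, and the erase vector below are toy values chosen for illustration:

```python
import numpy as np

N, M = 4, 3
M_prev = np.ones((N, M))            # previous memory state, M_{t-1}
w = np.array([0.0, 1.0, 0.0, 0.0])  # write weighting, fully focused on row 1
e = np.array([1.0, 1.0, 0.5])       # erase vector, elements in (0, 1)

# I_t(i) = M_{t-1}(i) * (1 - w_t(i) * e_t), applied element-wise to each row
I = M_prev * (1 - np.outer(w, e))
```

With the weighting focused entirely on row 1, the first two elements of that row are fully erased, the third is halved, and every other row is untouched.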
Now on to writing the new data, the simpler of the two steps. The write head produces an add vector (aₜ) and writes it to memory like so:
Mₜ(i) ← Iₜ(i) + wₜ(i)aₜ
So the new memory state is simply the weighted add vector added to the intermediate memory state.
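The add step, sketched with the same toy shapes (an all-zero intermediate state and a one-hot write weighting, both illustrative assumptions):

```python
import numpy as np

N, M = 4, 3
I = np.zeros((N, M))                # intermediate (post-erase) memory state
w = np.array([0.0, 1.0, 0.0, 0.0])  # write weighting, fully focused on row 1
a = np.array([2.0, 3.0, 4.0])       # add vector emitted by the write head

# M_t(i) = I_t(i) + w_t(i) * a_t: the weighted add vector lands in each row
M_new = I + np.outer(w, a)
```

Here the add vector is written entirely into row 1; with a blurrier weighting, fractions of it would be spread across several rows instead.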
A cursory understanding of the memory addressing mechanism is not sufficient to understand NTMs. However, this post has gotten long enough. So, we’ll save a more in-depth examination for another time.