Memory is essential to NTMs. So, understanding NTMs’ memory addressing system is a must.
LSTMs have achieved state-of-the-art performance in many commercially important sequence learning tasks (e.g., machine translation and speech recognition). NTMs have outperformed LSTMs on some sequence learning tasks that require large memory and/or complicated memory access patterns (e.g., graph traversal). Memory is crucial to NTMs. So, understanding how NTMs use and address memory is critical to understanding NTMs. Last time, we looked at the overall architecture of NTMs (link). This time, we’ll continue our exploration of NTMs by looking at its memory addressing system.
Let’s quickly recap how NTMs conduct memory I/O operations. Blurry I/O operations are executed via an attention mechanism using read/write heads. These operations can operate on a single memory location or multiple memory locations to varying degrees. During I/O operations, the read/write heads emit and use a normalized weight vector (wₜ). The read head reads the memory like so:
rₜ ← ∑wₜ(i)Mₜ(i)
The write head does write operations in two steps: erase and add.
Iₜ(i) ← Mₜ₋₁(i)[1-wₜ(i)eₜ]
Mₜ ← Iₜ(i)+wₜ(i)aₜ(i)
NTMs address their memory via wₜ. The weight vector, emitted by the read/write heads, allows NTMs to do two kinds of addressing: content-based and location-based. In content-based addressing, NTMs look for relevant memory contents by comparing each memory entry against the current input, looking for similarity. The equations are:
Content-based addressing is a fuzzy memory matching mechanism.
ŵₜ ← [ exp( 𝛽ₜK[ kₜMₜ(i) ] ) ] ÷ [ ⱼ exp( 𝛽ₜK[ kₜMₜ(j) ] ) ]
K[u,v] = (u · v) ÷ (||u|| · ||v||)
Location-based addressing allows for simple iteration and random-access jumps over memory locations via rotational shifts of the weightings.
For content-based addressing, the read/write heads produce a key vector (kₜ). kₜ is the vector to compare to all memory entries. It is the information we are looking for encoded as an M length vector. So using kₜ, we can look through the memory to find similar entries. 𝛽ₜ is the key strength that determines the level of similarity that kₜ and a given memory entry must have to be considered comparable. In practice, 𝛽ₜ controls the level of focus (i.e., the net size) considered when looking for matches. Cosine similarity is what compares kₜ and a given memory entry. So, content-based addressing is a fuzzy memory matching mechanism.
NTMs also use location-based memory addressing. Location-based addressing allows for simple iteration and random-access jumps over memory locations via rotational shifts of the weightings. For example, assuming that the current weighting focuses the NTM on just one memory location, shifting by one moves the focus to the following memory location. Inversely, a shift of negative one moves the focus to the previous memory location.
Before rotating wₜ, the read/write head emits an interpolation gate value between 0 and 1 (gₜ). gₜ blends wₜ₋₁ and ŵₜ. In other words, it adds contextual information from the previous focus to the current focus, allowing NTMs to account for past knowledge. Conceptually, this is not too dissimilar from gates in LSTMs and GRUs.
wᵢₜ ← gₜŵₜ + (1-gₜ)wₜ₋₁
The process starts with the read/write head emitting a content weighting vector (ŵₜ) for the current timestep. An interpolation gate value combines ŵₜ with the focus weighting vector used in the previous timestep (wₜ₋₁) via the interpolation gate value. If gₜ = 0, then ŵₜ is neglected, and wᵢₜ = wₜ₋₁. Else, if gₜ = 1, then wₜ₋₁ is completely ignored, and wᵢₜ = ŵₜ. In short, this interpolation process controls how much the previous search content should affect the memory addressing. For example, given the task x+y, we only care about adding the values at the memory locations x and y. We do not care about what the values are. In this case, we would want to ignore ŵₜ.
After interpolation comes convolution, where the read/write head emits a shift weighting vector that defines a normalized distribution over the integer shifts (sₜ). For example, given integer shifts between -1 and 1, sₜ has elements corresponding to the degree to which the shifts happen (e.g., -1, 0, and 1). Finally, we combine sₜ with wᵢₜ over all the memory locations.
ẁₜ(i) ← wᵢₜ(j)sₜ(i-j)
In short, the convolution process shifts the focus by shifting all elements in the interpolated weighting (e.g., wᵢₜ(j) ← wᵢₜ(j+ 3)).
NTM’s complete addressing system is a combination of content-based and location-based addressing schemes and can operate in three modes.
The convolution process acts like a convolution blurring filter. So, a side effect of the convolution process is a blurring of focus. After all, convolution is the process of adding each element in a matrix with its local neighbors. To counteract this blurring, NTMs apply a sharpening mechanism using a sharpening parameter (ɣₜ) emitted by the read/write head.
wₜ(i) ← [ ẁₜ(i)ʸᵗ ] ÷ [ ⱼẁₜ(j)ʸᵗ]
So after that whole process, we get the final weighting for the timestep.
NTM’s complete addressing system is a combination of content-based and location-based addressing schemes and can operate in three modes. In the first mode, wₜ is determined solely by the content system. In the second mode, the content system creates a weighting, and the location system shifts it. Thereby, the focus is moved to locations next to, but not on any addresses addressed by the content system. In the third mode, we rotate a weighting from the previous timestep without any input from the content system.
Let’s review the addressing mechanism by walking through it. The read/write head emits kₜ and 𝛽ₜ. Content addressing takes in both kₜ and 𝛽ₜ, along with the memory (Mₜ), and produces ŵₜ. The next three steps (i.e., interpolation, convolutional shift, and sharpening) make up the location system. Interpolation takes in ŵₜ, along with gₜ (emitted by the read/write head) and wₜ₋₁, and produces wᵢₜ. Then, the convolutional shift takes in wᵢₜ, along with sₜ emitted by the read/write head, and generates ẁₜ. Finally, sharpening ingests ẁₜ and ɣₜ and outputs wₜ.