Artificial Intelligence

March 11, 2021

LSTMs fix vanishing and exploding gradients, but they are not the only architectures that do so.

The LSTM is one solution to the short-term memory problem that plagues RNNs. By introducing a separate “signal” (i.e., *Cₜ*) to hold long-term memories, the LSTM cell solves the vanishing and exploding gradient problems and gives the overall network long-term memory. However, the addition of *Cₜ* and the gates that make the long-term memory scheme work carries a price: the increased learning capacity of the LSTM cell comes with a heavier computational burden.

Another approach to fixing vanishing and exploding gradients is the gated recurrent unit (GRU). Like the LSTM, it is an evolution of the RNN. GRUs have lower computational requirements than LSTMs because they use fewer parameters and need fewer calculations. Unlike the LSTM cell, GRU cells do not use a separate *Cₜ* to hold long-term memories. Instead, the hidden state (i.e., *hₜ*) holds the long-term memories. Like the LSTM cell, GRU cells also use gates, but fewer of them: a reset gate and an update gate. The GRU’s update gate is essentially the LSTM’s forget and input gates merged. In other words, the GRU’s update gate decides how much old memory to keep and how much new information to remember. The reset gate decides which memories to forget. Alongside these gates is a line that runs through a *tanh* to calculate *hₜ*.
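To make the parameter savings concrete, here is a rough count in Python (a sketch only: it assumes a vanilla LSTM with four weighted transformations and a GRU with three, and the exact numbers vary by implementation):

```python
# Rough per-cell parameter counts (a sketch; real implementations may
# differ, e.g., some libraries keep two bias vectors per transformation).
def lstm_params(input_size, hidden_size):
    # 4 weighted transformations: input, forget, output gates + candidate
    return 4 * (hidden_size * (hidden_size + input_size) + hidden_size)

def gru_params(input_size, hidden_size):
    # 3 weighted transformations: reset gate, update gate + candidate
    return 3 * (hidden_size * (hidden_size + input_size) + hidden_size)

print(lstm_params(128, 256))  # 394240
print(gru_params(128, 256))   # 295680, i.e., about 25% fewer parameters
```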

The GRU uses fewer gates than the original LSTM.

The mathematical expressions for the GRU cell are:

*rₜ = 𝝈(Wᵣₕhₜ₋₁ + Wᵣₓxₜ + bᵣ)*

*uₜ = 𝝈(Wᵤₕhₜ₋₁ + Wᵤₓxₜ + bᵤ)*

*ɣₜ = tanh(Wₕₕ(rₜ × hₜ₋₁) + Wₕₓxₜ + bₕ)*

*hₜ = (1-uₜ) × hₜ₋₁ + uₜ × ɣₜ*

To understand these equations, let’s walk through how a GRU cell works. *bᵣ*, *bᵤ*, and *bₕ* are bias vectors that are learned along with the weights during training. For the sake of simplicity, we will assume that there are no biases (i.e., *bᵣ = 0*, *bᵤ = 0*, *bₕ = 0*) during our walkthrough. So the equations become:

*rₜ = 𝝈(Wᵣₕhₜ₋₁ + Wᵣₓxₜ)*

*uₜ = 𝝈(Wᵤₕhₜ₋₁ + Wᵤₓxₜ)*

*ɣₜ = tanh(Wₕₕ(rₜ × hₜ₋₁) + Wₕₓxₜ)*

*hₜ = (1-uₜ) × hₜ₋₁ + uₜ × ɣₜ*

So *xₜ* and *hₜ₋₁* come into the GRU cell. The reset gate gets a copy of *xₜ* and *hₜ₋₁* concatenated, weights it with the reset weights (i.e., *Wᵣₕ* and *Wᵣₓ*), and pushes it through a *𝝈*. The result is a vector (i.e., *rₜ*) that determines which memories to forget.

Again, we start with *xₜ* and *hₜ₋₁* coming into the GRU cell and getting concatenated. The update gate receives a copy of that concatenated vector, weights it with the update weights (i.e., *Wᵤₕ* and *Wᵤₓ*), and pushes it through a *𝝈*. The resulting vector (i.e., *uₜ*) determines how much new information to remember. We subtract *uₜ* from a ones vector and pointwise multiply the result with *hₜ₋₁*, which keeps the portion of the old memories that will survive the update.

Going back to the start again, *xₜ* and *hₜ₋₁* come into the GRU cell. This time, *hₜ₋₁* is pointwise multiplied with *rₜ*, attenuating some of the information in *hₜ₋₁*. In other words, some of the memories are gone. A *tanh* takes in the attenuated *hₜ₋₁* concatenated with *xₜ* and weighted with the hidden weights (i.e., *Wₕₕ* and *Wₕₓ*), and it produces a candidate hidden state vector (i.e., *ɣₜ*). We pointwise multiply *ɣₜ* and *uₜ*, then add the result to the post-update-gate *hₜ₋₁*. The cell emits the vector resulting from that summation as both *hₜ* and *yₜ*.
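Here is a minimal NumPy sketch of one GRU time step, following the bias-free equations above (the shapes, the small random weights, and the *gru_step* name are illustrative, not taken from any particular library):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W_rh, W_rx, W_uh, W_ux, W_hh, W_hx):
    """One GRU time step using the bias-free equations above."""
    # Reset gate: decides which old memories to attenuate.
    r_t = sigmoid(W_rh @ h_prev + W_rx @ x_t)
    # Update gate: decides how much of the candidate state to blend in.
    u_t = sigmoid(W_uh @ h_prev + W_ux @ x_t)
    # Candidate hidden state, computed from the reset (attenuated) h_prev.
    gamma_t = np.tanh(W_hh @ (r_t * h_prev) + W_hx @ x_t)
    # New hidden state: a blend of the old state and the candidate.
    h_t = (1.0 - u_t) * h_prev + u_t * gamma_t
    return h_t  # also emitted as y_t

# Example usage with small random weights (hidden size 4, input size 3).
rng = np.random.default_rng(0)
h, x = 4, 3
W_rh, W_uh, W_hh = (rng.standard_normal((h, h)) * 0.1 for _ in range(3))
W_rx, W_ux, W_hx = (rng.standard_normal((h, x)) * 0.1 for _ in range(3))
h_t = gru_step(rng.standard_normal(x), np.zeros(h),
               W_rh, W_rx, W_uh, W_ux, W_hh, W_hx)
print(h_t.shape)  # (4,)
```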

Like LSTMs, GRUs tame exploding and vanishing gradients because the hidden state is updated additively: along the *(1-uₜ) × hₜ₋₁* path, the gradients do not have to pass through weight matrices (or squashing nonlinearities) at every time step during backpropagation.

The MGU is functionally similar to the GRU except that it combines the reset and update gates to form a single forget gate.

The minimal gated unit (MGU) further reduces the number of cell parameters and gates. The equations for the MGU are:

*rₜ = 𝝈(Wᵣₕhₜ₋₁ + Wᵣₓxₜ + bᵣ)*

*ɣₜ = tanh(Wₕₕ(rₜ × hₜ₋₁) + Wₕₓxₜ + bₕ)*

*hₜ = (1-rₜ) × hₜ₋₁ + rₜ × ɣₜ*

So the MGU is functionally similar to the GRU except that it combines the reset and update gates into a single forget gate (written as *rₜ* in the equations above). It still forgets some memories, remembers some new information, and only uses *hₜ*.
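Continuing the NumPy sketch, a minimal MGU time step needs only one gate (again an illustrative sketch; the *mgu_step* name and shapes are assumptions, and biases are kept to match the MGU equations above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mgu_step(x_t, h_prev, W_rh, W_rx, W_hh, W_hx, b_r, b_h):
    """One MGU time step: a single gate plays the forget role."""
    # The single gate decides what to forget and, at the same time,
    # how much new candidate state to let in.
    r_t = sigmoid(W_rh @ h_prev + W_rx @ x_t + b_r)
    # Candidate hidden state computed from the gated h_prev.
    gamma_t = np.tanh(W_hh @ (r_t * h_prev) + W_hx @ x_t + b_h)
    # Blend the old state and the candidate with the same gate.
    h_t = (1.0 - r_t) * h_prev + r_t * gamma_t
    return h_t
```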

Banner image credit to passionart