Gated Recurrent Unit: The Other White Meat

Chuong Ngo, Technical Consultant

LSTMs fix vanishing and exploding gradients, but they are not the only architecture that does.

The LSTM is one solution to the short-term memory problem that plagues RNNs. By introducing a separate “signal” (i.e., Cₜ) to hold long-term memories, the LSTM cell solves the vanishing and exploding gradient problems and gives the overall network long-term memory. However, Cₜ and the gates that make the long-term memory scheme work come at a price: the LSTM cell’s increased learning capacity carries a heavier computational burden.

Gated Recurrent Unit

Another approach to fixing vanishing and exploding gradients is the gated recurrent unit (GRU). Like the LSTM, it is an evolution of the RNN. GRUs have lower computational requirements than LSTMs because they reduce the number of parameters and the calculations needed. Unlike the LSTM cell, GRU cells do not use a separate Cₜ to hold long-term memories. Instead, the hidden state (i.e., hₜ) remembers the long-term memories. Like the LSTM cell, GRU cells also use gates, but fewer of them: a reset gate and an update gate. The GRU’s update gate is essentially the LSTM’s forget and input gates merged. In other words, the GRU’s update gate decides what new information to remember and how much of the old state to keep. The reset gate decides what memories to forget. Alongside these gates is a line that runs through a tanh to calculate the candidate hidden state that is blended into hₜ.

The GRU uses fewer gates than the original LSTM.

The mathematical expressions for the GRU cell are:

rₜ = 𝝈(Wᵣₕhₜ₋₁ + Wᵣₓxₜ + bᵣ)

uₜ = 𝝈(Wᵤₕhₜ₋₁ + Wᵤₓxₜ + bᵤ)

ɣₜ = tanh(Wₕₕ(rₜ × hₜ₋₁) + Wₕₓxₜ + bₕ)

hₜ = (1-uₜ) × hₜ₋₁ + uₜ × ɣₜ
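To make the equations concrete, here is a minimal sketch of a single GRU step in NumPy. It is only illustrative: the function name gru_step and the way the weights and biases are packed into dictionaries are choices made for this example, not part of any particular library.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W, b):
    """One GRU step following the equations above.

    x_t:    input vector at time t
    h_prev: hidden state from time t-1 (hₜ₋₁)
    W, b:   dictionaries of weight matrices and bias vectors (illustrative packing)
    """
    # Reset gate: rₜ = 𝝈(Wᵣₕhₜ₋₁ + Wᵣₓxₜ + bᵣ)
    r_t = sigmoid(W["rh"] @ h_prev + W["rx"] @ x_t + b["r"])
    # Update gate: uₜ = 𝝈(Wᵤₕhₜ₋₁ + Wᵤₓxₜ + bᵤ)
    u_t = sigmoid(W["uh"] @ h_prev + W["ux"] @ x_t + b["u"])
    # Candidate hidden state: ɣₜ = tanh(Wₕₕ(rₜ × hₜ₋₁) + Wₕₓxₜ + bₕ)
    gamma_t = np.tanh(W["hh"] @ (r_t * h_prev) + W["hx"] @ x_t + b["h"])
    # Blend old state and candidate: hₜ = (1-uₜ) × hₜ₋₁ + uₜ × ɣₜ
    h_t = (1.0 - u_t) * h_prev + u_t * gamma_t
    return h_t  # also emitted as yₜ
```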

To understand these equations, let’s walk through how a GRU cell works. bᵣ, bᵤ, and bₕ are bias terms that, like the weight matrices, are learned during training. For the sake of simplicity, we will assume that there are no biases (i.e., bᵣ = 0, bᵤ = 0, bₕ = 0) during our walkthrough. So the equations become:

rₜ = 𝝈(Wᵣₕhₜ₋₁ + Wᵣₓxₜ)

uₜ = 𝝈(Wᵤₕhₜ₋₁ + Wᵤₓxₜ)

ɣₜ = tanh(Wₕₕ(rₜ × hₜ₋₁) + Wₕₓxₜ)

hₜ = (1-uₜ) × hₜ₋₁ + uₜ × ɣₜ


A Walkthrough

So xₜ and hₜ₋₁ come into the GRU cell. The reset gate gets copies of xₜ and hₜ₋₁, weights them with the reset weights (i.e., Wᵣₕ and Wᵣₓ), sums the two products, and pushes the sum through a 𝝈. The result is a vector (i.e., rₜ) that determines which memories to forget.

The reset gate decides what memories to forget.
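For instance (with values invented purely for illustration), if rₜ = [0.9, 0.1], then when rₜ is later pointwise multiplied with hₜ₋₁, about 90% of the first component of the old memory survives while most of the second is forgotten.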

Again, we start with xₜ and hₜ₋₁ coming into the GRU cell. The update gate receives its own copies of them, weights them with the update weights (i.e., Wᵤₕ and Wᵤₓ), and pushes the sum through a 𝝈. The resulting vector (i.e., uₜ) determines how much new information to remember. We subtract uₜ from a ones vector and pointwise multiply the result with hₜ₋₁, scaling down the parts of the old memory that are about to be overwritten.

The update gate decides what new information to remember.
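Again with invented values, if uₜ = [0.8, 0.05], then (1-uₜ) × hₜ₋₁ keeps only 20% of the first component of the old memory but 95% of the second, leaving room for new information to be written mostly into the first.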

Going back to the start again, xₜ and hₜ₋₁ come into the GRU cell. This time, hₜ₋₁ is pointwise multiplied with rₜ, attenuating some of the information in hₜ₋₁. In other words, some of the memories are gone. A tanh takes the attenuated hₜ₋₁ and xₜ, weighted with the hidden weights (i.e., Wₕₕ and Wₕₓ), and produces a candidate hidden state vector (i.e., ɣₜ). We pointwise multiply ɣₜ by uₜ, then add hₜ₋₁ post update gate (i.e., (1-uₜ) × hₜ₋₁) to the result. The cell emits the vector from that summation as both hₜ and yₜ.

The hidden state carries the memory information; there is no separate cell state.
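To see the walkthrough end to end, the short demo below reuses the gru_step sketch from earlier with randomly initialized weights (sizes and values invented purely for illustration) and runs a toy sequence through the cell, feeding each step’s output back in as the next step’s hₜ₋₁.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, inputs = 4, 3  # toy sizes, chosen only for the demo

# Randomly initialized weights and zero biases, just to exercise the cell.
W = {"rh": rng.normal(size=(hidden, hidden)), "rx": rng.normal(size=(hidden, inputs)),
     "uh": rng.normal(size=(hidden, hidden)), "ux": rng.normal(size=(hidden, inputs)),
     "hh": rng.normal(size=(hidden, hidden)), "hx": rng.normal(size=(hidden, inputs))}
b = {"r": np.zeros(hidden), "u": np.zeros(hidden), "h": np.zeros(hidden)}

h_t = np.zeros(hidden)                    # h₀: start with an empty memory
for x_t in rng.normal(size=(5, inputs)):  # a sequence of five input vectors
    h_t = gru_step(x_t, h_t, W, b)        # yₜ and the next hₜ₋₁ are the same vector
    print(h_t)
```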

Like LSTMs, GRUs mitigate exploding and vanishing gradients because the hidden state is updated additively: during backpropagation, the gradient has a path from hₜ back to hₜ₋₁ that does not have to pass through the recurrent weight matrices at every step.
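To see why, in rough terms: differentiating the state update hₜ = (1-uₜ) × hₜ₋₁ + uₜ × ɣₜ with respect to hₜ₋₁, and ignoring the indirect paths through uₜ and ɣₜ, leaves the direct term

∂hₜ/∂hₜ₋₁ ≈ 1-uₜ

so wherever the update gate is close to zero, the gradient flows from hₜ back to hₜ₋₁ almost unchanged, much as it flows along Cₜ in an LSTM.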


Minimal Gated Unit

The minimal gated unit (MGU) further reduces the number of cell parameters and gates. The equations for the MGU are:

rₜ = 𝝈(Wᵣₕhₜ₋₁ + Wᵣₓxₜ + bᵣ)

ɣₜ = tanh(Wₕₕ(rₜ × hₜ₋₁) + Wₕₓxₜ + bₕ)

hₜ = (1-rₜ) × hₜ₋₁ + rₜ × ɣₜ

MGUs merge the two gates of the GRU to form the forget gate.

So the MGU is functionally similar to the GRU except that it combines the GRU’s reset and update gates into a single forget gate (still written rₜ in the equations above). It still forgets some memories, remembers some new information, and only uses hₜ.
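Under the same assumptions as the earlier GRU sketch (the name mgu_step and the weight packing are illustrative, not from any library), a single MGU step could look like the following. Note how one gate both attenuates the old state inside the tanh and controls the final blend.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mgu_step(x_t, h_prev, W, b):
    """One MGU step: a single (forget) gate plays both roles."""
    # Forget gate (written rₜ above): 𝝈(Wᵣₕhₜ₋₁ + Wᵣₓxₜ + bᵣ)
    f_t = sigmoid(W["rh"] @ h_prev + W["rx"] @ x_t + b["r"])
    # Candidate hidden state reuses the same gate: tanh(Wₕₕ(fₜ × hₜ₋₁) + Wₕₓxₜ + bₕ)
    gamma_t = np.tanh(W["hh"] @ (f_t * h_prev) + W["hx"] @ x_t + b["h"])
    # The same gate also blends old state and candidate: hₜ = (1-fₜ) × hₜ₋₁ + fₜ × ɣₜ
    h_t = (1.0 - f_t) * h_prev + f_t * gamma_t
    return h_t
```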


Banner image credit to passionart