Neural Network Compositionality: A Sum of Its Parts

Chuong Ngo
Technical Consultant

Machine learning means that a computer can learn from experience without being explicitly programmed. But is the computer learning the right things?

We have examined how RNNs and other deep learning architectures work and what makes them work. However, we have not asked the big-picture question: do deep learning models work? Of course, we can look at accuracy numbers on different data sets (e.g., test sets). We can also look at how a model performs in the real world. Yet the question remains: does the model work?

Memorization or Generalization

What does it mean to work? If we train a language model, we want that model to pick up on the patterns embedded in our training data and understand them. That way, when the model gets inputs it has never seen before, it can deal with those inputs appropriately. For example, let’s say we train a model to understand natural language directions. In our training data, we tell the model how to go from hotel A to building B, where a conference is. The instructions we give it tell it to use road 1. We also tell it how to reach restaurant C from building B via road 2. So, we explicitly inform the model of those two routes. Ideally, the model should then figure out on its own how to go from hotel A to restaurant C. That may or may not happen, however.

The network is told how to go from the hotel to the conference, and from the conference to the restaurant. Can it figure out how to go from the hotel to the restaurant?

That is because the deep learning model may not be compositional. Compositionality means that the meaning of a complex expression is determined by the meanings of its constituents and by how they are structured. In other words, we can infer the meaning of a sentence from the words in that sentence and how they are organized. Humans can find systematic, compositional solutions, while neural networks may not; they tend to rely on pattern matching and memorization. That limitation means that neural networks need vast amounts of training data and may generalize poorly, because every pattern a network needs to learn has to appear in its training set. Expecting that is unrealistic. If the network learned generalized solutions instead, smaller training sets would suffice.
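To make the distinction concrete, here is a toy sketch of the hotel example in plain Python. It is not a neural network, and the function names (lookup_route, compose_route) are hypothetical, chosen only for illustration. It contrasts a solution that has memorized the two routes with one that composes them.

```python
# The two routes given explicitly in the "training data".
memorized = {
    ("hotel A", "building B"): ["road 1"],
    ("building B", "restaurant C"): ["road 2"],
}

def lookup_route(start, goal):
    """Pure memorization: succeeds only for pairs seen verbatim."""
    return memorized.get((start, goal))

def compose_route(start, goal, visited=None):
    """Compositional solution: chain known legs with a depth-first search."""
    if start == goal:
        return []
    visited = (visited or set()) | {start}
    for (src, dst), roads in memorized.items():
        if src == start and dst not in visited:
            rest = compose_route(dst, goal, visited)
            if rest is not None:
                return roads + rest
    return None  # no chain of known legs reaches the goal

print(lookup_route("hotel A", "restaurant C"))   # None
print(compose_route("hotel A", "restaurant C"))  # ['road 1', 'road 2']
```

The memorizing version fails on the unseen pair even though all the necessary pieces are in its table, which is the behavior we would like a trained model to avoid.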

That raises the question: why don’t neural networks learn compositional solutions? It’s because the networks don’t know what type of solution they are seeking. As an analogy, reading a lengthy academic paper that has no abstract or section headers would be laborious. It would be challenging to figure out what to remember and what to ignore, so you would likely do poorly when tested on that paper. With more information (e.g., the paper’s topic, the field of research, an abstract, section headers), filtering out the noise becomes much easier, and you would perform better when tested.

Attentive Guidance

So, neural networks need more information to guide them in deciding what patterns to learn. Hupkes et al. (link) proposed a guidance mechanism to do just that. They added an extra loss term to the loss function:


$$L_{at} = \frac{1}{T} \sum_{t=1}^{T} \sum_{i=1}^{N} -a_{i,t} \log(c_{i,t})$$

To achieve neural network compositionality, we give the network clues about which inputs to pay attention to.

T is the length of the output, N is the length of the input, aᵢₜ is the target attention for input token i at output step t, and cᵢₜ is the attention for token i at time t as computed by the neural network. Each entry in the training dataset therefore comes with an attentive guidance (AG) vector for every output, which tells the model which parts of the input it should attend to, in a compositional way, when producing that output. The weights are then updated with gradients that account for both the standard loss and the AG loss. As a result, key inputs affect the weights more, and the model learns that not all patterns are equally important.
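As a rough illustration, the AG loss term can be sketched in a few lines of NumPy. This is a minimal sketch and not the authors’ implementation: the function names, the eps guard against log(0), and the ag_weight factor used to combine the two losses are all assumptions made for this example.

```python
import numpy as np

def attentive_guidance_loss(target_attention, predicted_attention, eps=1e-12):
    """AG loss: -(1/T) * sum over t and i of a[t, i] * log(c[t, i]).

    Both arrays have shape (T, N): T output steps, N input tokens.
    eps is an assumed safeguard against log(0), not part of the formula.
    """
    a = np.asarray(target_attention, dtype=float)
    c = np.asarray(predicted_attention, dtype=float)
    if a.shape != c.shape:
        raise ValueError("target and predicted attention must have the same shape")
    T = a.shape[0]
    return float(-(a * np.log(c + eps)).sum() / T)

def total_loss(task_loss, ag_loss, ag_weight=1.0):
    """Standard loss plus the AG term (ag_weight is a hypothetical knob)."""
    return task_loss + ag_weight * ag_loss

# Toy example: 2 output steps, 3 input tokens.
# The target says: attend to token 0 first, then to token 2.
target = np.array([[1.0, 0.0, 0.0],
                   [0.0, 0.0, 1.0]])
good = np.array([[0.9, 0.05, 0.05],   # attention mostly where the target wants it
                 [0.1, 0.1, 0.8]])
bad = np.array([[0.1, 0.8, 0.1],      # attention mostly elsewhere
                [0.6, 0.3, 0.1]])

loss_good = attentive_guidance_loss(target, good)
loss_bad = attentive_guidance_loss(target, bad)
print(loss_good < loss_bad)  # True
```

Attention that matches the guidance yields a smaller AG term, so minimizing the combined loss nudges the model toward attending to the inputs the guidance marks as relevant.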
