
FastAI deeplearning Part 13.1: NLP deep dive, review of RNNs theory

Due to the complexity of RNNs and my impression that the fastai course did not cover sufficient formalities (which is normal for a single notebook), I will start with the chapter on RNNs and LSTMs from the book Deep Learning (Ian Goodfellow et al.) to consolidate the concepts before we jump into the code and case study.

Brief theory on RNNs (source: the Deep Learning book)

Recurrent neural networks are particularly useful for sequence data, such as text or, more generally, any time series. One of the key ideas that allows deep and large recurrent networks to work is that parameters are shared across time steps, which lets the model generalize independently of the exact position of an observation in the sequence.

Our computational graph (the set of computations) will include cycles through time, where the present value affects what we expect to observe in the future. More formally:


s(t) = f(s(t-1); θ) = f(f(s(t-2); θ); θ) = ...    (1)

This can be unfolded over many time steps, creating the computational graph, and external inputs can be added:


h(t) = f(h(t-1), x(t); θ)    (2)


h(t) behaves as a fixed-length summary of the past information. We can represent it as:


h(t) = g(t)(x(t), x(t-1), ..., x(1)) = f(h(t-1), x(t); θ)    (3)

where g(t) can be reproduced by applying the same f over successive steps. This gives a fixed input size and the same transition function f with the same parameters θ at every step, which means we can use the same model for many time steps and sequence lengths. The most common architecture produces an output at each step that enters the loss function, and computing the gradients becomes very expensive because back-propagation has to run all the way back to the beginning of the sequence. To work around that, we can feed the actual previous output as input instead of the model's prediction (called teacher forcing), which allows the loss at different time steps to be computed in parallel.
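To make the shared-parameter recurrence of equation (2) concrete, here is a minimal sketch in PyTorch (which the fastai course builds on) of a single recurrent cell unrolled over time; the layer names and sizes are my own, chosen purely for illustration:

```python
import torch
import torch.nn as nn

class SimpleRNNCell(nn.Module):
    """One shared transition f: h(t) = tanh(W_xh x(t) + W_hh h(t-1) + b)."""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.i2h = nn.Linear(input_size, hidden_size)   # W_xh, shared across all t
        self.h2h = nn.Linear(hidden_size, hidden_size)  # W_hh, shared across all t

    def forward(self, x_t, h_prev):
        return torch.tanh(self.i2h(x_t) + self.h2h(h_prev))

# Unrolling the recurrence over a toy sequence: the same cell (same theta) at every step
cell = SimpleRNNCell(input_size=8, hidden_size=16)
x = torch.randn(10, 8)          # 10 time steps, 8 features each
h = torch.zeros(16)             # initial state h(0)
for t in range(x.shape[0]):
    h = cell(x[t], h)           # with teacher forcing, x[t] would hold the true previous output
```

The important point is that `cell` is the same object, and therefore the same θ, at every iteration of the loop.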

We can extend RNNs in multiple directions: in speech recognition, bidirectional RNNs help capture the context in which the current state sits, and for image processing we can sweep across a 2D image in four directions, each represented as its own state.
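As a reference point, bidirectionality is exposed as a simple flag in PyTorch's recurrent layers; this is a generic illustration, not code from the book or the fastai notebook:

```python
import torch
import torch.nn as nn

# A bidirectional LSTM reads the sequence left-to-right and right-to-left,
# so each position is summarized using both past and future context.
birnn = nn.LSTM(input_size=8, hidden_size=16, bidirectional=True, batch_first=True)
x = torch.randn(4, 10, 8)       # batch of 4 sequences, 10 steps, 8 features
out, _ = birnn(x)
print(out.shape)                # torch.Size([4, 10, 32]): forward and backward states concatenated
```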

Variable-length sequence-to-sequence models: Encoder-Decoder Sequence-to-Sequence Architectures

In cases where the input and output sequences differ in length, one can use encoder-decoder (sequence-to-sequence) architectures, where the encoder processes the input sequence to emit a context C that the decoder then uses to generate an output sequence of the desired length. This has proven very successful, but fitting all the important information into a context C of fixed size is a challenge. Attention mechanisms have proven to overcome that challenge very effectively.
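Putting the basic (pre-attention) idea into code, here is a hypothetical minimal encoder-decoder sketch in PyTorch: the encoder compresses the input into a fixed-size context C (its final hidden state) and the decoder is initialized with C to produce a sequence of a different length. All names and dimensions are my own assumptions, not the book's.

```python
import torch
import torch.nn as nn

class EncoderDecoder(nn.Module):
    """Encoder emits a fixed-size context C; the decoder is initialized with C."""
    def __init__(self, in_dim, hid_dim, out_dim):
        super().__init__()
        self.encoder = nn.GRU(in_dim, hid_dim, batch_first=True)
        self.decoder = nn.GRU(out_dim, hid_dim, batch_first=True)
        self.head = nn.Linear(hid_dim, out_dim)

    def forward(self, src, tgt):
        _, context = self.encoder(src)       # context C: the encoder's final hidden state
        out, _ = self.decoder(tgt, context)  # decoder conditioned on C
        return self.head(out)                # one prediction per target step

model = EncoderDecoder(in_dim=8, hid_dim=32, out_dim=20)
src = torch.randn(2, 15, 8)                  # input sequence of length 15
tgt = torch.randn(2, 7, 20)                  # target sequence of length 7 (teacher-forced inputs)
print(model(src, tgt).shape)                 # torch.Size([2, 7, 20])
```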

Deep Recurrent Networks

The unfolded graph of an RNN applies what is essentially a shallow transformation at each step. There is plenty of evidence that decomposing the state into more than one layer (therefore adding depth) is beneficial for model performance. There is a cost to that, as optimization becomes more challenging (we are multiplying small or large gradients too many times), which calls for skip connections and other memory mechanisms.
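For reference, stacking recurrent layers is a one-argument change in PyTorch; a small illustration of a two-layer (deep) RNN, with sizes chosen arbitrarily for this sketch:

```python
import torch
import torch.nn as nn

# Two stacked LSTM layers: the second layer transforms the first layer's
# hidden state at every time step, deepening the per-step computation.
deep_rnn = nn.LSTM(input_size=8, hidden_size=16, num_layers=2, batch_first=True)
x = torch.randn(4, 10, 8)
out, (h_n, c_n) = deep_rnn(x)
print(h_n.shape)                # torch.Size([2, 4, 16]): one final hidden state per layer
```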

Recursive Neural Networks

Recursive neural networks treat the computational graph as a tree instead of a sequence, with successful applications in natural language processing and computer vision. They have a clear advantage over recurrent nets in terms of the required depth of the network, as they need far fewer sequential operations. What remains an open question is how to best structure the tree.
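Here is a schematic sketch of the recursive idea, assuming a binary tree is already given: the same composition function is applied at every node, so depth grows with the height of the tree rather than the length of the sequence. The class and field names are mine, for illustration only.

```python
import torch
import torch.nn as nn

class TreeNode:
    def __init__(self, left=None, right=None, embedding=None):
        self.left, self.right, self.embedding = left, right, embedding

class RecursiveNet(nn.Module):
    """Applies one shared composition function at every node of a binary tree."""
    def __init__(self, dim):
        super().__init__()
        self.compose = nn.Linear(2 * dim, dim)

    def forward(self, node):
        if node.embedding is not None:                      # leaf: e.g. a word embedding
            return node.embedding
        left, right = self.forward(node.left), self.forward(node.right)
        return torch.tanh(self.compose(torch.cat([left, right], dim=-1)))

dim = 8
net = RecursiveNet(dim)
leaves = [TreeNode(embedding=torch.randn(dim)) for _ in range(3)]
root = TreeNode(left=TreeNode(left=leaves[0], right=leaves[1]), right=leaves[2])
print(net(root).shape)           # torch.Size([8]): one vector for the whole tree
```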

Long-term dependencies and memory

The fact that RNNs combine inputs that are far apart in the sequence means we are likely to get exploding or vanishing gradients. To avoid that, we need to carefully design the scaling of the activations so this does not happen in very deep (unrolled) RNNs. But we cannot simply constrain the values to a region where the gradients remain stable, as that would prevent sufficiently robust learning. Another set of tricks is therefore required to overcome the problem that, fairly soon, the model barely captures the importance of long-term observations.
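Before looking at those tricks, here is a tiny numerical illustration of the issue itself: unrolling an RNN repeatedly multiplies by the same recurrent weights, which raises their eigenvalues to the power of the sequence length, so values slightly below 1 collapse and values slightly above 1 blow up (a toy example of my own, not from the book):

```python
import numpy as np

# Unrolling an RNN multiplies by the same recurrent matrix W at every step,
# so its eigenvalues get raised to the power of the sequence length.
for eigenvalue in (0.9, 1.1):
    W = np.diag([eigenvalue, eigenvalue])
    v = np.ones(2)
    for _ in range(100):                 # 100 "time steps"
        v = W @ v
    print(eigenvalue, v[0])              # 0.9 -> ~2.7e-05 (vanishes), 1.1 -> ~1.4e+04 (explodes)
```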

Long Short-Term Memory and other gated RNNs

Gated RNNs keep some gradients roughly constant over an arbitrary length, so that gradients neither vanish nor explode. Instead of manually deciding at which moment the state has to be cleared, we want the neural network to learn to do it.

The main idea of LSTMs is to add a cell state alongside the hidden units that stores information, with sigmoid gates (plus a tanh transformation for new content) deciding whether to erase it or keep it. Particularly important is the forget gate, which affects the cell state but not directly the output. This is key to ensuring long-term dependencies remain available to the network via the state. For those familiar with GRUs (gated recurrent units), the main difference is that a single gating unit controls both the forgetting and the update of the state. For practitioners it is important to know that, at the time the book was written, there was no clear winner, and both are expected to perform similarly well. So far we have shown methods that ensure long-term dependencies are properly used by the model, but the risk of exploding/vanishing gradients is not fully solved. Let's explore the available options next.
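Before moving on, to make the gating concrete, here is a schematic LSTM cell written out by hand, following the standard formulation rather than the book's exact notation; it is a sketch, not production code:

```python
import torch
import torch.nn as nn

class LSTMCellSketch(nn.Module):
    """Schematic LSTM cell: sigmoid gates in [0, 1] scale the cell state."""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        # One linear map producing the pre-activations of all four components
        self.gates = nn.Linear(input_size + hidden_size, 4 * hidden_size)

    def forward(self, x_t, h_prev, c_prev):
        z = self.gates(torch.cat([x_t, h_prev], dim=-1))
        f, i, o, g = z.chunk(4, dim=-1)
        f = torch.sigmoid(f)                    # forget gate: keep or erase c_prev
        i = torch.sigmoid(i)                    # input gate: how much new content to write
        o = torch.sigmoid(o)                    # output gate: what to expose as h_t
        g = torch.tanh(g)                       # candidate new content
        c_t = f * c_prev + i * g                # cell state: the "memory" path
        h_t = o * torch.tanh(c_t)               # hidden state / output at step t
        return h_t, c_t
```

The cell state c_t is updated almost additively, which is what lets gradients flow across many steps without vanishing.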

Optimization methods to avoid exploding/vanishing gradients

When the same nonlinear operation is applied many times, the derivatives tend to become either very large or very small, producing a loss surface that looks like a cliff. That creates an optimization problem: with too high a learning rate, an update can jump over the optimum and throw away all the learning that has already happened.

To avoid that, we can clip the gradient, either element-wise on the mini-batch gradient or by capping the norm of the gradient just before updating the parameters. Both approaches keep the update close to the original direction. We do introduce a bias with respect to traditional stochastic gradient descent, since we are no longer averaging the true step over all mini-batches, but in practice the step should not deviate too much from the "descent" direction.
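In PyTorch, norm clipping amounts to one extra line between the backward pass and the optimizer step; a minimal, self-contained fragment (with a dummy loss used only to produce gradients):

```python
import torch
import torch.nn as nn

model = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(4, 10, 8)
out, _ = model(x)
loss = out.pow(2).mean()                 # dummy loss, only to produce gradients

loss.backward()
# Rescale the full gradient vector if its norm exceeds 1.0, keeping its direction.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
optimizer.zero_grad()
```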

So far we have an empirically effective trick for exploding gradients, called clipping. Now let's discuss a remedy for vanishing gradients. One option is a regularizer that encourages the gradient to preserve past information as it is back-propagated, by constraining the magnitude of the gradient vector. The problem is that this approach does not work as well for very long sequences, making LSTMs the best approach for remembering over long stretches of data (I will go deeper into regularization and skip connections in the next post; this is greatly explained in the fastai course).

Explicit memory

Common deep neural networks store implicit knowledge, but they lack the working memory required to achieve certain goals and tasks. One of our best attempts so far is the use of memory cells in LSTMs and GRUs, since they store past information and manage forgetting in a way that ordinary units trained with SGD do not. Not covered in this post or the book chapter, but a topic I will dive into in the future, is attention, which allows memory to be used effectively with a different focus for different tasks, enabling very rapid changes in the importance assigned to the input weights and state.

Conclusion

This is a fairly high-level summary of RNNs, enough to grasp what we are going to implement in the next post, which aims to show how to build a language model from scratch using RNNs and LSTM cells.









