Due to the complexity of RNNs, and since my impression is that the fastai notebook did not cover enough of the formalities (which is normal for a single notebook), I will start with the chapter on RNNs and LSTMs from the book Deep Learning (Ian Goodfellow et al.) to consolidate the concepts before we jump into the code and case study.
Brief theory on RNNs (from the book)
Recurrent neural networks are particularly useful for sequence data, such as text or, more generally, any time-series data. One of the key ideas that allows deep and large recurrent neural networks is that parameters are shared across time steps, which enables generalization independent of the exact position of an observation in the sequence.
Our computational graph (the set of computations) will include cycles through time, where the present value affects what is expected in the future. More formally:
$$s^{(t)} = f(s^{(t-1)}; \theta) = f(f(s^{(t-2)}; \theta); \theta) = \dots \quad (1)$$
This can be unfolded over many time steps to create the computational graph, and external inputs can be added:
$$h^{(t)} = f(h^{(t-1)}, x^{(t)}; \theta) \quad (2)$$
$h^{(t)}$ acts as a fixed-length summary of the past information. We can represent it as:
$$h^{(t)} = g^{(t)}(x^{(t)}, x^{(t-1)}, \dots, x^{(1)}) = f(h^{(t-1)}, x^{(t)}; \theta) \quad (3)$$
Here $g^{(t)}$ unrolls repeated applications of $f$, allowing a fixed input size and the same transition function $f$ with the same parameters at every step. That means we can use one model for many time steps and for sequences of different lengths. The most common architecture produces an output at every step that enters the loss function, and computing the gradients becomes expensive because backpropagation has to run all the way back to the beginning of the sequence. To decouple the time steps during training, we can feed the ground-truth output from the previous step as input instead of the model's own prediction (a technique called teacher forcing), which allows the loss at different time steps to be evaluated in parallel.
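To make the parameter sharing and teacher forcing ideas concrete, here is a minimal sketch in PyTorch (the library fastai builds on). The dimensions, the `nn.RNNCell` choice, and the random data are all illustrative assumptions on my part, not code from the book or the course:

```python
import torch
import torch.nn as nn

# A single nn.RNNCell reused at every time step: this is the parameter
# sharing idea, the same transition function f(.; theta) for all t.
vocab_size, hidden_size, seq_len, batch = 50, 64, 12, 8  # arbitrary sizes

embed = nn.Embedding(vocab_size, hidden_size)
cell = nn.RNNCell(hidden_size, hidden_size)
head = nn.Linear(hidden_size, vocab_size)
loss_fn = nn.CrossEntropyLoss()

tokens = torch.randint(0, vocab_size, (batch, seq_len))  # fake sequences
h = torch.zeros(batch, hidden_size)                      # initial state h(0)

loss = 0.0
for t in range(seq_len - 1):
    # Teacher forcing: feed the ground-truth token at step t as input,
    # rather than the model's own prediction from the previous step.
    x_t = embed(tokens[:, t])
    h = cell(x_t, h)                  # h(t) = f(h(t-1), x(t); theta)
    logits = head(h)                  # an output at every time step...
    loss = loss + loss_fn(logits, tokens[:, t + 1])  # ...enters the loss
loss.backward()                       # backpropagation through time
```

One caveat: in a network whose recurrence flows only through the outputs, teacher forcing fully decouples the steps and the losses can be evaluated in parallel; the cell above still carries hidden-to-hidden recurrence, so it is unrolled sequentially.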
We can extend RNNs in multiple directions. In speech recognition, for example, bidirectional RNNs help capture the context in which the current state sits; for image processing we can sweep a 2D image from each of its four corners, each sweep maintaining its own state.
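As a quick hedged sketch of the bidirectional idea, PyTorch exposes it as a flag on its recurrent modules; the shapes below are arbitrary assumptions:

```python
import torch
import torch.nn as nn

# A bidirectional LSTM: one pass reads the sequence left to right, another
# right to left, each with its own parameters. Their hidden states are
# concatenated, so the feature dimension of the output doubles.
birnn = nn.LSTM(input_size=32, hidden_size=64, bidirectional=True,
                batch_first=True)
x = torch.randn(8, 20, 32)        # (batch, time, features), fake data
out, (h_n, c_n) = birnn(x)
print(out.shape)                  # torch.Size([8, 20, 128]), i.e. 2 * 64
```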
Variable-length sequence-to-sequence models: Encoder-Decoder architectures
Deep Recurrent Networks
Recursive Neural Networks
Long-term dependencies and memory
Long Short-Term Memory and other Gated RNNs
Gated RNNs create paths through time along which gradients can stay close to constant for arbitrary lengths, preventing them from vanishing or exploding. Instead of manually deciding at which moment the state has to be cleared, we want the neural network to learn when to do it.
The main idea of LSTMs is to create a cell state outside the hidden units that stores information, with sigmoid gates (combined with tanh transformations) deciding what gets erased and what is kept. Particularly important is the forget gate, which affects the cell state but not the output directly; this is key for a good model, since it keeps long-term dependencies available to the network through the state. For those familiar with GRUs (gated recurrent units), the main difference is that a single update gate handles both the forgetting and the updating of the state. For practitioners, it is important to know that at the time the book was written there was no clear winner, and both are expected to perform similarly well.

So far we have shown methods that help the model make proper use of long-term dependencies, but the risk of exploding/vanishing gradients is not fully solved. Let's explore the available options next.
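A small sketch makes the contrast tangible: both architectures ship as standard PyTorch modules, and because the GRU merges forgetting and updating into a single gate, it has three gate blocks where the LSTM has four. The sizes are arbitrary assumptions:

```python
import torch.nn as nn

# Same input/hidden sizes for both, so only the gating structure differs.
lstm = nn.LSTM(input_size=32, hidden_size=64, batch_first=True)
gru = nn.GRU(input_size=32, hidden_size=64, batch_first=True)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(lstm))  # 25088: four gate blocks (input, forget, cell, output)
print(count(gru))   # 18816: three gate blocks, roughly 3/4 the parameters
```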
Optimization methods to avoid exploding/vanishing gradients
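One standard option in this family is gradient clipping: rescale the gradient whenever its norm exceeds a threshold, so an exploding gradient cannot blow up the update. Here is a minimal sketch with PyTorch's `clip_grad_norm_`; the model, data, and loss are placeholders I made up for illustration:

```python
import torch
import torch.nn as nn

model = nn.LSTM(input_size=32, hidden_size=64, batch_first=True)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(8, 20, 32)            # fake batch of sequences
out, _ = model(x)
loss = out.pow(2).mean()              # dummy loss, just to get gradients

opt.zero_grad()
loss.backward()
# Rescale all gradients in place so their global L2 norm is at most 1.0.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
opt.step()
```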
Explicit memory
Common deep neural networks store implicit knowledge, but they lack the working memory required to achieve certain goals and tasks. One of our best attempts so far is the use of memory cells in LSTMs and GRUs, since they store past information and manage forgetting in a way that plain weights trained by SGD cannot. Attention, which is not covered in this post or in the book chapter but which I will deep-dive into in the future, allows memory to be used effectively with a different focus for different tasks, enabling very rapid changes in the importance given to the input weights and state.