
FastAI Deep Learning Journey Part 11: Text embeddings for almost any downstream task, or universal language fine tuning!


 

Natural language processing is booming, with thousands of applications reaching a formidable level of accuracy. In this post we will show how fastai ensures that your language model (and its text embeddings) is sufficiently fine-tuned for your downstream task, without needing labelled data for that fine-tuning step. This is going to be more computationally heavy than usual, so when running the nlp repo, make sure you have a GPU and something else to do for the couple of hours... let's get started!

Universal Language Model Fine-tuning, the paper

Transfer learning in computer vision is well established: almost everyone now starts with a model pretrained on ImageNet (ResNet-18, 34...), and as we have seen in our previous posts, we only have to fine-tune the last layers of the network to get state-of-the-art results: fine tuning in computer vision

Fine-tuning in NLP is less well known, and most practitioners either train a relatively mediocre model from scratch, due to corpus and compute limitations, or stick with a model pretrained on Wikipedia, which is probably too broad for the specifics of the application.

The following paper explains how to fine-tune a pretrained language model on our own corpus in order to get state-of-the-art results on our downstream tasks. The idea is that we should update the weights not only for new tokens specific to our application, but also those inherited from the pretrained model. The discriminative learning rates and the gradual unfreezing are key to explaining why the model neither overfits nor forgets essential information from the pretrained model. Contrary to most prior research, the paper claims that with fewer than 1,000 labels the classification task reaches state-of-the-art results.

The approach is largely general, as it:
  • can learn from most general-domain language models
  • fine-tunes for different tasks, independently of document size and downstream task type
  • uses a single architecture
  • requires no custom feature engineering
  • does not require additional domain data
Despite the availability of transformers and attention models, this post is based on an AWD-LSTM model. With more work on the developer side, more sophisticated models can be used, but at the expense of more coding, computation and inherent complexity.

It is important to note, as the paper suggests, that there is hardly ever a general-domain corpus of documents with the same distribution as our domain-specific task. It is therefore worthwhile to fine-tune the language model on the type of text our downstream task will be based on.

On top of the specific fine-tuning on our documents, the model leverages some tricks to ensure proper adaptation to the new text without significant loss of what has been learned so far. This is achieved with different learning rates at different layers (as they capture different levels of abstraction and require different amounts of adjustment) and also a learning rate schedule within training, pushing it high at the beginning and then slowly decaying its value to ensure convergence.
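In fastai both ideas map onto a single call: fit_one_cycle implements the rise-then-decay schedule, and passing a slice as the learning rate spreads discriminative rates across the layer groups. A minimal sketch, assuming `learn` is any fastai Learner split into layer groups (the rate values here are placeholders, not the paper's):

# one-cycle schedule: the learning rate ramps up, then decays over the epoch
# slice(1e-4, 1e-2): the earliest layer group trains at 1e-4, the last at 1e-2,
# and the groups in between are spaced geometrically
learn.fit_one_cycle(1, slice(1e-4, 1e-2))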

For the downstream task it takes the pooled states of the last hidden layer. To keep learning fast while preserving what the pretrained layers already know, the layers are gradually unfrozen from the last ones to the first ones. For classification, each document is processed in fixed-length chunks, and at the beginning of each chunk the model is initialised with the final state of the previous one.
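The pooling described in the paper concatenates the final hidden state with a max- and a mean-pool over all time steps. A minimal PyTorch sketch of that idea (the function name and shapes are mine, not fastai's API):

import torch

def concat_pool(outputs, last_hidden):
    # outputs: (batch, seq_len, hidden_dim) activations of the top LSTM layer
    # last_hidden: (batch, hidden_dim) hidden state at the final time step
    max_pool = outputs.max(dim=1).values
    mean_pool = outputs.mean(dim=1)
    # the classifier head then sees a (batch, 3 * hidden_dim) representation
    return torch.cat([last_hidden, max_pool, mean_pool], dim=1)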

The model achieves state-of-the-art results and beats many models with much more complex architectures, such as attention models or models fine-tuned on millions of documents. Here a relatively simple LSTM with dropout is used. This approach is better than directly using the documents for the downstream task, and than other fine-tuning methods requiring much more data and compute, making it the best relatively light solution for moderate data and compute budgets.


Why fastai should be used at least as a baseline

In order to properly understand the complexity of encoding text well, we should keep in mind the steps needed for a successful implementation and the challenges involved.

  1. Tokenization (breaking the text into tokens): we need to split text into words in a meaningful way, considering punctuation, upper case and the like.
  2. Numericalization (transforming the tokens into tensors): we need to map each token to an integer index so our model can process it. We need to save our vocabulary, decide what to do with infrequent words... (a toy sketch of steps 1 and 2 follows this list).
  3. Create batches of text for our language model: while our documents vary in length, our model expects a fixed-length stream of data. For the language model we also need to create the target variable on the fly, which is simply the next word/token.
  4. Load a pretrained model: we need to load the weights of a model trained on a fairly generic and large corpus of text. It is challenging to decide what to use and to ensure it will easily allow fine-tuning on our vocabulary.
  5. Fine-tune the language model on our own corpus. We need to pick an architecture that can leverage the pretrained weights while learning embeddings for words not present in the pretrained corpus.
  6. Fine-tune a model for our downstream task. We need to preserve what was learned by the pretrained model while moderately fine-tuning for our own task; overfitting and catastrophic forgetting have to be kept in check.
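Here is a toy sketch of steps 1 and 2 using fastai's low-level pieces (the example strings are made up; in the real pipeline the DataBlock shown later handles all of this for you):

from fastai.text.all import *

texts = ["I LOVED this book!!!", "Not worth reading."]   # made-up examples

tok = Tokenizer(WordTokenizer())   # spaCy-backed word tokenizer plus fastai's rules
toks = [tok(t) for t in texts]     # adds special tokens such as xxbos, xxmaj, xxup, xxrep

num = Numericalize(min_freq=1)     # min_freq=1 only because this toy vocabulary is tiny
num.setup(toks)                    # builds the vocabulary
ids = [num(t) for t in toks]       # tensors of token indices, ready for batching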

Why I would use fastai first:

  • (1) For tokenization...
    • fastai uses state-of-the-art tokenizers behind a stable API, meaning it will get better without us changing our code. In our case, the spaCy tokenizer.
    • It adds proven tokenization tricks such as marking upper case, exclamations and repetitions in very powerful ways.
  • (2) For the batch creation...
    • To prepare the text streams for fine-tuning the language model, fastai parallelises the work and creates fixed-length streams of text, and thanks to its beginning-of-document token the model can identify when a new text is coming.
    • On the dataset used here, 25k streams of ~120 tokens took less than a minute on Google Colab Pro with one GPU.
[ We achieve 1 and 2 with just this code]

from fastai.text.all import *

# language-model data: 'book_desc' holds the raw text; is_lm=True builds next-token targets
lm_block = DataBlock(
    blocks=TextBlock.from_df('book_desc', is_lm=True),
    get_x=ColReader('text'))  # TextBlock.from_df stores the processed text in a 'text' column

dls_lm = lm_block.dataloaders(df, bs=64)
dls_lm.show_batch(max_n=2)
  • (3) Loading a pretrained model...
    • Here we use an RNN pretrained on Wikipedia, but there is also the option to load other pretrained models, like the ones from HuggingFace and Transformers in general: Transformers Tutorial
  • (4) Fine-tuning the pretrained model on our vocab...
    • Here we first update the weights of the new tokens and later train the whole language model, with proper dropout usage to avoid forgetting the pretrained model and the exploding gradients typical of RNNs

[we achieve 3 and 4 with this code]


learn = language_model_learner(
    dls_lm, AWD_LSTM, drop_mult=0.3,
    metrics=[accuracy, Perplexity()]).to_fp16()

# train only the new head first, then unfreeze and train the whole model
learn.fit_one_cycle(1, 2e-2)
learn.unfreeze()
learn.fit_one_cycle(10, 2e-3)
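As a quick sanity check that the fine-tuned language model has picked up the style of the corpus, we can let it generate a continuation (the prompt below is invented):

TEXT = "This book tells the story of"   # made-up prompt
print(learn.predict(TEXT, n_words=40, temperature=0.75))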


  • (5) Fine-tune our classifier using the fine-tuned encoder embeddings
    • Here we simply save the encoder from the previous learner object; it will provide the representations for our downstream task
    • We create a new text dataloader that in this case builds batches of varying dimension, as the documents will certainly differ in size
    • We use gradual unfreezing and discriminative learning rates to achieve state-of-the-art results
[ we achieve 5 with this code]

learn.save_encoder('/content/gdrive/MyDrive/NLP/finetuned')

# classifier data: same vocab as the language model, book ratings as labels
imdb_clas = DataBlock(
    blocks=(TextBlock.from_df('book_desc', seq_len=72, vocab=dls_lm.vocab), CategoryBlock),
    get_x=ColReader('text'), get_y=ColReader('book_rating'))
dls_clas = imdb_clas.dataloaders(df, bs=64)

learn = text_classifier_learner(dls_clas, AWD_LSTM, drop_mult=0.5,
                                metrics=accuracy).to_fp16()
learn = learn.load_encoder('/content/gdrive/MyDrive/NLP/finetuned')

# gradual unfreezing with discriminative learning rates
learn.fit_one_cycle(1, 0.001)
learn.freeze_to(-2)
learn.fit_one_cycle(1, slice(0.001/(2.6**4), 0.001))
learn.unfreeze()
learn.fit_one_cycle(1, slice(0.001/(2.6**4), 0.001))
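Once trained, the classifier can be queried directly; predict returns the decoded label, its index and the class probabilities (the description below is invented):

desc = "A sweeping fantasy saga about two rival kingdoms."   # made-up description
pred_class, pred_idx, probs = learn.predict(desc)
print(pred_class, probs[pred_idx])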


Concluding Remarks


Fastai provides a simple, fast and close-to-state-of-the-art way to get text embeddings for our downstream tasks. In their paper, the authors showed that fine-tuning on the specific corpus of the application, together with sensible use of learning rates and dropout, is essential to make the best of the pretrained and fine-tuned language model.

In less than 3 hours of training we were able to classify successful books with 95% accuracy based on their product descriptions, using fine-tuned text embeddings. This is certainly going to be useful for regression and recommender-system problems, as the encoder can be extracted and reused for other downstream tasks, or we can simply leverage distances between the embeddings it generates.
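As an illustration of the "distances between embeddings" idea, assuming you have already extracted one embedding vector per document from the saved encoder (the extraction step itself is not shown here), cosine similarity gives a cheap way to surface related books:

import torch
import torch.nn.functional as F

def most_similar(emb, query_idx, k=5):
    # emb: (n_docs, dim) tensor of document embeddings, one row per book description
    sims = F.cosine_similarity(emb[query_idx].unsqueeze(0), emb, dim=1)
    sims[query_idx] = -1.0          # exclude the query itself
    return sims.topk(k).indices     # indices of the k most similar documents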




