
FastAI Deep Learning Part 13.2: NLP deep dive, implementing a poetry model from scratch


After getting a theoretical understanding of RNNs and LSTMs in the previous post, we review the foundations of RNNs, their challenges, and the contributions of LSTMs: foundational work for the Transformer and attention models that can be used through the Hugging Face tutorials.

Although one probably would not, and should not as shown in this post, train RNNs or LSTMs from scratch if the goal is to get close to state-of-the-art results, understanding the innovations on the way from a shallow RNN to an LSTM with regularization and dropout makes the architecture of attention models easier to understand.

In this post we implement from scratch a poetry model trained on the complete works of the great Spanish poet García Lorca, killed by the fascists at the start of the civil war. He is, without any doubt, one of the most remarkable poets in Spanish history and in the 20th century; this is my humble homage to him.

First part: loading the text

I found an open-source compilation of all his poems in a single PDF, which I was able to process easily:

from tika import parser  # pip install tika

# extract the raw text from the PDF of Lorca's complete works
raw = parser.from_file('federico-garcc3ada-lorca-obras-completas.pdf')
txt = raw['content']

To keep the notebook manageable in terms of length, I barely use any tokenization; I only remove symbols such as '[' or '(' so the model does not learn to predict them.
I mainly create sequences of 40 word tokens to predict the next word. One can try the 3-token sequence length from the fastai course, but on this data it is just too short to learn anything (at least on the shallow model without state).
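
As an illustration, here is a minimal sketch of this preprocessing step; the variable names and the naive whitespace tokenization are my assumptions, not the exact notebook code:

import re
import torch

# strip the few symbols we remove, then split on whitespace
tokens = re.sub(r"[\[\]()]", " ", txt).split()
vocab = sorted(set(tokens))
word2idx = {w: i for i, w in enumerate(vocab)}
ids = [word2idx[w] for w in tokens]

sl = 40  # sequence length: 40 word tokens per example
# pair each sequence with the same sequence shifted by one token, so the
# model learns to predict the next word (and later, a word per position)
seqs = [(torch.tensor(ids[i:i+sl]), torch.tensor(ids[i+1:i+sl+1]))
        for i in range(0, len(ids) - sl - 1, sl)]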

Second part: modular addition of enhancements over the shallow RNN


As the fastai course nicely does in chapter 12, I first create a baseline model that keeps no state, so we can add the following improvements one by one and measure their value against the baseline accuracy (a sketch of the resulting model follows the list):
  1. To manage memory... we keep a hidden state, so the state information is not lost each time a new sequence is passed
  2. To avoid backpropagating to the start... we detach the state after going through all the tokens of a batch, so gradients are not computed all the way back to the first sequence, speeding up training (truncated backpropagation through time)
  3. To increase the feedback to the model... we add more signal by predicting every next word, so the model does not use only the 41st token to compute the loss and gradients
  4. To improve performance we go deeper... we stack more RNN layers so more complex functions can be modelled
  5. To avoid forgetting and vanishing gradients... we add LSTM cells (with forget and input gates) so nothing is forgotten incorrectly and gradients do not vanish
  6. To avoid overfitting... we add dropout and regularization
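
Putting the pieces together, here is a minimal sketch of what the final model looks like in plain PyTorch; layer sizes, the dropout probability, and the class name are illustrative assumptions, not the exact notebook code:

import torch
import torch.nn as nn

class PoetryLSTM(nn.Module):
    # stateful multi-layer LSTM with dropout (improvements 1-6 above)
    def __init__(self, vocab_sz, n_hidden=64, n_layers=2, p_drop=0.3):
        super().__init__()
        self.emb = nn.Embedding(vocab_sz, n_hidden)
        self.lstm = nn.LSTM(n_hidden, n_hidden, n_layers,
                            batch_first=True, dropout=p_drop)
        self.drop = nn.Dropout(p_drop)
        self.out = nn.Linear(n_hidden, vocab_sz)
        self.h = None  # hidden state kept across batches (improvement 1)

    def forward(self, x):
        res, h = self.lstm(self.emb(x), self.h)
        # detach so gradients stop at the batch boundary (improvement 2)
        self.h = tuple(t.detach() for t in h)
        # one prediction per input position, not just the last (improvement 3)
        return self.out(self.drop(res))

    def reset(self):  # call between epochs or when the batch size changes
        self.h = None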

That takes us from a mere 4% accuracy to 12% accuracy, which is quite good considering the complexity of a poetry model. Many language models reach around 25% accuracy with a very vast corpus, and prose and poetry are different: there is less room for patterns and strict structure in the latter. There is one last thing we try that almost doubles the performance of the model.

Third part: transfer learning on the same architecture, the so-called AWD-LSTM

As is normal in my work experience, and also in the fastAI course, we should use a pretrained model when one is available and fine-tune its weights on the corpus we are interested in. Following the Universal Language Model Fine-tuning framework (ULMFiT), we use a model pretrained on Wikipedia and fine-tune it on the Lorca corpus.
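
With fastai this takes only a few lines. Here is a minimal sketch, assuming the plain-text Lorca corpus lives in path; note that fastai's bundled AWD_LSTM weights come from English Wikipedia, so a Spanish run would pass its own Wikipedia weights via pretrained_fnames:

from fastai.text.all import *

# build language-model DataLoaders from the text files in `path`
dls = TextDataLoaders.from_folder(path, is_lm=True, valid_pct=0.1)

# AWD-LSTM with pretrained weights; pass pretrained_fnames=... to load
# Spanish-Wikipedia weights instead of the default English ones
learn = language_model_learner(dls, AWD_LSTM, drop_mult=0.3, metrics=accuracy)
learn.fine_tune(10, 2e-3)  # epoch count and learning rate are illustrative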

After doing that, we reach 23% accuracy on next-token prediction (the baseline model got 5%), and the model is able to create poetry such as the following:

TEXT = 'Viva el mar, las estrellas. Los horizontes infinitos y la lluvia suave.'
N_WORDS = 40
N_SENTENCES = 5
# sample five 40-word continuations of the prompt
preds = [learn.predict(TEXT, N_WORDS, temperature=0.75)
         for _ in range(N_SENTENCES)]
preds

'Viva el mar , las estrellas . Los horizontes infinitos y la lluvia suave . Tierra desgarradas como camelias grises . Por donde viven los rebaños sin raíces . Tierra , tierra sin ruido . Tierra de tierra . Tierra sin tejados . Tierra de tierra . Tierra'

'Viva el mar , las estrellas . Los horizontes infinitos y la lluvia suave . Agua y espuma , y ceniza de ceniza . Hoy musgo sobre las ondas estrellas . Agua sobre las olas . Agua estancada a los álamos . Fuente de la seda negra . Agua'


To get even better performance, one could do the following:
  • Use GANs for text generation
  • Use Hugging Face pretrained Transformers with the same ULMFiT framework on Lorca's corpus
I leave that to the reader, since the goal of the post, to share and test RNNs and LSTMs on a hard dataset, is more than achieved!
