
FastAI Deep Learning Part 13.2: NLP deep dive, implementing a poetry model from scratch


After getting a theoretical understanding of RNNs and LSTMs in the previous post, we review the foundations of RNNs, their challenges, and the contributions of LSTMs: foundational work for the Transformer and attention models that can be used through the Hugging Face tutorials.

Although one probably would not, and should not as shown in this post, train RNNs or LSTMs from scratch if the goal is to get close to state-of-the-art results, understanding the innovations on the way from a shallow RNN to an LSTM with regularization and dropout makes the architecture of attention models easier to understand.

In this post we implement from scratch a poetry model trained on the complete works of the great Spanish poet García Lorca, killed by the fascists at the start of the civil war. He is, without any doubt, one of the most remarkable poets in Spanish history and in the 20th century; this is my humble homage to him.

First part: loading the text

I found an open-source compilation of all his poems in a single PDF, which I was able to process easily:

from tika import parser  # pip install tika

# extract the raw text from the PDF of Lorca's complete works
raw = parser.from_file('federico-garcc3ada-lorca-obras-completas.pdf')
txt = raw['content']

To keep the notebook manageable in terms of length, I barely use any tokenization; I only remove symbols such as '[' or '(' so the model does not learn to predict them.
I mainly create sequences of 40 word tokens to predict the next word. One can try the 3-token sequence length from the fastai course, but on this data it is just too short to learn anything (at least on the shallow model without state).
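
As an illustration, here is a minimal sketch of this preprocessing step; the variable names and the naive whitespace tokenization are my assumptions, not the exact notebook code:

import re
import torch

# strip the few symbols we remove, then split on whitespace
tokens = re.sub(r"[\[\]()]", " ", txt).split()
vocab = sorted(set(tokens))
word2idx = {w: i for i, w in enumerate(vocab)}
ids = [word2idx[w] for w in tokens]

sl = 40  # sequence length: 40 word tokens per example
# pair each sequence with the same sequence shifted by one token, so the
# model learns to predict the next word (and later, a word per position)
seqs = [(torch.tensor(ids[i:i+sl]), torch.tensor(ids[i+1:i+sl+1]))
        for i in range(0, len(ids) - sl - 1, sl)]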

Second part: modular addition of enhancements over the shallow RNN


As the fastai course nicely does in chapter 12, I first create a baseline model that keeps no state, so we can add the following improvements one by one and measure their value against the baseline accuracy (a sketch of the resulting model follows the list):
  1. To manage memory... we keep a hidden state, so the state information is not lost each time a new sequence is passed
  2. To avoid backpropagating to the start... we detach the state after going through all the tokens of a batch, so gradients are not computed all the way back to the first sequence, speeding up training (truncated backpropagation through time)
  3. To increase the feedback to the model... we add more signal by predicting every next word, so the model does not use only the 41st token to compute the loss and gradients
  4. To improve performance we go deeper... we stack more RNN layers so more complex functions can be modelled
  5. To avoid forgetting and vanishing gradients... we add LSTM cells (with forget and input gates) so nothing is forgotten incorrectly and gradients do not vanish
  6. To avoid overfitting... we add dropout and regularization
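
Putting the pieces together, here is a minimal sketch of what the final model looks like in plain PyTorch; layer sizes, the dropout probability, and the class name are illustrative assumptions, not the exact notebook code:

import torch
import torch.nn as nn

class PoetryLSTM(nn.Module):
    # stateful multi-layer LSTM with dropout (improvements 1-6 above)
    def __init__(self, vocab_sz, n_hidden=64, n_layers=2, p_drop=0.3):
        super().__init__()
        self.emb = nn.Embedding(vocab_sz, n_hidden)
        self.lstm = nn.LSTM(n_hidden, n_hidden, n_layers,
                            batch_first=True, dropout=p_drop)
        self.drop = nn.Dropout(p_drop)
        self.out = nn.Linear(n_hidden, vocab_sz)
        self.h = None  # hidden state kept across batches (improvement 1)

    def forward(self, x):
        res, h = self.lstm(self.emb(x), self.h)
        # detach so gradients stop at the batch boundary (improvement 2)
        self.h = tuple(t.detach() for t in h)
        # one prediction per input position, not just the last (improvement 3)
        return self.out(self.drop(res))

    def reset(self):  # call between epochs or when the batch size changes
        self.h = None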

That takes us from a mere 4% accuracy to 12% accuracy, which is quite good considering the complexity of a poetry model. Many language models reach around 25% accuracy with a very vast corpus, and prose and poetry are different: there is less room for patterns and strict structure in the latter. There is one last thing we try that almost doubles the performance of the model.

Third part: transfer learning on the same architecture, the so-called AWD-LSTM

As is normal in my work experience, and also in the fastAI course, we should use a pretrained model when one is available and fine-tune its weights on the corpus we are interested in. Following the Universal Language Model Fine-tuning framework (ULMFiT), we use a model pretrained on Wikipedia and fine-tune it on the Lorca corpus.
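
With fastai this takes only a few lines. Here is a minimal sketch, assuming the plain-text Lorca corpus lives in path; note that fastai's bundled AWD_LSTM weights come from English Wikipedia, so a Spanish run would pass its own Wikipedia weights via pretrained_fnames:

from fastai.text.all import *

# build language-model DataLoaders from the text files in `path`
dls = TextDataLoaders.from_folder(path, is_lm=True, valid_pct=0.1)

# AWD-LSTM with pretrained weights; pass pretrained_fnames=... to load
# Spanish-Wikipedia weights instead of the default English ones
learn = language_model_learner(dls, AWD_LSTM, drop_mult=0.3, metrics=accuracy)
learn.fine_tune(10, 2e-3)  # epoch count and learning rate are illustrative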

After doing that, we reach 23% accuracy on next-token prediction (the baseline model got 5%), and the model is able to create poetry such as the following:

TEXT = 'Viva el mar, las estrellas. Los horizontes infinitos y la lluvia suave.'
N_WORDS = 40
N_SENTENCES = 5
# sample five 40-word continuations of the prompt
preds = [learn.predict(TEXT, N_WORDS, temperature=0.75)
         for _ in range(N_SENTENCES)]
preds

'Viva el mar , las estrellas . Los horizontes infinitos y la lluvia suave . Tierra desgarradas como camelias grises . Por donde viven los rebaños sin raíces . Tierra , tierra sin ruido . Tierra de tierra . Tierra sin tejados . Tierra de tierra . Tierra'

'Viva el mar , las estrellas . Los horizontes infinitos y la lluvia suave . Agua y espuma , y ceniza de ceniza . Hoy musgo sobre las ondas estrellas . Agua sobre las olas . Agua estancada a los álamos . Fuente de la seda negra . Agua'


To get even better performance, one could do the following:
  • Use GANs for text generation
  • Use Hugging Face pretrained Transformers with the same ULMFiT framework on Lorca's corpus
I leave that to the reader, since the goal of the post, to share and test RNNs and LSTMs on a hard dataset, is more than achieved!
