
FastAI Deep Learning Part 16: Optimization Process Deep Dive





In the following post, we will briefly cover the different optimizers to understand what fastai uses to achieve such good performance in a very small number of epochs.

From Vanilla SGD to RAdam

During stochastic gradient descent, we update our weights considering the learning rate and the gradient of the loss function with respect to our weights, as follows:

new_weight = weight - lr * weight.grad
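This update can be sketched numerically in plain Python, on a toy quadratic loss L(w) = w**2 whose gradient is 2*w (the loss and values are illustrative, not from the notebook):

```python
# One plain SGD step: new_weight = weight - lr * grad
def sgd_step(weight, grad, lr):
    return weight - lr * grad

w = 2.0
grad = 2 * w                     # gradient of w**2 at w = 2.0 is 4.0
w = sgd_step(w, grad, lr=0.1)    # 2.0 - 0.1 * 4.0 = 1.6
```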


We know from our visualizations of the loss function that there are many local minima, and
it is possible for our optimization algorithm to end up in one of those. To avoid that we can use momentum,
which simply ensures that the direction of the update considers the direction of previous iterations.

weight.avg = beta * weight.avg + (1-beta) * weight.grad
new_weight = weight - lr * weight.avg
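The momentum update above can be traced step by step in plain Python; with a constant gradient, the running average slowly grows towards the true gradient (the values of beta, lr, and the gradient are illustrative):

```python
# Momentum: blend the current gradient into a running average, then step.
def momentum_step(weight, avg, grad, lr=0.1, beta=0.9):
    avg = beta * avg + (1 - beta) * grad
    return weight - lr * avg, avg

w, avg = 1.0, 0.0
for grad in [4.0, 4.0, 4.0]:        # a constant gradient of 4.0
    w, avg = momentum_step(w, avg, grad)
# avg after each step: 0.4, 0.76, 1.084 -- creeping towards 4.0
```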


It is common to use fairly high momentum, i.e. a beta of around 0.9. Note that fit_one_cycle in fastai
will change the amount of momentum during training. The next improvement we can use is to adjust the learning rate
for each parameter, stimulating deactivated weights and pulling down volatile weights as follows.


w.square_avg = alpha * w.square_avg + (1-alpha) * (w.grad ** 2)
new_w = w - lr * w.grad / math.sqrt(w.square_avg + eps)
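A numeric sketch of this update shows the normalizing effect: on the first step, a weight with a huge gradient and a weight with a tiny gradient move by roughly the same amount (the lr, alpha, and gradient values are illustrative):

```python
import math

# RMSProp: scale the step by the root of a running average of squared gradients.
def rmsprop_step(w, sq_avg, grad, lr=0.01, alpha=0.99, eps=1e-8):
    sq_avg = alpha * sq_avg + (1 - alpha) * grad ** 2
    w = w - lr * grad / math.sqrt(sq_avg + eps)
    return w, sq_avg

w_big, _ = rmsprop_step(1.0, 0.0, grad=100.0)   # step ~0.1
w_small, _ = rmsprop_step(1.0, 0.0, grad=0.01)  # step ~0.0995, nearly the same
```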

Note that in this way, weights with large gradients are scaled down (eps is added for numerical stability). This method is
called RMSProp, and if we add the "unbiased mean" it is called Adam, which is the default
in fastai and looks like this:

w.avg = beta1 * w.avg + (1-beta1) * w.grad
unbias_avg = w.avg / (1 - (beta1**(i+1)))
w.sqr_avg = beta2 * w.sqr_avg + (1-beta2) * (w.grad ** 2)
new_w = w - lr * unbias_avg / math.sqrt(w.sqr_avg + eps)
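The bias correction matters most on the first steps, when the running average is still anchored at zero. A plain-Python sketch mirroring the simplified pseudocode above (the full Adam also bias-corrects the second moment; the lr, beta, and gradient values here are illustrative):

```python
import math

# One Adam step with a bias-corrected ("unbiased") first moment.
def adam_step(w, avg, sqr_avg, grad, i,
              lr=0.001, beta1=0.9, beta2=0.99, eps=1e-8):
    avg = beta1 * avg + (1 - beta1) * grad
    unbias_avg = avg / (1 - beta1 ** (i + 1))   # undo the pull towards zero
    sqr_avg = beta2 * sqr_avg + (1 - beta2) * grad ** 2
    w = w - lr * unbias_avg / math.sqrt(sqr_avg + eps)
    return w, avg, sqr_avg

# On the first step (i=0), the correction recovers the raw gradient, so
# the step size stays close to lr regardless of the gradient's magnitude:
w, avg, sqr = adam_step(1.0, 0.0, 0.0, grad=5.0, i=0)   # w ~ 0.99
```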


According to this paper, we need to use a progressive schedule of learning rates if we want to make
sure the initial learning rate does not significantly affect the loss after many epochs. The way to do
that, which is implemented similarly in fastai's fit_one_cycle method, allows us to get fairly good
results for different initial rates by following this warmup and cool-down strategy. In the paper,
this is called Rectified Adam, or RAdam.
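A simplified sketch of such a warmup-then-cool-down schedule (the function name, the warmup fraction, and the cosine cool-down are illustrative assumptions; fastai's actual one-cycle schedule also varies the momentum):

```python
import math

def one_cycle_lr(step, total_steps, max_lr=0.01, pct_warmup=0.25, start_div=25.0):
    """Linear warmup from max_lr/start_div up to max_lr, then cosine cool-down."""
    warmup_steps = int(total_steps * pct_warmup)
    if step < warmup_steps:
        frac = step / max(1, warmup_steps)
        return max_lr / start_div + frac * (max_lr - max_lr / start_div)
    frac = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return max_lr * 0.5 * (1 + math.cos(math.pi * frac))

lrs = [one_cycle_lr(s, 100) for s in range(100)]
# lrs rises to max_lr around step 25, then decays towards ~0
```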

Implementation notes

We can implement an optimizer ourselves or call fit_one_cycle to leverage a fairly robust
optimization workflow. In this notebook, I show how fit_one_cycle is way better than our vanilla, from-scratch
SGD. Note that the notebook contains experiments with all the optimizers and a brief
summary of the callbacks used.

This is our vanilla SGD for one step:

def sgd_cb(p, lr, **kwargs): p.data.add_(-lr, p.grad.data)

We pass it as a callback to fastai's Optimizer class using partial:

opt_func = partial(Optimizer, cbs=[sgd_cb])

And we are ready to start training:

learn = get_learner(opt_func=opt_func)
learn.fit(3, 0.03)

We can compare this with fit_one_cycle:

learn.fit_one_cycle(3, 0.03)

We got ~30% accuracy with vanilla SGD and ~60% with fit_one_cycle, with 3 epochs and a base learning
rate of 0.03.

Conclusion


Optimizers are important to achieve high performance in the least amount of time.
To achieve that, we need to add to vanilla SGD: momentum, per-parameter learning
rates, and learning rates and momentum that vary during the training cycle,
to ensure we make the best out of each batch. We explained what is inside
fit_one_cycle (a RAdam on steroids!) to encourage you to use it instead of the standard
fit or other partial optimization algorithms that do not cover everything mentioned
here. Based on my experience, I encourage development effort to go into
customizations (MoCo, GANs, Pix2Pix) rather than into reinventing the learning rate policy
over and over.

With this, I finish the series I started this year! Here is what I will
write about next:

1) What can ML/AI/Data Science do to make fashion more sustainable?
2) How to bring Deep Learning models into production in the context of MLOps?
3) How to leverage Transformers and Stable Diffusion using Hugging Face?

That will certainly keep me busy! Stay sustainable and never forget to learn!



