In the following post, we will briefly cover the different optimizers to understand what fastai uses to achieve such good performance in a very small number of epochs.

## From Vanilla SGD to RAdam

In stochastic gradient descent, we update our weights using the learning rate and the gradient of the loss function with respect to the weights, as follows:

```python
new_weight = weight - lr * weight.grad
```

We know from our visualizations of the loss function that there are many local minima, and it is possible for our optimization algorithm to end up in one of them. To avoid that we can use momentum, which simply ensures that the direction of the update also takes into account the direction of previous iterations:

```python
weight.avg = beta * weight.avg + (1 - beta) * weight.grad
new_weight = weight - lr * weight.avg
```

It is common to use fairly high momentum, i.e. a beta of around 0.9. Note that `fit_one_cycle` in fastai will change the amount of momentum during training.

The next improvement we can use is to adjust the learning rate
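To make the two update rules above concrete, here is a minimal, runnable sketch in plain PyTorch. This is not fastai's actual optimizer code; the function names (`sgd_step`, `momentum_step`) and the toy quadratic loss are just illustrative assumptions.

```python
import torch

def sgd_step(weight, lr):
    # Vanilla SGD: new_weight = weight - lr * weight.grad
    with torch.no_grad():
        weight -= lr * weight.grad

def momentum_step(weight, avg, lr, beta=0.9):
    # SGD with momentum: keep a moving average of the gradients.
    with torch.no_grad():
        # weight.avg = beta * weight.avg + (1 - beta) * weight.grad
        avg.mul_(beta).add_(weight.grad, alpha=1 - beta)
        # new_weight = weight - lr * weight.avg
        weight -= lr * avg

# Toy example: minimize a simple quadratic loss with momentum.
w = torch.tensor([3.0], requires_grad=True)
avg = torch.zeros_like(w)
for _ in range(20):
    loss = (w ** 2).sum()
    loss.backward()
    momentum_step(w, avg, lr=0.1, beta=0.9)
    w.grad.zero_()

print(w)  # w has moved toward 0, the minimum of the loss
```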