
FastAI Deep Learning Journey Part 8: How to implement SOTA models for image classification with fastAI



In this post, we will explore some of the tricks to achieve the best possible performance for a given data set and architecture. We will also learn some additional important steps for data augmentation at training and test time, which are essential for generalizing well.

Our analysis shows that we go from 81% accuracy with the baseline model to 92% after 50 epochs, by applying a combination of mixup augmentation (random linear combinations of images), label smoothing, and augmentation at test time.

Using a baseline data set

When testing an idea or trying to understand a certain technique, it is good to test it on common data sets such as ImageNet, CIFAR, or MNIST. The first one is the most common data set used for benchmarks, and rightly so, given the variety and complexity of the image repository (more than 1M images and 1,000 classes with very different sizes, rotations, objects...). The problem with that data set is its size: training would likely take several days on common hardware.

The fastai folks created a smaller version of the data set, called Imagenette, as a subset of the actual ImageNet, to allow researchers and practitioners to test their ideas with little compute and fast feedback. To their surprise, what works well on Imagenette also tends to work well on ImageNet, making it very useful for finding generalizable breakthroughs. Let's explore some of the tricks and key ideas to improve our model and get state-of-the-art performance on Imagenette.

Running a model from scratch

We will use the fastai API to download Imagenette and train a model from scratch, using presizing (resize each item to 460 px, then augment batches down to 224 px):

from fastai.vision.all import *
path = untar_data(URLs.IMAGENETTE)

dblock = DataBlock(blocks=(ImageBlock(), CategoryBlock()),
                   get_items=get_image_files,
                   get_y=parent_label,
                   item_tfms=Resize(460),
                   batch_tfms=aug_transforms(size=224, min_scale=0.75))
dls = dblock.dataloaders(path, bs=64)

model = xresnet50(n_out=dls.c)
learn = Learner(dls, model, loss_func=CrossEntropyLossFlat(), metrics=accuracy)
learn.fit_one_cycle(5, 3e-3)




Our baseline model already gives us 81.6% accuracy after about 7.5 minutes of training. Let's see what happens if we normalize all the images.

The importance of normalization

When training a model, it helps to have all images normalized, that is, with mean 0 and standard deviation 1 across each channel. Most commonly the RGB values range from 0 to 255, and many images have values from 0 to 1, so they will need to be normalized.
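
To make this concrete, here is a quick sketch (assuming the dls object from the baseline above) to inspect the per-channel statistics of a batch before any normalization is applied:

xb, yb = dls.one_batch()
# one mean and one standard deviation per RGB channel
xb.mean(dim=[0, 2, 3]), xb.std(dim=[0, 2, 3])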

When training a model from scratch, it is important to store the mean and standard deviation of the actual images, so that the same normalization is applied during inference or transfer learning.

As the data set we are working with is a subset of ImageNet, we will normalize using the mean and standard deviation of the actual ImageNet, which are available in the fastai library.

def get_dls(bs, size):
    dblock = DataBlock(blocks=(ImageBlock, CategoryBlock),
                       get_items=get_image_files,
                       get_y=parent_label,
                       item_tfms=Resize(460),
                       batch_tfms=[*aug_transforms(size=size, min_scale=0.75),
                                   Normalize.from_stats(*imagenet_stats)])
    return dblock.dataloaders(path, bs=bs)

This is not going to improve the performance of the model trained from scratch, as our data set is a subset of ImageNet, but it will be key when using pretrained models or data with very sparse pixel value distributions.

If we normalize but do not pass the ImageNet statistics, fastai will compute the mean and standard deviation from a batch of the data it is normalizing.
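
As a small sketch of that behaviour, adding Normalize() with no explicit statistics to the batch transforms is enough; fastai then takes the statistics from one batch of our own data:

batch_tfms = [*aug_transforms(size=224, min_scale=0.75),
              Normalize()]   # no stats passed: computed from a batch of our data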

Let's see an important trick that will certainly improve our training time and performance.

Progressive Resizing

This is a rather simple but powerful idea. Imagine you are training a model from scratch; in order to help the model quickly learn the essential features of an image, you start by passing rather small versions of the images, fit for some epochs, and later increase the image size before training for another set of epochs.

I had never thought about it, but it makes a lot of sense:
  • A CNN first learns simple features such as corners and edges, which are independent of the image size
  • We can effectively apply transfer learning from the smaller images to the bigger ones
  • Different image sizes act as a kind of augmentation, making the model less prone to overfitting compared to training for many epochs on identical data
  • For a fixed number of epochs, 4 epochs on 128x128 plus 4 epochs on 224x224 is certainly faster than 8 epochs on 224x224

dls = get_dls(128, 128)
learn = Learner(dls, xresnet50(n_out=dls.c), loss_func=CrossEntropyLossFlat(), 
                metrics=accuracy)
learn.fit_one_cycle(4, 3e-3)
learn.dls = get_dls(64, 224)
learn.fine_tune(5, 1e-3)

With this approach we get 85.4%, which is almost 4 percentage points more than the baseline.

Test Time Augmentation

On the validation set, fastai performs center cropping, missing out the edges. Any augmentation we perform is likely either to miss some important information due to cropping, or to distort information due to squishing. One solution is to make predictions over several different transformed versions of each image and average them to get a more robust result.

This approach can result in dramatic improvements in accuracy. The good thing is that training time does not increase; only validation or inference time grows, in proportion to the number of augmentations performed. By default, fastai uses the unaugmented center-crop image plus four randomly augmented versions.


preds,targs = learn.tta()
accuracy(preds, targs).item()

With this approach we jumped to 86.5% accuracy, which is a full point above progressive resizing and about five points above the baseline. Not too bad for zero extra training cost!
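
As a side note, learn.tta() also accepts an n argument (4 by default) to control how many augmented versions are averaged; more augmentations cost proportionally more inference time. A hedged sketch:

preds, targs = learn.tta(n=8)
accuracy(preds, targs).item()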

How Mixup helps with generalization and training, at the expense of more epochs

When training a model from scratch, this technique can be very useful. In general, it is hard to decide a priori which data augmentations to use, unless you have a lot of domain knowledge about the data. Mixup is less data dependent and provides a generic approach for augmenting our data.

Mixup performs the following:
  • Picks two images at random
  • Merges them into one using a weighted average with random weights
  • Adjusts the label values by the same randomly selected weights

Let's take two images and apply a 50% weight on each:



Note that the labels change too: we will have 0.5 for the church label, 0.5 for the gas station label, and 0 for the rest of the classes.
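
To make the arithmetic concrete, here is a conceptual sketch of mixing a single pair of images and labels by hand; fastai's MixUp callback handles this internally and draws the weight from a Beta distribution rather than fixing it at 0.5:

import torch

xb, yb = dls.one_batch()              # a batch of images and integer labels
x1, x2, y1, y2 = xb[0], xb[1], yb[0], yb[1]
t = 0.5                               # mixing weight (random in practice)

x_mix = t * x1 + (1 - t) * x2         # blended image

# blended soft labels: 0.5 for each of the two original classes
y1_oh = torch.nn.functional.one_hot(y1, dls.c).float()
y2_oh = torch.nn.functional.one_hot(y2, dls.c).float()
y_mix = t * y1_oh + (1 - t) * y2_oh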

In order to use mixup, we set the cbs parameter to MixUp as follows:
model = xresnet50(n_out=dls.c)
learn = Learner(dls, model, loss_func=CrossEntropyLossFlat(), 
                metrics=accuracy, cbs=MixUp())
learn.fit_one_cycle(50, 3e-3)
Given that the task is now much harder, we need more epochs than before; training for 50 epochs turns out to be sufficient to beat our previous improvements, reaching 91.1%.
There is something very good about this approach: our labels are no longer exactly 0 or 1, and therefore we do not need to push our predictions to the extremes, which gives a much better behaved loss, as long as the linear combination mixes images from different classes.
Another way to smooth our labels is to apply label smoothing directly, which we will see in the next section.

Label Smoothing

Here again, the idea is rather simple but clever. Since one-hot encoding of the classes pushes our algorithm to be extremely confident about what the image does and does not contain, we tend to overfit: the gradients stay large even when we already predict the right class with 95%, or even 99%, confidence.

To avoid that, we slightly reduce the target for the correct class and raise each 0 to epsilon/N, where N is the number of classes. With three classes and the third one correct, we go from [0, 0, 1] to [ε/3, ε/3, 1 - ε + ε/3]. In that way we no longer have hard 0s and 1s.
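
As a quick numeric check (using fastai's default epsilon of 0.1 and Imagenette's 10 classes), the smoothed target for a single example looks like this:

import torch

eps, n_classes, correct = 0.1, 10, 2

target = torch.full((n_classes,), eps / n_classes)   # 0.01 for every wrong class
target[correct] = 1 - eps + eps / n_classes          # 0.91 for the correct class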

Using label smoothing only requires a change in the loss function used in the learner:

model = xresnet50(n_out=dls.c)
learn = Learner(dls, model, loss_func=LabelSmoothingCrossEntropy(), 
                metrics=accuracy)
learn.fit_one_cycle(50, 3e-3)
As with mixup, the problem becomes harder to learn and it is recommended to train for more epochs. We do so and get 92% after 50 epochs.

Applying it all together


In this post, we have learned how to set up computer vision benchmarks that do not require long feedback loops. Imagenette is a smaller version of ImageNet whose results tend to generalize well. One should find smaller data sets for one's own problems to get fairly fast feedback on ideas and experiments.

We have also seen that models trained from scratch should use normalized images, and that we should store the mean and standard deviation used, so that others can apply the same normalization when doing transfer learning from our models.

To improve the performance and generalization of our models, we can use mixup to randomly average two images and their labels, or smooth the labels directly so the loss function does not push the model to be overconfident. Both cases normally take more epochs to train than transfer learning or training from scratch without those techniques in place.

Last but not least, applying augmentations at test time boosts the performance of our model with no extra training cost, only inference time proportional to the number of augmentations used. We have shown that applying these methods takes us from a baseline model at 81% accuracy to over 92% on Imagenette.
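
For completeness, here is a hedged sketch of what combining these pieces could look like; the hyperparameters are illustrative rather than tuned:

dls = get_dls(64, 224)                        # normalized dataloaders from above
learn = Learner(dls, xresnet50(n_out=dls.c),
                loss_func=LabelSmoothingCrossEntropy(),
                metrics=accuracy, cbs=MixUp())
learn.fit_one_cycle(50, 3e-3)

preds, targs = learn.tta()                    # test-time augmentation at the end
accuracy(preds, targs).item()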


The repo can be found here: github link






