In this lesson, we dive deep into the process of deep learning. Note that I covered deep convolutional networks in a previous post: https://alanfortunysicart.blogspot.com/2021/12/convolutional-neural-networks-part-1.html.
Here I provide the unique perspective from fastai. It is very interesting to see how different people approach deep learning in different ways. I find the following section very useful for understanding the importance of tensors in ensuring high-speed training on large data sets. Even if you are familiar with deep learning, you will probably learn something about programming or deep learning you did not know before (at least I did).
As with the other posts, let me start with the key lessons from the section.
- Before jumping into deep learning methods, it is important to define a baseline: a simple yet reasonably good model that gives us an initial value of our metric(s), the KPIs we care about for our application.
- Image data, and high-dimensional data in general, is still processed as a bunch of numbers, which can be computed very fast using tensors (the NumPy-array equivalent for large data).
- Tensors are a key data type for speeding up data management and training, since they let us leverage GPU computing.
- Fastai is rooted in PyTorch, which is a mid-level framework for implementing deep learning. Instead of going to PyTorch from scratch, I suggest starting with what fastai can do, and only dropping to the lower-level PyTorch code if more flexibility is required (my expectation is that >90% of problems can be approached with fastai).
- For less proficient programmers, list comprehensions should be learned and used instead of for loops.
- Using broadcasting, one can apply a function or a comparison of any type between a single low-dimensional tensor and many higher-dimensional tensors. It is very fast and the code looks awesome.
- One can approximate almost any function with a couple of layers and one nonlinear activation function. BUT deeper neural networks, with more layers and fewer nodes per layer, perform better on most metrics and in terms of compute.
- The digit data set is key to show the value added by deep learning, and also to understand the complexity of computing gradient descent from scratch, or even of using PyTorch instead of fastai.
Let's go through the key code snippets and comments from the following notebook below:
Creating a baseline - an example of a handwritten digit classifier
Using Stochastic Gradient Descent
- The first one is the difference between the loss function and the metric. Note that to calculate the gradients and ensure learning actually happens (gradients not equal to zero), we need bias terms and, most importantly, a loss function that is differentiable. In many cases where accuracy is the metric we care about, the loss function should be in line with the metric but must be well behaved. In our case we use the following:
- The second important topic is the use of batches and DataLoaders. Batches let us speed up gradient calculation, since we do not have to go through all the data for every update, while remaining robust. The batch size is key to making sure our GPUs do not run out of memory; I have seen many applications use a batch size of around 200 images as a reference (for a standard resolution of 224x224). DataLoaders are a great way to create the batches used during training.
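In the notebook this loss is mnist_loss. As a reference, a sketch along the lines of the book's version: the sigmoid squashes raw predictions into (0, 1), so the distance to the 0/1 target is smooth and differentiable.

import torch

def mnist_loss(predictions, targets):
    predictions = predictions.sigmoid()          # squash raw outputs into (0, 1)
    return torch.where(targets == 1,             # distance to the correct 0/1 label
                       1 - predictions, predictions).mean()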
- The first example is a list pairing numbers and letters.
- For this example, note that we list and zip our train and label data sets:
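A minimal sketch of that idea, using PyTorch's DataLoader (fastai's behaves similarly) on illustrative data:

from torch.utils.data import DataLoader

data   = [1, 2, 3, 4, 5, 6]
labels = list('abcdef')
dset   = list(zip(data, labels))            # [(1, 'a'), (2, 'b'), ...]
dl     = DataLoader(dset, batch_size=2)     # group (x, y) pairs into mini-batches
for xb, yb in dl:
    print(xb, yb)                           # prints each mini-batch of inputs and labels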
- The third one is the PyTorch implementation of gradient descent; the following code snippet contains the most important PyTorch functions to achieve it:
def train_epoch(model, lr, params):
    for xb, yb in dl:                 # loop over mini-batches from the DataLoader
        calc_grad(xb, yb, model)      # forward pass + backward pass: fills p.grad
        for p in params:
            p.data -= p.grad * lr     # gradient descent step
            p.grad.zero_()            # reset gradients for the next batch
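For reference, the calc_grad helper used above looks roughly like this in the book: a forward pass, the loss, then backpropagation.

def calc_grad(xb, yb, model):
    preds = model(xb)                 # forward pass on the mini-batch
    loss = mnist_loss(preds, yb)      # compute the differentiable loss
    loss.backward()                   # backpropagate: fills p.grad for each parameter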
Understanding what fastai Learner does:
learn = Learner(dls, nn.Linear(28*28,1), opt_func=SGD, loss_func=mnist_loss, metrics=batch_accuracy)
learn.fit(10, lr=lr)
With those two lines of code we can perform the very same calculations we did with PyTorch, but in a much more compact way. What we need to define is:
- our DataLoader object dls
- our model, in this case a simple linear model with 28*28 input pixels and one layer
- our optimization engine, in our case basic stochastic gradient descent
- our loss function, in our case one that is close to the key metric for the MNIST data set
- our key metric, in our case the accuracy
Once we put all this into the Learner (we can leverage predefined functions for each), we only have to define for how many epochs (in the example, 10) and with which learning rate (in our runs, 0.00001) we want to train.
Adding NonLinearity and using Transfer Learning
The previous model gave us 95% accuracy, higher than the baseline but not the 99% that state-of-the-art models can reach. We can add nonlinearity and an additional layer to our model to reach 98%, as follows.
simple_net = nn.Sequential( nn.Linear(28*28,30), nn.ReLU(), nn.Linear(30,1))
learn = Learner(dls, simple_net, opt_func=SGD, loss_func=mnist_loss, metrics=batch_accuracy)
Note the dimension consistency between the first linear layer (28*28, 30) and the second (30, 1). What we are doing here is collapsing the features from 28x28 down to 30. There is no clear answer as to what the right number should be, but the following rule of thumb "almost always" applies:
- deeper networks (more layers) with fewer nodes (smaller input-output dimensions) are better than shallower networks with more nodes.
That means it is faster and more accurate to condense the input more times into lower dimensions.
We can test the performance of the model using a pretrained resnet (residual net) with 18 layers:
dls = ImageDataLoaders.from_folder(untar_data(URLs.MNIST_SAMPLE))
learn = vision_learner(dls, resnet18, pretrained=False, loss_func=F.cross_entropy, metrics=accuracy)
learn.fit_one_cycle(1, 0.1)
We not only reach >99% accuracy, we do it in a single epoch. This is another key takeaway.
Since deeper neural networks and transfer learning require less training time and less code while achieving better performance, we should try transfer learning to beat our baseline before developing a very customized model. That is what makes fastai, for me, a very important library for practitioners.
That's all for this lesson, as usual I close with the questionnaire and hope to see you in the next.
Questionnaire
- How is a grayscale image represented on a computer? How about a color image? A grayscale image can be represented as a two-dimensional array of pixel values (a single channel); a color image adds a third dimension with three channels (red, green and blue).
- How are the files and folders in the MNIST_SAMPLE dataset structured? Why? There are two main folders, train and valid, and within each, every digit has its own folder; this folder structure is used to label the data.
- Explain how the "pixel similarity" approach to classifying digits works. The pixel similarity approach calculates the "ideal" digit as the pixel-wise average of all observed images of that digit. Each new image is then compared with the ideal digits, and the one at the smallest distance is the prediction. For example, if the distance between an image and the ideal 2 is the lowest, the image analyzed is predicted to be a 2.
- What is a list comprehension? Create one now that selects odd numbers from a list and doubles them. A list comprehension lets you express loop-like operations in a single expression; it is cleaner and usually faster than an explicit for loop, and leveraging NumPy arrays or tensors instead can speed up the process by orders of magnitude.
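One possible answer, using an illustrative list of numbers:

nums    = [1, 2, 3, 4, 5, 6, 7]                  # illustrative input
doubled = [2 * x for x in nums if x % 2 == 1]    # keeps the odd numbers, doubled: [2, 6, 10, 14]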
- What is a "rank-3 tensor"? The rank is the number of dimensions (axes) of a tensor, so a rank-3 tensor has three axes. A stack of grayscale images like the MNIST training set (images x height x width) is a rank-3 tensor.
- What is the difference between tensor rank and shape? How do you get the rank from the shape? The rank is the number of axes in a tensor, while the shape is the size of each axis; the rank is simply the length of the shape. A tensor containing 1,000 black-and-white images at 28x28 resolution has shape (1000, 28, 28) and therefore rank 3.
- What are RMSE and L1 norm? They are two different ways of measuring distance. The first squares the differences, takes their mean and then the square root; the second takes the mean of the absolute differences. The first puts more weight on large discrepancies than the latter.
- How can you apply a calculation on thousands of numbers at once, many thousands of times faster than a Python loop? Using broadcasting one can compute for example the distance between each digit and the ideal digit. If you create a function like the following:
def mnist_distance(a,b): return (a-b).abs().mean((-1,-2))
One can use it for a single image a, or a tensor of 1000 images.
- Create a 3×3 tensor or array containing the numbers from 1 to 9. Double it. Select the bottom-right four numbers.
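One possible answer, sketched with PyTorch:

import torch

t = torch.arange(1, 10).view(3, 3)   # values 1..9 reshaped into a 3x3 tensor
t = t * 2                            # double every element
corner = t[1:, 1:]                   # bottom-right four numbers: [[10, 12], [16, 18]]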
- What is broadcasting? Broadcasting lets us combine tensors of different ranks: the smaller tensor is treated as if it were expanded to match the larger one, without actually duplicating the data. It can be used, for example, to compare a tensor containing the mean ("ideal") digit against every single image tensor.
- Are metrics generally calculated using the training set, or the validation set? Why? It makes more sense to calculate metrics on the validation set: we want to see how the model performs on the KPI we care about (the metric) on unseen data, because that indicates whether the model generalizes well within the same domain.
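A tiny illustration of that idea (the values are made up):

import torch

row   = torch.tensor([1., 2., 3.])   # rank-1 tensor, shape (3,)
batch = torch.zeros(4, 3)            # rank-2 tensor, shape (4, 3)
batch + row                          # row is broadcast across all 4 rows, without copying data
batch > row                          # comparisons broadcast the same way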
- What is SGD? Stochastic gradient descent allows us to adjust the weights of our model in the direction that reduces our loss function.
- Why does SGD use mini-batches? Mini-batches speed up gradient calculation, because we do not wait until errors are computed over all samples, while each batch is still big enough to be representative. The batch size is also key to making sure the data fits in GPU memory.
- What are the seven steps in SGD for machine learning? Initialize the weights randomly, make a prediction, calculate the loss, calculate the gradients, adjust the weights based on the gradients and the learning rate, repeat from the prediction step, and stop when performance is good enough or the training budget is exhausted.
- How do we initialize the weights in a model? Randomly. Alternatively, one can use transfer learning and take the initial weights from a pretrained model to reach good performance faster.
- What is "loss"? The loss is a well-behaved (differentiable) function that is as close as possible to our target metric.
- Why can't we always use a high learning rate? Because the updates can overshoot the minimum, so training bounces around or diverges and never reaches the optimum.
- What is a "gradient"? The derivative of the loss with respect to one weight: it tells us how the loss changes for a very tiny change in that weight, keeping the other weights fixed, and therefore in which direction to adjust it.
- Do you need to know how to calculate gradients yourself? No: PyTorch can compute the derivatives for you, so you do not have to program them yourself. It is enough to make sure the loss function is differentiable.
- Why can't we use accuracy as a loss function? Accuracy barely changes with small weight changes, making the gradient zero almost everywhere and stopping the process of learning.
- Draw the sigmoid function. What is special about its shape? The sigmoid squashes the predicted value into the range 0 to 1: very negative inputs map close to zero and positive inputs map close to 1, with a smooth curve in between that is well suited to gradient-based training.
- What is the difference between a loss function and a metric? The metric is what we care about for our use case; the loss is a well-behaved (differentiable) function as close as possible to that metric, used to drive training.
- What is the function to calculate new weights using a learning rate? The optimizer step, which for plain SGD updates each weight as w -= w.grad * lr.
- What does the DataLoader class do? The DataLoader takes the training or validation data set and serves it in mini-batches, performing very useful actions along the way such as shuffling and augmentations.
- Write pseudocode showing the basic steps taken in each epoch for SGD.
- initialize the weights
- make prediction
- calculate loss
- calculate gradient
- update weights with gradient and step size defined from the learning rate
- go back to step 2 if the validation metric is not yet good enough, or stop once the training time exceeds the budget
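In Python, one epoch of that loop looks roughly like this (a sketch assuming dl, model, loss_func, params and lr are already defined):

for xb, yb in dl:                  # take a mini-batch
    preds = model(xb)              # make a prediction
    loss = loss_func(preds, yb)    # calculate the loss
    loss.backward()                # calculate the gradients
    for p in params:
        p.data -= p.grad * lr      # update the weights
        p.grad.zero_()             # reset gradients for the next batch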
- Create a function that, if passed two arguments [1,2,3,4] and 'abcd', returns [(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd')]. What is special about that output data structure?
def Dlanload(data, classes):
    return list(zip(data, classes))   # pair each data item with its label

data = [1, 2, 3, 4]
classes = ['a', 'b', 'c', 'd']
print(Dlanload(data, classes))        # [(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd')]
This data structure is a list of tuples, pairing each data item with its class; a list of (input, label) tuples is exactly the structure a PyTorch dataset is expected to provide.
- What does
view
do in PyTorch? . We'll concatenate them all into a single tensor, and also change them from a list of matrices (a rank-3 tensor) to a list of vectors (a rank-2 tensor). We can do this usingview
, which is a PyTorch method that changes the shape of a tensor without changing its contents.-1
is a special parameter toview
that means "make this axis as big as necessary to fit all the data": - What are the "bias" parameters in a neural network? Why do we need them? the bias is a random number that we add to the network to minimize the chance that the weights are too close to zero to push for any learning to happen (too little gradients).
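A small sketch (with made-up shapes) tying view, the bias term and the @ operator together:

import torch

imgs    = torch.rand(64, 28, 28)      # a batch of 64 grayscale images
xb      = imgs.view(-1, 28*28)        # -1 infers the batch axis: shape becomes (64, 784)
weights = torch.randn(28*28, 1) * 0.01
bias    = torch.zeros(1)
preds   = xb @ weights + bias         # @ is matrix multiplication; bias shifts the output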
- What does the @ operator do in Python? Between two tensors (as in xb @ weights) it performs matrix multiplication, which is how it is used in this chapter. Placed on the line above a function or class definition it instead applies a decorator, a powerful Python tool that wraps a function to extend its behaviour without permanently modifying it.
- What does the backward method do? To calculate the gradients we call backward on the loss. But this loss was itself calculated by mse, which in turn took preds as an input, which was calculated using f taking as an input params, which was the object on which we originally called requires_grad_, which is the original call that now allows us to call backward on loss. This chain of function calls represents the mathematical composition of functions, which enables PyTorch to use calculus's chain rule under the hood to calculate these gradients.
- Why do we have to zero the gradients? Because PyTorch accumulates gradients: every call to backward adds the newly computed gradients to whatever is already stored in p.grad. If we did not zero them after each step, the update would use the sum of gradients from all previous batches instead of just the current one.
- What information do we have to pass to Learner? The DataLoaders, the model architecture, the optimizer, the loss function, and the metric(s) to display.
- Show Python or pseudocode for the basic steps of a training loop. Initialize the weights, predict, calculate the loss, calculate the gradients, step the weights, and repeat or stop (the same loop sketched in code after the SGD pseudocode above).
- What is "ReLU"? Draw a plot of it for values from -2 to +2. A ReLU function turns any negative value into zero and keeps any positive value unchanged. Together with multiple layers, it provides the nonlinear behaviour that makes deep learning such a flexible model.
- What is an "activation function"? An activation function takes the outputs of a layer and transforms them, usually nonlinearly, before they are passed to the next layer.
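To draw the ReLU asked about above, a quick sketch assuming matplotlib is available:

import torch
import matplotlib.pyplot as plt

x = torch.linspace(-2, 2, 100)
plt.plot(x, torch.relu(x))            # ReLU: 0 for x < 0, x for x >= 0
plt.title("ReLU on [-2, 2]")
plt.show()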
- What's the difference between F.relu and nn.ReLU? F.relu is a plain function that applies the transformation directly, while nn.ReLU is a layer we place in the network that behaves as an activation function. nn.ReLU is a PyTorch module that does exactly the same thing as the F.relu function. Most functions that can appear in a model also have identical forms that are modules; generally, it's just a case of replacing F with nn and changing the capitalization. When using nn.Sequential, PyTorch requires us to use the module version, and since modules are classes we have to instantiate them, which is why you see nn.ReLU() in this example.
- The universal approximation theorem shows that any function can be approximated as closely as needed using just one nonlinearity. So why do we normally use more? The reason is performance. With a deeper model (that is, one with more layers) we do not need as many parameters; it turns out that we can use smaller matrices with more layers and get better results than we would with larger matrices and fewer layers.