In this lesson, we dive deep into the process of deep learning. Note that I covered deep convolutional networks in a previous post: https://alanfortunysicart.blogspot.com/2021/12/convolutional-neural-networks-part-1.html.
Here I provide the unique perspective from fastai. It is very interesting to see how different people approach deep learning in different ways. I find the following section very useful for understanding the importance of tensors in ensuring high-speed training on large data sets. Even if you are familiar with deep learning, you will probably learn something about programming or deep learning you did not know before (at least I did).
As with the other posts, let me start with the key lessons from the section.
- Before jumping into deep learning methods, it is important to define a baseline: a simple yet reasonably good model that gives us an initial value of our metric(s), the KPIs we care about for our application.
- Image data, and high-dimensional data in general, is still processed as a bunch of numbers, which can be computed very fast using tensors (the NumPy-array equivalent for large data).
- Tensors are a key data type for speeding up data management and training, since they let us leverage GPU computing.
- Fastai is rooted in PyTorch, which is a mid-level framework for implementing deep learning. Instead of going to PyTorch from scratch, I suggest starting with what fastai can do, and only dropping to the lower-level PyTorch code if more flexibility is required (my expectation is that >90% of problems can be approached with fastai).
- For less proficient programmers, list comprehensions should be learned and used instead of for loops.
- Using broadcasting, one can apply a function or a comparison of any type between a single low-dimensional tensor and many higher-dimensional tensors. It is very fast and the code looks awesome.
- One can approximate almost any function with a couple of layers and one nonlinear activation function. BUT deeper neural networks, with more layers and fewer nodes per layer, perform better on most metrics and in terms of compute.
- The digit data set is key to show the value added by deep learning, and also to understand the complexity of computing gradient descent from scratch, or even of using PyTorch instead of fastai.
Let's go through the key code snippets and comments from the following notebook below:
Creating a baseline - an example of a handwritten digit classifier
Using Stochastic Gradient Descent
- The first one is the difference between the loss function and the metric. Note that to calculate the gradients and ensure learning actually happens (gradients not equal to zero), we need bias terms and, most importantly, a loss function that is differentiable. In many cases where accuracy is the metric we care about, the loss function should be in line with the metric but must be well behaved. In our case we use the following:
- The second important topic is the use of batches and DataLoaders. Batches let us speed up gradient calculation, since we do not have to go through all the data for every update, while remaining robust. The batch size is key to making sure our GPUs do not run out of memory; I have seen many applications use a batch size of around 200 images as a reference (for a standard resolution of 224x224). DataLoaders are a great way to create the batches used during training.
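In the notebook this loss is mnist_loss. As a reference, a sketch along the lines of the book's version: the sigmoid squashes raw predictions into (0, 1), so the distance to the 0/1 target is smooth and differentiable.

import torch

def mnist_loss(predictions, targets):
    predictions = predictions.sigmoid()          # squash raw outputs into (0, 1)
    return torch.where(targets == 1,             # distance to the correct 0/1 label
                       1 - predictions, predictions).mean()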
- The first example is a list pairing numbers and letters.
- For this example, note that we list and zip our train and label data sets:
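A minimal sketch of that idea, using PyTorch's DataLoader (fastai's behaves similarly) on illustrative data:

from torch.utils.data import DataLoader

data   = [1, 2, 3, 4, 5, 6]
labels = list('abcdef')
dset   = list(zip(data, labels))            # [(1, 'a'), (2, 'b'), ...]
dl     = DataLoader(dset, batch_size=2)     # group (x, y) pairs into mini-batches
for xb, yb in dl:
    print(xb, yb)                           # prints each mini-batch of inputs and labels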
- The third one is the PyTorch implementation of gradient descent; the following code snippet contains the most important PyTorch functions to achieve it:
def train_epoch(model, lr, params):
    for xb, yb in dl:                 # loop over mini-batches from the DataLoader
        calc_grad(xb, yb, model)      # forward pass + backward pass: fills p.grad
        for p in params:
            p.data -= p.grad * lr     # gradient descent step
            p.grad.zero_()            # reset gradients for the next batch
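For reference, the calc_grad helper used above looks roughly like this in the book: a forward pass, the loss, then backpropagation.

def calc_grad(xb, yb, model):
    preds = model(xb)                 # forward pass on the mini-batch
    loss = mnist_loss(preds, yb)      # compute the differentiable loss
    loss.backward()                   # backpropagate: fills p.grad for each parameter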
Understanding what fastai Learner does:
learn = Learner(dls, nn.Linear(28*28,1), opt_func=SGD, loss_func=mnist_loss, metrics=batch_accuracy)
learn.fit(10, lr=lr)
With those two lines of code we can perform the very same calculations we did with PyTorch, but in a much more compact way. What we need to define is:
- our DataLoader object dls
- our model, in this case a simple linear model with 28*28 input pixels and one layer
- our optimization engine, in our case basic stochastic gradient descent
- our loss function, in our case one that is close to the key metric for the MNIST data set
- our key metric, in our case the accuracy
Once we put all this into the Learner (we can leverage predefined functions for each), we only have to define for how many epochs (in the example, 10) and with which learning rate (in our runs, 0.00001) we want to train.
Adding NonLinearity and using Transfer Learning
The previous model gave us 95% accuracy, higher than the baseline but not the 99% that state-of-the-art models can reach. We can add nonlinearity and an additional layer to our model to reach 98%, as follows.
simple_net = nn.Sequential( nn.Linear(28*28,30), nn.ReLU(), nn.Linear(30,1))
learn = Learner(dls, simple_net, opt_func=SGD, loss_func=mnist_loss, metrics=batch_accuracy)
Note the dimension consistency between the first linear layer (28*28, 30) and the second (30, 1). What we are doing here is collapsing the features from 28x28 down to 30. There is no clear answer as to what the right number should be, but the following rule of thumb "almost always" applies:
- deeper networks (more layers) with fewer nodes (smaller input-output dimensions) are better than shallower networks with more nodes.
That means it is faster and more accurate to condense the input more times into lower dimensions.
We can test the performance of the model using a pretrained resnet (residual net) with 18 layers:
dls = ImageDataLoaders.from_folder(untar_data(URLs.MNIST_SAMPLE))
learn = vision_learner(dls, resnet18, pretrained=False, loss_func=F.cross_entropy, metrics=accuracy)
learn.fit_one_cycle(1, 0.1)
We not only reach >99% accuracy, we do it in a single epoch. This is another key takeaway.
Since deeper neural networks and transfer learning require less training time and less code while achieving better performance, we should try transfer learning to beat our baseline before developing a very customized model. That is what makes fastai, for me, a very important library for practitioners.
That's all for this lesson, as usual I close with the questionnaire and hope to see you in the next.
Questionnaire
- How is a grayscale image represented on a computer? How about a color image? A grayscale image can be represented as a two-dimensional array of pixel values (a single channel); a color image adds a third dimension with three channels (red, green and blue).
- How are the files and folders in the MNIST_SAMPLE dataset structured? Why? There are two main folders, train and valid, and within each, every digit has its own folder; this folder structure is used to label the data.
- Explain how the "pixel similarity" approach to classifying digits works. The pixel similarity approach calculates the "ideal" digit as the pixel-wise average of all observed images of that digit. Each new image is then compared with the ideal digits, and the one at the smallest distance is the prediction. For example, if the distance between an image and the ideal 2 is the lowest, the image analyzed is predicted to be a 2.
- What is a list comprehension? Create one now that selects odd numbers from a list and doubles them. A list comprehension lets you express loop-like operations in a single expression; it is cleaner and usually faster than an explicit for loop, and leveraging NumPy arrays or tensors instead can speed up the process by orders of magnitude.
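One possible answer, using an illustrative list of numbers:

nums    = [1, 2, 3, 4, 5, 6, 7]                  # illustrative input
doubled = [2 * x for x in nums if x % 2 == 1]    # keeps the odd numbers, doubled: [2, 6, 10, 14]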
- What is a "rank-3 tensor"? The rank is the number of dimensions (axes) of a tensor, so a rank-3 tensor has three axes. A stack of grayscale images like the MNIST training set (images x height x width) is a rank-3 tensor.
- What is the difference between tensor rank and shape? How do you get the rank from the shape? The rank is the number of axes in a tensor, while the shape is the size of each axis; the rank is simply the length of the shape. A tensor containing 1,000 black-and-white images at 28x28 resolution has shape (1000, 28, 28) and therefore rank 3.
- What are RMSE and L1 norm? They are two different ways of measuring distance. The first squares the differences, takes their mean and then the square root; the second takes the mean of the absolute differences. The first puts more weight on large discrepancies than the latter.
- How can you apply a calculation on thousands of numbers at once, many thousands of times faster than a Python loop? Using broadcasting one can compute for example the distance between each digit and the ideal digit. If you create a function like the following:
def mnist_distance(a,b): return (a-b).abs().mean((-1,-2))
One can use it for a single image a, or a tensor of 1000 images.
- Create a 3×3 tensor or array containing the numbers from 1 to 9. Double it. Select the bottom-right four numbers.
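One possible answer, sketched with PyTorch:

import torch

t = torch.arange(1, 10).view(3, 3)   # values 1..9 reshaped into a 3x3 tensor
t = t * 2                            # double every element
corner = t[1:, 1:]                   # bottom-right four numbers: [[10, 12], [16, 18]]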
- What is broadcasting? Broadcasting lets us combine tensors of different ranks: the smaller tensor is treated as if it were expanded to match the larger one, without actually duplicating the data. It can be used, for example, to compare a tensor containing the mean ("ideal") digit against every single image tensor.
- Are metrics generally calculated using the training set, or the validation set? Why? It makes more sense to calculate metrics on the validation set: we want to see how the model performs on the KPI we care about (the metric) on unseen data, because that indicates whether the model generalizes well within the same domain.
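A tiny illustration of that idea (the values are made up):

import torch

row   = torch.tensor([1., 2., 3.])   # rank-1 tensor, shape (3,)
batch = torch.zeros(4, 3)            # rank-2 tensor, shape (4, 3)
batch + row                          # row is broadcast across all 4 rows, without copying data
batch > row                          # comparisons broadcast the same way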
- What is SGD? Stochastic gradient descent allows us to adjust the weights of our model in the direction that reduces our loss function.
- Why does SGD use mini-batches? Mini-batches speed up gradient calculation, because we do not wait until errors are computed over all samples, while each batch is still big enough to be representative. The batch size is also key to making sure the data fits in GPU memory.
- What are the seven steps in SGD for machine learning? Initialize the weights randomly, make a prediction, calculate the loss, calculate the gradients, adjust the weights based on the gradients and the learning rate, repeat from the prediction step, and stop when performance is good enough or the training budget is exhausted.
- How do we initialize the weights in a model? Randomly. Alternatively, one can use transfer learning and take the initial weights from a pretrained model to reach good performance faster.
- What is "loss"? The loss is a well-behaved (differentiable) function that is as close as possible to our target metric.
- Why can't we always use a high learning rate? Because the updates can overshoot the minimum, so training bounces around or diverges and never reaches the optimum.
- What is a "gradient"? The derivative of the loss with respect to one weight: it tells us how the loss changes for a very tiny change in that weight, keeping the other weights fixed, and therefore in which direction to adjust it.
- Do you need to know how to calculate gradients yourself? No: PyTorch can compute the derivatives for you, so you do not have to program them yourself. It is enough to make sure the loss function is differentiable.
- Why can't we use accuracy as a loss function? Accuracy barely changes with small weight changes, making the gradient zero almost everywhere and stopping the process of learning.
- Draw the sigmoid function. What is special about its shape? The sigmoid squashes the predicted value into the range 0 to 1: very negative inputs map close to zero and positive inputs map close to 1, with a smooth curve in between that is well suited to gradient-based training.
- What is the difference between a loss function and a metric? The metric is what we care about for our use case; the loss is a well-behaved (differentiable) function as close as possible to that metric, used to drive training.
- What is the function to calculate new weights using a learning rate? The optimizer step, which for plain SGD updates each weight as w -= w.grad * lr.
- What does the DataLoader class do? The DataLoader takes the training or validation data set and serves it in mini-batches, performing very useful actions along the way such as shuffling and augmentations.
- Write pseudocode showing the basic steps taken in each epoch for SGD.
- initialize the weights
- make prediction
- calculate loss
- calculate gradient
- update weights with gradient and step size defined from the learning rate
- go back to step 2 if the validation metric is not yet good enough, or stop once the training time exceeds the budget
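In Python, one epoch of that loop looks roughly like this (a sketch assuming dl, model, loss_func, params and lr are already defined):

for xb, yb in dl:                  # take a mini-batch
    preds = model(xb)              # make a prediction
    loss = loss_func(preds, yb)    # calculate the loss
    loss.backward()                # calculate the gradients
    for p in params:
        p.data -= p.grad * lr      # update the weights
        p.grad.zero_()             # reset gradients for the next batch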
- Create a function that, if passed two arguments [1,2,3,4] and 'abcd', returns [(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd')]. What is special about that output data structure?
def Dlanload(data, classes):
    return list(zip(data, classes))   # pair each data item with its label

data = [1, 2, 3, 4]
classes = ['a', 'b', 'c', 'd']
print(Dlanload(data, classes))        # [(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd')]
This data structure is a list of tuples, pairing each data item with its class; a list of (input, label) tuples is exactly the structure a PyTorch dataset is expected to provide.
- What does
view
do in PyTorch? . We'll concatenate them all into a single tensor, and also change them from a list of matrices (a rank-3 tensor) to a list of vectors (a rank-2 tensor). We can do this usingview
, which is a PyTorch method that changes the shape of a tensor without changing its contents.-1
is a special parameter toview
that means "make this axis as big as necessary to fit all the data": - What are the "bias" parameters in a neural network? Why do we need them? the bias is a random number that we add to the network to minimize the chance that the weights are too close to zero to push for any learning to happen (too little gradients).
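A small sketch (with made-up shapes) tying view, the bias term and the @ operator together:

import torch

imgs    = torch.rand(64, 28, 28)      # a batch of 64 grayscale images
xb      = imgs.view(-1, 28*28)        # -1 infers the batch axis: shape becomes (64, 784)
weights = torch.randn(28*28, 1) * 0.01
bias    = torch.zeros(1)
preds   = xb @ weights + bias         # @ is matrix multiplication; bias shifts the output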
- What does the @ operator do in Python? Between two tensors (as in xb @ weights) it performs matrix multiplication, which is how it is used in this chapter. Placed on the line above a function or class definition it instead applies a decorator, a powerful Python tool that wraps a function to extend its behaviour without permanently modifying it.
- What does the backward method do? To calculate the gradients we call backward on the loss. But this loss was itself calculated by mse, which in turn took preds as an input, which was calculated using f taking as an input params, which was the object on which we originally called requires_grad_, which is the original call that now allows us to call backward on loss. This chain of function calls represents the mathematical composition of functions, which enables PyTorch to use calculus's chain rule under the hood to calculate these gradients.
- Why do we have to zero the gradients? Because PyTorch accumulates gradients: every call to backward adds the newly computed gradients to whatever is already stored in p.grad. If we did not zero them after each step, the update would use the sum of gradients from all previous batches instead of just the current one.
- What information do we have to pass to Learner? The DataLoaders, the model architecture, the optimizer, the loss function, and the metric(s) to display.
- Show Python or pseudocode for the basic steps of a training loop. Initialize the weights, predict, calculate the loss, calculate the gradients, step the weights, and repeat or stop (the same loop sketched in code after the SGD pseudocode above).
- What is "ReLU"? Draw a plot of it for values from -2 to +2. A ReLU function turns any negative value into zero and keeps any positive value unchanged. Together with multiple layers, it provides the nonlinear behaviour that makes deep learning such a flexible model.
- What is an "activation function"? An activation function takes the outputs of a layer and transforms them, usually nonlinearly, before they are passed to the next layer.
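To draw the ReLU asked about above, a quick sketch assuming matplotlib is available:

import torch
import matplotlib.pyplot as plt

x = torch.linspace(-2, 2, 100)
plt.plot(x, torch.relu(x))            # ReLU: 0 for x < 0, x for x >= 0
plt.title("ReLU on [-2, 2]")
plt.show()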
- What's the difference between F.relu and nn.ReLU? F.relu is a plain function that applies the transformation directly, while nn.ReLU is a layer we place in the network that behaves as an activation function. nn.ReLU is a PyTorch module that does exactly the same thing as the F.relu function. Most functions that can appear in a model also have identical forms that are modules; generally, it's just a case of replacing F with nn and changing the capitalization. When using nn.Sequential, PyTorch requires us to use the module version, and since modules are classes we have to instantiate them, which is why you see nn.ReLU() in this example.
- The universal approximation theorem shows that any function can be approximated as closely as needed using just one nonlinearity. So why do we normally use more? The reason is performance. With a deeper model (that is, one with more layers) we do not need as many parameters; it turns out that we can use smaller matrices with more layers and get better results than we would with larger matrices and fewer layers.