Convolutional Neural Networks (Part 3)

The previous parts can be applied directly to image classification, that is, answering whether or not an object is in the image. If we want to know where the object is, or to detect multiple objects, we need to adjust our networks to solve the localization problem.

The intuition behind localization

The idea is simple: on top of the class prediction, we need to predict the center point of the object plus its height and width, which together define a bounding box. The target y will then hold, for each object, the coordinates of the center point (bx, by), the height (bh), the width (bw) and the class (c). So for every object in an image, we need 4 more data points than before.

The loss function depends on the error. Here we may want to use squared error, as we are no longer only classifying the class. When there is an object, we would like to minimize the distance between the actual and the predicted points. When there is no object, we can simply consider the classification error only.
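To make the two cases concrete, here is a minimal sketch of such a loss for a single image. The vector layout [p, bx, by, bh, bw], where p flags whether an object is present, and the function name are assumptions for illustration, and the class scores are left out to keep it short.

```python
def localization_loss(y_true, y_pred):
    """Squared-error loss for one image.

    Both vectors are laid out as [p, bx, by, bh, bw], where p is 1 when
    an object is present and 0 otherwise (an assumed layout).
    """
    if y_true[0] == 1:
        # Object present: penalise the presence flag and all four box values.
        return sum((t - q) ** 2 for t, q in zip(y_true, y_pred))
    # No object: the box values carry no meaning, so only the
    # presence error is counted.
    return (y_true[0] - y_pred[0]) ** 2
```

Note how, in the second case, whatever the network predicts for the box is ignored.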

Sliding windows

How do we detect objects? We need to go through the image with sliding windows, or small crops, and check whether the object is inside each one. The size of that window is key to finding a healthy balance between computational cost and accuracy. Knowing the expected size of the objects in the images certainly helps.
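A small sketch can show how quickly the number of crops grows. The function below lists the top-left corner of every square window; the names `win` and `stride` (how far the window jumps each step) are assumptions for illustration.

```python
def sliding_windows(img_h, img_w, win, stride):
    """Yield the (top, left) corner of every win x win crop in the image."""
    for top in range(0, img_h - win + 1, stride):
        for left in range(0, img_w - win + 1, stride):
            yield top, left


# A 14x14 window on a 16x16 image with a jump of 2 gives only 4 crops,
# while a jump of 1 already gives 9.
few = list(sliding_windows(16, 16, win=14, stride=2))
many = list(sliding_windows(16, 16, win=14, stride=1))
```

Each crop would then be passed through the classifier, which is where the computational cost comes from.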

As you can imagine, when the sliding window does not jump too far towards the right and bottom of the image, there is a lot of overlap between the windows. If we use a convolutional implementation of the sliding window, we can get the fully connected values of all the windows in one go.

That makes the computation faster, but it does not ensure that the window size is appropriate. The YOLO algorithm offers a solution.

YOLO algorithm: you only look once ;)

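If we break down each image into a 3x3 grid, each cell can predict its own bounding box, whose size is not tied to the size of the cell. With, for example, 8 values per cell (whether an object is present, the four bounding-box values and three class scores), this leads to a volume of 3x3x8, which is our target variable.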

That should work as long as there is one object per grid cell. For more densely populated images, a more granular grid such as 19x19, giving a 19x19x8 volume, should work too.
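As a rough sketch of how such a target volume could be built, the function below assumes that the 8 values per cell are an object flag, the four box values and three one-hot class scores. The input format (center, width, height and class id, with coordinates relative to the whole image) and all names are assumptions for illustration.

```python
def encode_target(objects, grid=3, n_classes=3):
    """Build a grid x grid x (5 + n_classes) target as nested lists.

    Each object is (cx, cy, w, h, class_id) with coordinates in [0, 1)
    relative to the whole image (an assumed input format).
    """
    depth = 5 + n_classes
    target = [[[0.0] * depth for _ in range(grid)] for _ in range(grid)]
    for cx, cy, w, h, class_id in objects:
        # The cell that contains the center point owns the object.
        col, row = int(cx * grid), int(cy * grid)
        cell = target[row][col]
        cell[0] = 1.0                  # an object is present in this cell
        cell[1] = cx * grid - col      # bx, relative to the cell
        cell[2] = cy * grid - row      # by, relative to the cell
        cell[3] = h * grid             # bh, in units of cell height
        cell[4] = w * grid             # bw, in units of cell width
        cell[5 + class_id] = 1.0       # one-hot class
    return target


target = encode_target([(0.5, 0.5, 0.2, 0.4, 1)])
```

Here the single object lands in the middle cell, and every other cell stays at zero.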

Evaluating object localization

How do we evaluate whether the predicted bounding box is accurate? We can compute the ratio between the intersection and the union of the predicted and actual boxes (intersection over union), and if it is above a threshold of around 0.5 to 0.6, we can consider the prediction correct.
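The ratio is short enough to write out. This sketch assumes boxes are given as corner coordinates (x1, y1, x2, y2) rather than as center, height and width; that format and the function name are assumptions for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap, clipped at zero when the boxes are apart.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0
```

Two 2x2 boxes shifted by one unit in each direction share an area of 1 out of a union of 7, so they would fall well below the threshold.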

Note that we may localize the same object multiple times, normally when the window is much smaller than the object. Non-max suppression allows you to avoid counting the same object several times. The first step is to keep the window with the highest probability and suppress the windows that overlap heavily with it.
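A minimal sketch of that procedure, repeated until no windows are left, could look like this. The corner box format (x1, y1, x2, y2), the 0.5 overlap threshold and the function names are assumptions for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Return the indices of the boxes that survive suppression."""
    # Visit the boxes from the most to the least probable.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Drop every remaining box that overlaps heavily with the kept one.
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
kept = non_max_suppression(boxes, scores=[0.9, 0.8, 0.7])
```

The second box overlaps the first one heavily and has a lower probability, so only the first and third boxes survive.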

Overlapping objects

After we have assigned the most likely center point and bounding box to each object, it could be the case that we have multiple objects in the same location. To manage that, we can use anchor boxes to assign each object to the right tuple. The workflow is then to first create two bounding boxes per grid cell, get rid of low-probability boxes for the classes in scope and, last but not least, use non-max suppression to generate the final predictions as one single bounding box per object. The YOLO algorithm does that.
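One common way to pick the anchor box for an object is to compare shapes only: the object goes to the anchor with the highest intersection over union when both are centered on the same point. The sketch below assumes that rule and represents each shape as (width, height); the names and the two example anchors are assumptions for illustration.

```python
def shape_iou(wh_a, wh_b):
    """IoU of two boxes that share the same center, given as (width, height)."""
    inter = min(wh_a[0], wh_b[0]) * min(wh_a[1], wh_b[1])
    union = wh_a[0] * wh_a[1] + wh_b[0] * wh_b[1] - inter
    return inter / union


def best_anchor(obj_wh, anchors):
    """Index of the anchor whose shape matches the object best."""
    return max(range(len(anchors)), key=lambda i: shape_iou(obj_wh, anchors[i]))


# Two hypothetical anchors: a tall one (e.g. a pedestrian) and a wide one (e.g. a car).
anchors = [(1.0, 3.0), (3.0, 1.0)]
tall_object = best_anchor((0.9, 2.5), anchors)
wide_object = best_anchor((2.8, 1.2), anchors)
```

With two anchors, a tall and a wide object centered in the same grid cell end up in different slots of the target instead of overwriting each other.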

Semantic Segmentation

In semantic segmentation we want to map every pixel to an object and class. To do that, instead of ending with fully connected layers that predict one class, we need as many class predictions as pixels: we first compress with convolutions and later expand back again to reach the original image size. This expansion, or blow-up, requires transpose convolutions.
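To see how a transpose convolution blows up a small input, here is a plain-Python sketch for a single channel, without padding. Each input value scales a copy of the kernel, the copies are placed `stride` positions apart, and overlapping values are summed. The function name and the square-input restriction are assumptions for illustration.

```python
def transpose_conv2d(x, kernel, stride=2):
    """Transpose convolution of a square input with a square kernel, no padding."""
    n, k = len(x), len(kernel)
    size = (n - 1) * stride + k
    out = [[0.0] * size for _ in range(size)]
    for i in range(n):
        for j in range(n):
            # Stamp a copy of the kernel, scaled by x[i][j], onto the output.
            for a in range(k):
                for b in range(k):
                    out[i * stride + a][j * stride + b] += x[i][j] * kernel[a][b]
    return out


x = [[1, 2], [3, 4]]
ones_2x2 = [[1, 1], [1, 1]]
upsampled = transpose_conv2d(x, ones_2x2, stride=2)
```

A 2x2 input becomes a 4x4 output here. In a real network the kernel values are learned, so the layer learns how to fill in the larger map.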

The U-Net architecture uses skip connections from the first layers all the way to the last upsampled layers. The idea behind it is to help the neural network recover the details missed by the high-level representation of the object. The final layers become a concatenation of the abstract high-level representation and the details from the early layers.
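The concatenation itself is a simple operation: for every pixel, the channels of the early feature map are placed next to the channels of the upsampled one. The sketch below assumes feature maps stored as height x width x channels nested lists; the layout and names are assumptions for illustration.

```python
def concat_channels(skip, upsampled):
    """Concatenate two H x W x C feature maps along the channel axis."""
    if len(skip) != len(upsampled) or len(skip[0]) != len(upsampled[0]):
        raise ValueError("spatial sizes must match")
    return [
        [s_px + u_px for s_px, u_px in zip(s_row, u_row)]
        for s_row, u_row in zip(skip, upsampled)
    ]


early = [[[1, 2], [3, 4]]]      # 1 x 2 map with 2 detail channels
late = [[[9], [8]]]             # 1 x 2 map with 1 high-level channel
merged = concat_channels(early, late)
```

The merged map keeps the same height and width and has 3 channels per pixel, which the following convolutions can then mix.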

These models are not so popular, as they require a lot of labelling. As I will share in future posts, the future and present of image representations is self-supervised.

Alan Fortuny Sicart
