Recognition Problems
There are plenty of applications where we need to recognize a face, a species, or a car plate from a single image. This is challenging because we must get it right after seeing that specific object only once; training a classifier with one sample per object would not work either. What we can do instead is learn a similarity function between two images, which gives the likelihood that the same object appears in both. We can then decide whether the object is in our reference data by comparing the similarity score against a threshold.
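The decision step above can be sketched as follows. This is a minimal illustration, assuming embeddings already produced by some encoder; `cosine_similarity`, `same_object`, and the 0.8 threshold are hypothetical names and values, not part of any specific library:

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def same_object(emb_query, emb_reference, threshold=0.8):
    """Decide whether two images likely contain the same object,
    by comparing the similarity of their encodings to a threshold."""
    return cosine_similarity(emb_query, emb_reference) >= threshold
```

In practice the threshold is tuned on a validation set to trade off false accepts against false rejects.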
Siamese Networks and the Triplet Loss
To compute that similarity function, siamese networks tend to work fairly well. These networks encode each image into a lower-dimensional vector, from which we can compute a distance metric (cosine, Euclidean...). Training such networks requires a specific loss function, the triplet loss. In the triplet loss we have an anchor (the image to evaluate), a positive sample, and a negative sample. The positive sample corresponds to the same object (product, face, species...) and the negative to a different one.
We want the encoded representations of the anchor and the positive sample to be close to each other, and those of the anchor and the negative sample to be far apart.
Formally, you iterate over triplets of images (anchor, positive, negative) from your training set:
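For each triplet, the standard loss uses the squared Euclidean distance between embeddings plus a margin. A minimal sketch (the function name is illustrative; `alpha=0.2` is a common margin choice, but it is a tunable hyperparameter):

```python
import numpy as np

def triplet_loss(f_a, f_p, f_n, alpha=0.2):
    """Triplet loss for one (anchor, positive, negative) triple.
    f_a, f_p, f_n are the encoder's embedding vectors; alpha is the margin."""
    d_pos = np.sum((f_a - f_p) ** 2)  # squared distance anchor-positive
    d_neg = np.sum((f_a - f_n) ** 2)  # squared distance anchor-negative
    # Loss is zero once the negative is at least alpha farther than the positive
    return float(max(d_pos - d_neg + alpha, 0.0))
```

Note the `max(..., 0)`: once the negative is sufficiently farther away than the positive, the triplet contributes nothing, so in practice training focuses on "hard" triplets that still violate the margin.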
Given the amount of images required to reach state-of-the-art performance, one may want to use transfer learning from a pretrained model, for example: https://pypi.org/project/facenet-pytorch/
It is also possible to train the siamese network as a binary classification problem (anchor vs. positive = 1, anchor vs. negative = 0). That simplifies the data preparation for training. The Deep Learning Specialization does not make clear which approach to prefer, as apparently both work well.
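One common form of this binary head is a logistic regression on the element-wise absolute difference of the two embeddings. A sketch under that assumption (`pair_probability`, `w`, and `b` are illustrative names; in a real network `w` and `b` would be learned jointly with the encoder):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pair_probability(f_1, f_2, w, b):
    """Binary-classification head for a siamese pair: logistic
    regression on the element-wise |difference| of the embeddings.
    Returns P(same object) for the pair."""
    return float(sigmoid(np.dot(w, np.abs(f_1 - f_2)) + b))
```

With this formulation each pair is a labeled training example, so no triplet mining is needed.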
Neural Style Transfer
By visualizing the inputs that maximally activate a network's units, we can get an intuition of how neural nets learn the features of an image. The following paper shows that earlier layers capture shapes and contours that are quite detailed at a small scale, while deeper layers extract more complex and abstract patterns (Zeiler & Fergus, "Visualizing and Understanding Convolutional Networks", LNCS 8689).
The content cost is based on how different the activations of a chosen layer are between the content image C and the generated image G: the more different they are, the higher the cost. The chosen layer should represent the content of the image more than its style. By the same line of reasoning, for the style we try to minimize the style distance between the generated image G and the "inspiration" (style) image S. Empirically, it is recommended to use multiple layers for the style cost function, as that leads to the best results.
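The two cost terms can be sketched numerically, assuming the layer activations have already been extracted from the network and reshaped to (channels, height × width). The style term uses the Gram matrix of channel correlations; the normalization constants follow Gatys et al.'s formulation, but exact constants vary between implementations:

```python
import numpy as np

def content_cost(a_C, a_G):
    """Squared difference between the activations of a chosen layer
    for the content image C and the generated image G."""
    return float(0.5 * np.sum((a_C - a_G) ** 2))

def gram_matrix(A):
    """Style ('Gram') matrix: correlations between channels.
    A has shape (n_channels, height * width)."""
    return A @ A.T

def style_cost_layer(a_S, a_G):
    """Squared difference between the Gram matrices of the style
    image S and the generated image G at one layer, normalized
    by the layer's size."""
    n_c, n_hw = a_S.shape
    G_S, G_G = gram_matrix(a_S), gram_matrix(a_G)
    return float(np.sum((G_S - G_G) ** 2) / (4 * n_c ** 2 * n_hw ** 2))
```

The total style cost is then a weighted sum of `style_cost_layer` over several layers, and the overall objective minimized with respect to the pixels of G is a weighted combination of the content and style costs.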