When working with high-resolution images, or when the application leaves very little margin for error (autonomous driving, medical analysis), it makes sense to consider high-resolution image translation. This post aims to summarize the following paper:
High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs
Conditional GANs have proven to provide fairly good translations from sketches to images, but their applications have been limited to lower-resolution images. The paper develops an approach that performs high-resolution image translation at 2048 × 1024 with very realistic results.
Introduction
Creating realistic representations of the world is computationally expensive if every dimension and detail has to be modeled explicitly. It therefore becomes necessary to find lightweight approaches that can render reality realistically from a simple abstraction as input.
The paper uses semantic label maps as inputs and creates high-definition images.
The method addresses the main issues of conditional GANs: the difficulty of generating high-resolution images and the lack of detail and realistic textures. To do so, it changes the adversarial learning objective and introduces new multi-scale generator and discriminator architectures.
An interesting feature of this network is that it allows changing certain labels or objects of the input, which affects the generated image. That could be an interesting application for product design.
Related Work
Generative adversarial networks model the natural image distribution by forcing the generated samples to be indistinguishable from natural images. Inspired by their successes, this paper proposes new coarse-to-fine generator and multi-scale discriminator architectures suitable for conditional image generation at a much higher resolution.
In image-to-image translation, the goal is to translate an input image from one domain (sketches, label maps, segmentation maps) to another domain, given input-output image pairs as training data.
The adversarial loss has been preferred over the L1 loss, which often leads to blurry images. The reason is that the discriminator can learn a trainable loss function and automatically adapt to the differences between the generated and real images in the target domain.
Conditional GANs struggle to generate high-resolution images due to training instability and optimization issues. Changes in the loss function can overcome that problem and yield high-resolution generated images.
There has been some success in letting users interact with the image-creation process, but existing models either do not allow clear disentanglement of objects (style transfer) or do not generate high-resolution images. The proposed framework overcomes these two limitations.
Instance-Level Image Synthesis
The framework uses the pix2pix model, explained in Part 2 of this post series, as a baseline. We will jump right into how to increase realism and resolution.
We decompose the generator into two sub-networks: G1 and G2. We term G1 the global generator network and G2 the local enhancer network. The generator is then given by the tuple G = {G1, G2}, as visualized in Fig. 3. The global generator network operates at a resolution of 1024 × 512, and the local enhancer network outputs an image with a resolution that is 4× the output size of the previous one (2× along each image dimension).
During training, we first train the global generator and then train the local enhancer, in the order of their resolutions. We then jointly fine-tune all the networks together. We use this generator design to effectively aggregate global and local information for the image synthesis task. We note that such a multi-resolution pipeline is a well-established practice in computer vision, and two scales are often enough.
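As a toy illustration of this two-scale composition (with identity functions standing in for the actual convolutional sub-networks, which are not shown here), the data flow can be sketched in NumPy:

```python
import numpy as np

def downsample(x, factor=2):
    """Average-pool an (H, W, C) array by `factor` along each spatial dim."""
    h, w, c = x.shape
    return x.reshape(h // factor, factor, w // factor, factor, c).mean(axis=(1, 3))

def upsample(x, factor=2):
    """Nearest-neighbour upsampling by `factor` along each spatial dim."""
    return x.repeat(factor, axis=0).repeat(factor, axis=1)

def coarse_to_fine(label_map, g1, g2_front, g2_back):
    """Sketch of the two-scale generator G = {G1, G2}: G1 runs on the
    2x-downsampled input, and its output features are fused with the
    front-end features of the local enhancer G2 before G2's back end
    produces the full-resolution image."""
    coarse_in = downsample(label_map)      # e.g. the 1024x512 input to G1
    g1_feat = g1(coarse_in)                # global features at the coarse scale
    g2_feat = g2_front(label_map)          # local features at the full scale
    fused = g2_feat + upsample(g1_feat)    # element-wise sum of both streams
    return g2_back(fused)                  # final image at 2x per dimension

# Hypothetical identity stand-ins for the sub-networks:
identity = lambda x: x
out = coarse_to_fine(np.zeros((8, 8, 3)), identity, identity, identity)
print(out.shape)  # (8, 8, 3)
```

The point of the sketch is only the resolution bookkeeping: G2 sees the full-resolution input, while G1 works at half the resolution per dimension and injects global context back into G2.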
Multi-scale discriminators. High-resolution image synthesis poses a significant challenge to the GAN discriminator design. To differentiate high-resolution real and synthesized images, the discriminator needs to have a large receptive field. This would require either a deeper network or larger convolutional kernels, both of which would increase the network capacity and potentially cause overfitting. Both choices also demand a larger memory footprint for training, which is already a scarce resource for high-resolution image generation. To address the issue, we propose using multi-scale discriminators: 3 discriminators that have an identical network structure but operate at different image scales.
The discriminators D1, D2 and D3 are then trained to differentiate real and synthesized images at the 3 different scales, respectively. Although the discriminators have an identical architecture, the one that operates at the coarsest scale has the largest receptive field. It has a more global view of the image and can guide the generator to generate globally consistent images. On the other hand, the discriminator at the finest scale encourages the generator to produce finer details. This also makes training the coarse-to-fine generator easier, since extending a low-resolution model to a higher resolution only requires adding a discriminator at the finest level, rather than retraining the model from scratch.
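A minimal NumPy sketch of the multi-scale setup (with a trivial "realness score" standing in for the real convolutional discriminator, which is a hypothetical simplification):

```python
import numpy as np

def downsample(img, factor=2):
    """Average-pool an (H, W, C) image by `factor` per spatial dim."""
    h, w, c = img.shape
    return img.reshape(h // factor, factor, w // factor, factor, c).mean(axis=(1, 3))

def multi_scale_outputs(img, discriminator, num_scales=3):
    """Run the same discriminator on an image pyramid, mimicking
    D1, D2, D3 at full, 1/2 and 1/4 resolution."""
    outputs = []
    for _ in range(num_scales):
        outputs.append(discriminator(img))
        img = downsample(img)  # halve resolution for the next discriminator
    return outputs

# Hypothetical stand-in discriminator: "score" = mean intensity.
score = lambda x: float(x.mean())
outs = multi_scale_outputs(np.ones((16, 16, 3)), score)
print(len(outs))  # 3
```

Each scale sees the same image content at a different resolution, so the coarsest discriminator effectively has the largest receptive field relative to the original image.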
We improve the GAN loss by incorporating a feature matching loss based on the discriminator. This loss stabilizes the training, as the generator has to produce natural statistics at multiple scales. Specifically, we extract features from multiple layers of the discriminator and learn to match these intermediate representations between the real and the synthesized image. Our GAN discriminator feature matching loss is related to the perceptual loss, which has been shown to be useful for image super-resolution and style transfer. Our full objective combines both the GAN loss and the feature matching loss.
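The feature-matching idea can be sketched as an L1 distance between discriminator feature maps, averaged over layers (a simplified sketch, not the paper's exact layer weighting):

```python
import numpy as np

def feature_matching_loss(real_feats, fake_feats):
    """Mean L1 distance between discriminator features of the real
    and the synthesized image, averaged over the layers provided."""
    total = 0.0
    for fr, ff in zip(real_feats, fake_feats):
        total += np.abs(fr - ff).mean()
    return total / len(real_feats)

# Toy features from two hypothetical discriminator layers:
real = [np.ones((4, 4)), np.zeros((2, 2))]
fake = [np.zeros((4, 4)), np.zeros((2, 2))]
print(feature_matching_loss(real, fake))  # (1.0 + 0.0) / 2 = 0.5
```

In the full objective this term is added to the adversarial loss, so the generator is rewarded for matching the discriminator's intermediate statistics, not just for fooling its final output.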
For high-resolution generation we need both semantic segmentation maps and object boundaries. The reason is that when objects of the same class are next to one another, the semantic label map alone cannot tell them apart. To extract this information, we first compute the instance boundary map (Fig. 4b). The instance boundary map is then concatenated with the one-hot vector representation of the semantic label map and fed into the generator network. Similarly, the input to the discriminator is the channel-wise concatenation of the instance boundary map, the semantic label map, and the real/synthesized image. Figure 5b shows an example demonstrating the improvement from using object boundaries.
To facilitate manipulation of the generator at the object level, the authors propose adding low-dimensional feature channels as additional input to the generator network. Manipulating these features gives flexible control over the image synthesis process. Furthermore, since the feature channels are continuous quantities, the model is, in principle, capable of generating infinitely many images. To ensure the features are consistent within each instance, an instance-wise average-pooling layer is added to the output of the encoder to compute the average feature for each object instance.
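Instance-wise average pooling can be sketched directly in NumPy: every pixel's feature vector is replaced by the mean over its instance, so each object carries one uniform feature (a toy sketch, not the actual encoder layer):

```python
import numpy as np

def instance_average_pool(features, inst_map):
    """Replace each pixel's feature with the mean feature of its
    instance, so features are uniform within every object."""
    out = np.empty_like(features)
    for inst_id in np.unique(inst_map):
        mask = inst_map == inst_id          # pixels of this instance
        out[mask] = features[mask].mean(axis=0)
    return out

# 2x2 image, 1 feature channel, two instances (rows 0 and 1):
feat = np.array([[[1.0], [3.0]],
                 [[5.0], [7.0]]])
inst = np.array([[0, 0],
                 [1, 1]])
print(instance_average_pool(feat, inst)[:, :, 0])
# [[2. 2.]
#  [6. 6.]]
```

Because every instance now maps to a single feature vector, swapping that vector at inference time changes the appearance of exactly one object.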
The encoder is jointly trained with the generators and discriminators. After the encoder is trained, we run it on all instances in the training images and record the obtained features. Then we perform K-means clustering on these features for each semantic category. Each cluster thus encodes the features of a specific style, for example the asphalt or cobblestone texture of a road. At inference time, we randomly pick one of the cluster centers and use it as the encoded features. These features are concatenated with the label map and used as the input to our generator.
Results
To quantify the quality of the results, semantic segmentation is performed on the synthesized images, comparing how well the predicted segments match the input. The intuition is that if we can produce realistic images that correspond to the input label map, an off-the-shelf semantic segmentation model should be able to predict the ground-truth labels. In both pixel-wise accuracy and mean intersection-over-union (IoU), the method outperforms the other methods by a large margin. Moreover, it comes very close to the result on the original images, the theoretical "upper bound" of the realism we can achieve.
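The two metrics are standard; as a quick reminder, they can be computed like this (a generic sketch, not the paper's evaluation code):

```python
import numpy as np

def pixel_accuracy(pred, gt):
    """Fraction of pixels whose predicted label matches the ground truth."""
    return float((pred == gt).mean())

def mean_iou(pred, gt, num_classes):
    """Mean intersection-over-union, averaged over classes that appear
    in either the prediction or the ground truth."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious))

gt = np.array([[0, 0], [1, 1]])
pred = np.array([[0, 1], [1, 1]])
print(pixel_accuracy(pred, gt))  # 0.75
print(mean_iou(pred, gt, 2))     # class 0: 1/2, class 1: 2/3 -> ~0.583
```

Running a fixed segmentation model on the generated images and scoring it this way turns "does the image respect the label map?" into a number that can be compared across methods.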
Subjective studies were also performed, in which the pix2pixHD model fooled the surveyed participants more often than the other methods.
The reader is encouraged to look at Figures 10-13 for quite impressive generated images.
All in all, this approach proves better than the pix2pix baseline for high-resolution image creation and translation. What remains to be compared is the computational cost relative to the baseline.