What if you could use Artificial Intelligence to enhance your photos like those seen on TV? Image super-resolution is a technique that uses deep learning to increase the resolution of your images so that you can zoom into them. Check out this hilarious video:

What is Image Super-Resolution?

Image super-resolution is a software technique that lets us enhance the spatial resolution of an image with the existing hardware.

Low-Resolution (LR) Image: The pixel density within the image is small, hence it offers few details.

High-Resolution (HR) Image: The pixel density within the image is large, hence it offers a lot of detail.

A technique which is used to reconstruct a high-resolution image from one or many low-resolution images by restoring the high-frequency details is called “Super-Resolution”.

  Low-Resolution Image

Original High-Resolution Image

When we simply resize images in OpenCV or SciPy, traditional methods such as interpolation are used: the values of new pixels are approximated from nearby pixel values. These methods leave much to be desired in terms of visual quality, as details (e.g. sharp edges) are often not preserved.
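To make that limitation concrete, here is a minimal NumPy sketch of the simplest interpolation method, nearest-neighbor (OpenCV and SciPy default to more sophisticated methods like bilinear interpolation, but the principle is the same): every new pixel is derived purely from nearby existing pixels, so no new detail can appear.

```python
import numpy as np

def upscale_nearest(img, factor):
    """Upscale a 2-D image by repeating each pixel `factor` times
    along both axes (nearest-neighbor interpolation)."""
    return np.repeat(np.repeat(img, factor, axis=0), factor, axis=1)

lr = np.array([[10, 20],
               [30, 40]], dtype=np.uint8)
hr = upscale_nearest(lr, 2)
print(hr)
# [[10 10 20 20]
#  [10 10 20 20]
#  [30 30 40 40]
#  [30 30 40 40]]
```

The 4×4 result contains no more information than the 2×2 input, which is exactly why sharp edges come out blocky or blurred.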

Interpolated Image

Here we use deep learning, specifically Generative Adversarial Networks, to learn to predict these pixel values. The trained model can then be used to generate high-resolution images, with details, from low-resolution images.

Understanding Deep Learning-Based Super-Resolution:

Okay, let’s think about how we would build a convolutional neural network to train a model that increases the spatial size by a factor of 4. As we know, a convolution operation on its own does not increase the size of its input. So, we will have to use a deconvolution (fractionally strided convolution) or a similar layer (such as a sub-pixel convolutional layer) so that the output image is 4 times the input size. Training data is simple to obtain: we can collect a large number of high-resolution (HR) images off the internet and downsize them by a factor of 4; these downsized versions are our low-resolution (LR) images. We feed the low-resolution image (let’s say a size of 20×20) to the network and train it to generate the high-resolution image (80×80). The objective of the network is to reduce the mean-squared error (MSE) between the pixels of the generated image and the ground-truth image.
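The sub-pixel convolutional layer mentioned above can be sketched in NumPy. In that layer, a convolution first produces r² output channels for every target channel, and a “pixel shuffle” rearranges those channels into spatial resolution. The sketch below shows only the rearrangement step, not the convolution that precedes it:

```python
import numpy as np

def pixel_shuffle(x, r):
    """Rearrange a (C*r*r, H, W) array into (C, H*r, W*r).
    This is the reshuffling step of a sub-pixel convolutional layer:
    channel depth is traded for spatial resolution."""
    c_r2, h, w = x.shape
    assert c_r2 % (r * r) == 0
    c = c_r2 // (r * r)
    x = x.reshape(c, r, r, h, w)      # split channels into an r x r grid
    x = x.transpose(0, 3, 1, 4, 2)    # reorder to (c, h, r, w, r)
    return x.reshape(c, h * r, w * r)

# A 20x20 feature map with 16 channels becomes one 80x80 channel (4x upscale).
feat = np.random.rand(16, 20, 20)
out = pixel_shuffle(feat, 4)
print(out.shape)  # (1, 80, 80)
```

This matches the 20×20 → 80×80 setup described above: the network does its work at low resolution and only expands to full size at the end, which is cheaper than deconvolving early.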


MSE = (1 / (M × N)) × Σᵢ Σⱼ [ f(i, j) − g(i, j) ]²

f      represents the matrix of the original image

g      represents the matrix of the reconstructed high-resolution image

M    represents the number of rows of pixels of the image and i represents the index of that row

N    represents the number of columns of pixels of the image and j represents the index of that column
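The MSE objective is straightforward to compute; a small NumPy sketch:

```python
import numpy as np

def mse(f, g):
    """Mean-squared error between two images of equal shape."""
    f = f.astype(np.float64)
    g = g.astype(np.float64)
    return np.mean((f - g) ** 2)

original = np.array([[255, 0], [0, 255]], dtype=np.uint8)
reconstructed = np.array([[250, 5], [0, 255]], dtype=np.uint8)
print(mse(original, reconstructed))  # (25 + 25 + 0 + 0) / 4 = 12.5
```

Note the cast to float before subtracting: with 8-bit unsigned integers, differences would otherwise wrap around instead of going negative.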

Hypothetically, if we got this error down to zero, it would mean our network is able to generate good-quality high-resolution images. So, we define a metric for quality:

Peak Signal to Noise Ratio (PSNR):

    It measures the deviation of the generated high-resolution image from the original image (the natural high-resolution image). It is defined as the ratio between the maximum pixel value (peak signal) of the input image (e.g. if the input image is of 8-bit unsigned integer data type, this value will be 255) and the MSE between the pixel values of the reconstructed image and the original image, expressed on a logarithmic scale:

PSNR = 10 × log₁₀( maxvalue² / MSE )

maxvalue  represents the maximum pixel value that exists in the input image (original target image)

The higher the PSNR, the better the quality of the reconstructed image, since a higher PSNR corresponds to a lower MSE relative to the maximum pixel value of the input image.
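A minimal NumPy implementation of PSNR as defined above (the `maxvalue=255.0` default assumes 8-bit images):

```python
import numpy as np

def psnr(original, reconstructed, maxvalue=255.0):
    """Peak Signal-to-Noise Ratio in decibels:
    PSNR = 10 * log10(maxvalue^2 / MSE)."""
    err = np.mean((original.astype(np.float64) -
                   reconstructed.astype(np.float64)) ** 2)
    if err == 0:                  # identical images: PSNR is infinite
        return float("inf")
    return 10.0 * np.log10(maxvalue ** 2 / err)

a = np.full((2, 2), 100, dtype=np.uint8)
b = a.copy()
b[0, 0] = 110                     # MSE = 100 / 4 = 25
print(psnr(a, b))                 # ~34.15 dB
```

As the example shows, even a single changed pixel in a tiny image drops the PSNR well below infinity; typical good reconstructions land in the 25–40 dB range for 8-bit images.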

Unfortunately, maximizing PSNR alone doesn’t completely work for us: the generated images can be overly smooth and don’t look perceptually real. So, we use a pretrained deep neural network (say VGG19 or AlexNet) as a feature extractor and take the difference between the feature maps of the generated image and the ground-truth image as the loss for our network. This loss is called content loss.
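The idea can be sketched as follows. Note that `toy_features` below is a deliberate stand-in (simple 2×2 average pooling) so the example runs without a deep-learning framework; in practice you would take activations from an intermediate layer of a pretrained network such as VGG19:

```python
import numpy as np

def toy_features(img):
    """Stand-in feature extractor: 2x2 average pooling.
    This is NOT what real content loss uses -- a pretrained network
    like VGG19 would provide the feature maps in practice."""
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def content_loss(generated, target):
    """MSE computed between feature maps instead of raw pixels."""
    fg = toy_features(generated.astype(np.float64))
    ft = toy_features(target.astype(np.float64))
    return np.mean((fg - ft) ** 2)
```

Because the comparison happens in feature space, the loss rewards images whose structures match the target, rather than images that merely minimize per-pixel error by averaging (which is what produces the overly smooth look).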

Discriminator network and Adversarial Loss:

As mentioned earlier, PSNR is not a perfect metric for judging whether an image looks real. It would be better if we could judge the quality of the generated image and reject the ones which are not realistic. That’s why we use another network that predicts how good/realistic the generated image is. This network is called the discriminator network, as it tries to predict whether the images produced by the generator are realistic or not.

So, effectively we use a generative adversarial network (GAN) consisting of two networks: the generator takes a low-resolution image as input and produces a high-resolution image as output, and the discriminator decides how perceptually real the generated image is. The discriminator also provides feedback to the generator, called the adversarial loss, based on the predicted probability that the generated image is realistic. The generator uses this as a cue to improve. Here is the complete training schema:
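A sketch of how the two losses combine for the generator. The −log D(G(LR)) form and the small 10⁻³ weighting follow the common SRGAN-style formulation; treat them as illustrative defaults rather than the only choice:

```python
import numpy as np

def adversarial_loss(d_fake):
    """Generator's adversarial loss: -log D(G(LR)).
    d_fake is the discriminator's predicted probability (0..1) that
    the generated image is real; the generator wants it close to 1."""
    return -np.log(np.clip(d_fake, 1e-12, 1.0))

def generator_loss(content, d_fake, weight=1e-3):
    """Total generator loss: content loss plus a small weighted
    adversarial term (SRGAN-style illustrative weighting)."""
    return content + weight * adversarial_loss(d_fake)

# When the discriminator is confident the image is fake (d_fake near 0),
# the adversarial term grows and pushes the generator to do better.
print(generator_loss(content=0.5, d_fake=0.9))
print(generator_loss(content=0.5, d_fake=0.01))
```

The discriminator is trained in alternation with its own objective (classify real HR images as real and generated ones as fake), which is what drives the feedback loop described above.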

In the beginning, the generator produces very bad quality images, but as training continues, it starts to produce realistic ones. By the time training is complete, the discriminator can no longer reliably distinguish between the original high-resolution image and the one created by the generator.

Is Hollywood/CSI version of super-resolution even possible?:

As you can understand, we can’t create information that wasn’t there in the original image. The new pixels we create are just very good guesses based on training. However, super-resolution does improve the picture so that it’s easier for humans to interpret. So, if we have a low-resolution image in which the information is present but missing pixels make it harder for humans to interpret, super-resolution can certainly help. In many cases, though, the original low-resolution images shown on TV are extremely hard for humans to interpret at all, which is what makes the “enhance” moment feel like magic.

Overall, super-resolution is a pretty cool application of deep learning. It’s now possible to build very cool image-enhancer software that automatically applies super-resolution to images. It goes without saying that, as is the case with many deep learning models, it’s highly effective to train domain-specific models (such as for faces or license plates), where we have seen super-resolution really do wonders!