Keras is winning the world of deep learning. In this tutorial, we shall learn how to use Keras and transfer learning to produce state-of-the-art results using very small datasets. We shall provide complete training and prediction code. For this comprehensive guide, we shall be using the VGG network, but the techniques learned here can be used to finetune AlexNet, Inception, ResNet or any other custom network architecture. In a previous tutorial, we used 2000 images of dogs and cats to get a classification accuracy of 80%. However, with transfer learning, we shall achieve 98% accuracy with just 500 images each of the dog and cat classes.

Why do Neural Networks need huge data?

Deep learning is the new state of the art, but it needs a huge amount of training data. Why is that? Multilayer feed-forward networks are universal approximators, i.e. given enough hidden units they can approximate any continuous function to arbitrary accuracy. So, there are no theoretical constraints on their success. However, in the real world there are many constraints, like insufficient data and limited compute. Fortunately, some network architectures (like AlexNet, VGG, Inception, ResNet etc.) have been found to be great at the task of learning, and they have been openly shared by the vibrant computer vision community. However, most of these networks have millions of parameters. If we train these networks with small datasets, the result is overfitting: the network works for the examples in the training data, or for very similar examples, but does not generalize well (it does not work on additional examples).

Learning with little data (Transfer learning aka Fine-tuning):

In practice, instead of training our networks from scratch, the common approach is to start from a network that has first been trained on the 1.2 million images belonging to 1000 different classes of the ImageNet dataset. These weights are saved, and such saved weights are called ImageNet pretrained weights. When one starts working on a specific problem where only a small amount of training data is available, one takes these pretrained weights and continues training. One should keep in mind that the images for the task should be similar to the ImageNet dataset, otherwise the previous learning will not be that useful. For example, if we have to train on medical images like MRI and X-ray images, ImageNet pretrained models may not be of great use. Here is a loose analogy to transfer learning: kids first learn to read letter by letter, and eventually, when they are familiar with the language, they start reading whole words. Pretraining our networks on ImageNet selects our weights in such a way that the network becomes familiar with the kind of images that are common in our world. When we then train with small data during transfer learning, it’s easier to reach the weights that solve our problem.

Strategies for Fine tuning:

  1. Linear SVM on top of bottleneck features

    If you have very little data, it won’t be possible to do much training. The best strategy in this case is to train a linear SVM on top of the output of the convolutional layers just before the fully connected layers (also called bottleneck features).

  2. Just Replace and train the last layer

     ImageNet pretrained models have 1000 outputs from the last layer. You can replace this with your own softmax layer; for example, to build a 5-class classifier, the softmax layer will have 5 outputs. Back-propagation is then run to train the new weights. In this case, however, we are likely to overfit, so a lot of data augmentation and proper cross-validation are important.

  3. Train only last few layers

     Depending on the amount of data available and the complexity of the problem you are solving, you can choose to freeze (not change the weights during backpropagation) the first few layers and train only the last few layers. The initial layers of a convolutional neural network learn general features like edges, while the deeper layers learn the specific shapes and parts of objects; it is these deeper layers that are trained in this method. A similar method is to use a zero or very small learning rate for the initial layers and a higher learning rate for the deeper layers.

  4. Freeze, Pre-train and Finetune(FPT)

    It’s one of the most effective techniques in my experience, and this is exactly what I will demonstrate in the code below. There are two steps involved here: a) Freeze and Pretrain: First, replace the last layer with a mini network of 2 small fully connected layers. Now, freeze all the pretrained layers and train the new network. Save the weights of this network (let’s call them pretrained weights). b) Finetune: Load the pretrained weights and train the complete network with a smaller learning rate. This results in very good accuracy even with small datasets.

  5. Train all the layers:

    In case you are fortunate to have millions of examples for your training, you can start with pretrained weights but train the complete network.
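As a minimal sketch of strategy 3 (train only the last few layers), the helper below freezes every layer that comes before a chosen layer. The name freeze_until is our own and not part of Keras; the helper only assumes that the model exposes a layers list whose items have name and trainable attributes, as Keras models do.

```python
def freeze_until(model, first_trainable_layer):
    """Freeze all layers before `first_trainable_layer`; keep the rest trainable."""
    names = [layer.name for layer in model.layers]
    if first_trainable_layer not in names:
        raise ValueError("no layer named %r in the model" % first_trainable_layer)

    trainable = False
    for layer in model.layers:
        if layer.name == first_trainable_layer:
            trainable = True  # from this layer onwards, weights will be updated
        layer.trainable = trainable
    return model
```

For the Keras VGG16 model, calling freeze_until(model, "block5_conv1") would train only the last convolutional block and whatever comes after it. The model should be compiled after the trainable flags are changed so that the change takes effect.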

In the following section, we shall use fine tuning on VGG16 network architecture to solve a dog vs cat classification problem.

Finetuning VGG16 using Keras:

VGG was proposed by a research group at Oxford in 2014. This network was once very popular due to its simplicity and some nice properties, such as working well on both image classification and detection tasks. The VGG network has many variants, but we shall be using VGG-16, which is made up of 5 convolutional blocks followed by 2 fully connected layers and an output layer. See below:


Figure: VGG-16 architecture

The input to the network is a 224 x 224 RGB image, and the network is as follows (output shapes are given after the max pooling that ends each conv block):

  1. Conv Block-1: Two conv layers with 64 filters each. output shape: 112 x 112 x 64
  2. Conv Block-2: Two conv layers with 128 filters each. output shape: 56 x 56 x 128
  3. Conv Block-3: Three conv layers with 256 filters each. output shape: 28 x 28 x 256
  4. Conv Block-4: Three conv layers with 512 filters each. output shape: 14 x 14 x 512
  5. Conv Block-5: Three conv layers with 512 filters each. output shape: 7 x 7 x 512
  6. Fc1 (fully connected layer 1): output shape: 1 x 1 x 4096
  7. Fc2 (fully connected layer 2): output shape: 1 x 1 x 4096
  8. Output (predictions): output shape: 1 x 1 x 1000 (for ImageNet)
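The shapes in this list can be checked with a little arithmetic. The sketch below is plain Python, not Keras code. It assumes the standard VGG-16 configuration: 3 x 3 convolutions that preserve the spatial size, and a 2 x 2 max pooling with stride 2 at the end of every block. It also counts the parameters, which shows why such a network overfits on small datasets.

```python
# (name, number of conv layers, filters per conv layer) for each VGG-16 block
BLOCKS = [
    ("block1", 2, 64),
    ("block2", 2, 128),
    ("block3", 3, 256),
    ("block4", 3, 512),
    ("block5", 3, 512),
]


def vgg16_summary(size=224, in_channels=3, num_classes=1000):
    """Return the output shape of every conv block and the total parameter count."""
    shapes, params, channels = {}, 0, in_channels
    for name, n_convs, filters in BLOCKS:
        for _ in range(n_convs):
            # 3x3 kernel weights plus one bias per filter
            params += 3 * 3 * channels * filters + filters
            channels = filters
        size //= 2  # 2x2 max pooling halves height and width
        shapes[name] = (size, size, channels)

    features = size * size * channels  # flattened input to fc1
    for units in (4096, 4096, num_classes):
        params += features * units + units  # weights plus biases
        features = units
    return shapes, params


shapes, total = vgg16_summary()
for name, shape in shapes.items():
    print(name, shape)
print("total parameters:", total)
```

With the default arguments this reports 138,357,544 parameters, of which more than 100 million belong to fc1 alone.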

There are two steps to our training methodology, Freeze, Pre-train and Finetune (FPT):

Step-1: Freeze and Pre-train

Step-2: Finetune the pretrained weights.

Let’s go through them one by one.

Step-1: Freeze and Pre-train

Here are the 5 steps that we shall do to perform pre-training:

 1. Gather Training and testing dataset:

We shall use the 1000 images each of cats and dogs that are included with this repository for training. We shall show how we are able to achieve more than 90% accuracy with little training data during pretraining. The training script receives the paths of the training and testing data and the number of classes via command-line arguments.
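Reading these command-line arguments could look roughly like the sketch below. The flag names (--train_dir, --val_dir, --num_classes) and the example paths are hypothetical and chosen only for illustration; the actual script may use different names.

```python
import argparse


def parse_args(argv=None):
    """Parse the dataset locations and the number of classes (flag names are illustrative)."""
    parser = argparse.ArgumentParser(description="Pre-train a small classifier on top of VGG16")
    parser.add_argument("--train_dir", required=True, help="folder with one sub-folder per class")
    parser.add_argument("--val_dir", required=True, help="folder with the testing images")
    parser.add_argument("--num_classes", type=int, default=2, help="number of output classes")
    return parser.parse_args(argv)


# Passing an explicit list here stands in for what would normally come from sys.argv
args = parse_args(["--train_dir", "data/train", "--val_dir", "data/val"])
print(args.train_dir, args.val_dir, args.num_classes)
```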


 2. Image Augmentation:

It’s standard practice in computer vision to augment the training dataset and prepare many examples from whatever data we have. Common augmentations include slight rotations, flipping images and small random crops. The idea is to add small perturbations without damaging the central object, so that the neural network is more robust to these kinds of real-world variations. Fortunately, Keras provides a mechanism to perform these kinds of data augmentation quickly. ImageDataGenerator is a built-in Keras mechanism that uses Python generators, ensuring that we don’t load the complete dataset into memory; rather, it accesses the training/testing images only when it needs them. Another important thing to note here is that we normalize by dividing each pixel value in all the images by 255; since we are using 8-bit images, each pixel value is then between 0 and 1. This is also loosely called pre-processing of the input images for the VGG network. (Keras also ships a VGG-specific preprocess_input function that subtracts the ImageNet channel means instead; here we use the simpler division by 255.) Whatever normalization is used during training must also be applied at prediction time. Hence, if you miss this, you will get very bad predictions.
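A minimal sketch of such generators is shown below. The augmentation strengths, the batch size and the helper name make_iterators are assumptions made for illustration, not values taken from the original script. In recent TensorFlow releases ImageDataGenerator is marked as deprecated, but it remains available.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Training images: rescale to [0, 1] and add small random perturbations
train_datagen = ImageDataGenerator(
    rescale=1.0 / 255,
    rotation_range=10,        # assumed value: rotate by up to 10 degrees
    width_shift_range=0.1,    # assumed value: shift by up to 10% of the width
    height_shift_range=0.1,   # assumed value: shift by up to 10% of the height
    horizontal_flip=True,
)

# Testing images: only rescale, so that evaluation sees unmodified images
val_datagen = ImageDataGenerator(rescale=1.0 / 255)


def make_iterators(train_dir, val_dir, batch_size=32):
    """Build lazy iterators; each directory needs one sub-folder per class."""
    train_iter = train_datagen.flow_from_directory(
        train_dir, target_size=(224, 224), batch_size=batch_size, class_mode="categorical")
    val_iter = val_datagen.flow_from_directory(
        val_dir, target_size=(224, 224), batch_size=batch_size,
        class_mode="categorical", shuffle=False)
    return train_iter, val_iter
```

flow_from_directory assigns class indices in alphabetical order of the sub-folder names, which is worth remembering when reading the predictions later.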

 3. Network graph and pre-trained weights:

We have already described the architecture above. Since Keras provides a VGG16 implementation, we shall reuse that. This code is in a file in the network folder. If we specify include_top as True, we get exactly the same network as was used for ImageNet pretraining, with 1000 output classes. But since we only want to classify between dog and cat, we shall take only the initial 5 convolutional blocks.

Fine-tuning is one of the most common ways to solve problems using AI and deep learning. That’s why Keras provides us with weights without the top, along with the complete ImageNet weights.

So, we call the above code for the VGG16 model with include_top set to False and with the ImageNet weights.


4. Append our network and choose fine-tuning parameters:

Okay, we have created the graph for the convolutional blocks of the VGG16 network, loaded with ImageNet pretrained weights. Pay attention here: for fine-tuning during this experiment, we say that we are happy with the weights of these layers, so we don’t change them (we freeze these layers). Rather, we shall change the weights of the small network that we will add in the next step.

Now, we add a mini network on top of that with two final outputs representing probabilities for cat and dog.
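Steps 3 and 4 can be sketched together as follows. This is an illustration under assumptions, not the original script: the function name build_pretrain_model, the 256 hidden units and the dropout rate of 0.2 are our own choices. With weights="imagenet", Keras downloads the no-top weight file on first use.

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16


def build_pretrain_model(num_classes=2, input_shape=(224, 224, 3),
                         weights="imagenet", dropout=0.2):
    """VGG16 convolutional blocks (frozen) plus a small trainable classifier."""
    # include_top=False keeps only the 5 convolutional blocks
    conv_base = VGG16(weights=weights, include_top=False, input_shape=input_shape)

    # Freeze the pretrained layers: their weights stay fixed during pre-training
    for layer in conv_base.layers:
        layer.trainable = False

    # Mini network of two fully connected layers (sizes are assumed values)
    x = layers.Flatten()(conv_base.output)
    x = layers.Dense(256, activation="relu")(x)
    x = layers.Dropout(dropout)(x)
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    return models.Model(inputs=conv_base.input, outputs=outputs)
```

Calling build_pretrain_model() gives a model whose only trainable weights are the kernels and biases of the two new dense layers; model.summary() shows the split between trainable and non-trainable parameters.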

Now, we are ready for pre-training. In the next step, we shall specify the optimizer, loss etc and start the training.

 5. Specify the optimizer, loss etc and start the training.

We shall be using Tensorboard, which is a very powerful visualization tool used to observe the training process of neural networks. You can install it with pip (pip install tensorboard).

What Tensorboard does is that it provides us an option to write the value of any variable used during training to a directory called logdir. These written values can be read and shown in your browser via a webserver that Tensorboard runs. So, when we create a Tensorboard instance we specify the location of this logdir on your computer.

This will create a log folder and save all the Tensorboard data there. During or after the training, we can start the Tensorboard server by running the tensorboard command with its --logdir option pointing to this folder. This will start a webserver on your local machine at port 6006.

Now, you can visualize the training process in your browser by opening http://localhost:6006.

We shall use Tensorboard via the Keras callback utility, a nice built-in Keras utility for running specific functions at specific times during training, like the beginning or end of epochs.

We shall also use the callback utility to specify the path and name of the trained model.
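The two callbacks and the training call could be set up roughly as below. The log folder name, the choice of monitoring val_loss, the Adam optimizer, the number of epochs and the helper names make_callbacks and pretrain are all assumptions made for this sketch; only the model file name comes from this tutorial.

```python
from tensorflow.keras.callbacks import ModelCheckpoint, TensorBoard


def make_callbacks(log_dir="logs", model_path="cv-tricks_pretrained_model.h5"):
    """Tensorboard logging plus saving of the best model seen so far."""
    tensorboard = TensorBoard(log_dir=log_dir)  # writes training curves to log_dir
    checkpoint = ModelCheckpoint(model_path, monitor="val_loss", save_best_only=True)
    return [tensorboard, checkpoint]


def pretrain(model, train_iter, val_iter, epochs=10):
    """Compile and train the frozen-base model (optimizer and epochs are assumed)."""
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model.fit(train_iter,
                     validation_data=val_iter,
                     epochs=epochs,
                     callbacks=make_callbacks())
```

Keras calls each callback at the end of every epoch, so the curves in Tensorboard update while the training is still running.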

Now, we can start the training by running our pretraining script with the appropriate paths.

Our training will now start, and after pretraining we shall achieve close to 90% accuracy with only 500 examples of each class, which is impressive considering that in a previous tutorial we were able to achieve only 80% accuracy with more than 2000 examples of each class.

You can see the training and validation accuracy graphs using Tensorboard. Start the Tensorboard server on the log folder as described above.

Opening http://localhost:6006/ in the browser shows the graphs.

Step-2: Finetuning:

In the finetuning step, we shall load the weights (cv-tricks_pretrained_model.h5) saved in the pretraining phase. The network will remain the same, but we shall not freeze any layers, i.e. the weights of all the layers will change during training.

Another important thing to note here is that we will not be loading the ImageNet pretrained weights.

Rather, we load our own pretrained weights into the network.

Important Note:

During finetuning, we already have a model which is very good, so we don’t want to change the weights too much. So, we use an optimizer with a very small learning rate. In general, SGD is a good choice for this, as opposed to adaptive methods like Adam.
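The finetuning setup can be sketched as follows. The learning rate of 1e-4, the momentum of 0.9 and the helper name prepare_for_finetuning are assumptions made for illustration; the weight file name is the one saved during pretraining.

```python
from tensorflow.keras.optimizers import SGD


def prepare_for_finetuning(model, weights_path=None, learning_rate=1e-4):
    """Load our own pretrained weights, unfreeze everything, use a small-step SGD."""
    if weights_path is not None:
        model.load_weights(weights_path)  # e.g. "cv-tricks_pretrained_model.h5"

    # No layer stays frozen: all weights may change during finetuning
    for layer in model.layers:
        layer.trainable = True

    # Compile after changing the trainable flags so that the change takes effect
    model.compile(optimizer=SGD(learning_rate=learning_rate, momentum=0.9),
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

The model passed in must have the same architecture as the one that produced the weight file, otherwise the weights cannot be loaded.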

I observed slight overfitting during training, so I increased the dropout to 0.3. Finally, I ran the fine-tuning script to start the finetuning process, which gives us a nice cool 98% accuracy with just 500 images of each class.


Now, let’s run the prediction script on a new image to see if our newly trained model is able to identify cats and dogs. We shall build the same network graph and load the weights that we have trained (cv-tricks_fine_tuned_model.h5). We shall use numpy to load our image and run prediction on it.

This prints a list which contains the probabilities for the image containing cat or dog.

[[  2.86891848e-13   1.00000000e+00]]

This implies that the image contains a dog and that the model is very confident about it. Complete code with data can be found here.
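Reading the output can be made explicit with a few lines of plain Python. The sketch assumes that index 0 stands for cat and index 1 for dog, which matches the output above and the alphabetical ordering of class folders; the helper name decode_prediction is our own.

```python
CLASS_NAMES = ["cat", "dog"]  # assumed order: index 0 is cat, index 1 is dog


def decode_prediction(probabilities, class_names=CLASS_NAMES):
    """Return the most likely class and its probability for a one-image batch."""
    row = probabilities[0]  # the batch contains a single image
    best = max(range(len(row)), key=lambda i: row[i])
    return class_names[best], row[best]


label, confidence = decode_prediction([[2.86891848e-13, 1.00000000e+00]])
print(label, confidence)  # prints: dog 1.0
```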

And that’s how it’s done. Congratulations! You have learned a very important lesson in your journey to learn AI. It would be a nice idea to pick up a new dataset and train your own classifiers. Or, even better, use a different network like InceptionV3 (which is one of my favorites due to its high accuracy/computation ratio). Feel free to share your experience in the comments.