This TensorFlow tutorial for convolutional neural networks has three parts:
1. We shall look at some of the most successful convolutional neural network architectures like Inception, AlexNet, ResNet etc.
2. In the second part, we shall go through a quick tutorial of a popular high-level and light-weight TensorFlow library called TensorFlow-Slim (TF-Slim).
3. Finally, using TF-Slim, we shall take pre-trained models of some of these networks and use them for prediction on some images.
    In the end, I shall provide the code to run prediction/inference. So, after finishing this quick tutorial, you shall have a fairly good understanding of how image classification works in practice and be able to run it on your own images.

1. Popular convolutional neural networks:

In this section, we shall talk about some of the most successful convolutional neural networks. Before that, let’s talk about ImageNet. ImageNet is a project started by Stanford professor Fei-Fei Li, in which she created a large dataset of labeled images of commonly seen real-world objects like dogs, cars, aeroplanes etc. (Her TED talk is a recommended watch). The ImageNet project is an ongoing effort and currently has 14,197,122 images from 21,841 different categories. Since 2010, ImageNet has run an annual competition in visual recognition where participants are provided with 1.2 million images belonging to 1000 different classes from the ImageNet dataset. Each participating team then builds computer vision algorithms to solve the classification problem for these 1000 classes. This competition is called the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) and is considered the annual Olympics of computer vision, with participants from across the globe including the finest of academia and industry. Most of the networks discussed below came out of ILSVRC, which works as a global benchmark.

Let’s look at the network architectures:

i) Alexnet:

Alex Krizhevsky changed the world when he won the ImageNet challenge in 2012 using a convolutional neural network for the image classification task. AlexNet achieved a top-5 accuracy of 84.7% in the classification task, while the team that stood second had a top-5 accuracy of 73.8%, which was a record-breaking and unprecedented difference. (Top-5 accuracy counts a prediction as correct if the true label is among the model’s five highest-scoring classes.) Before this, CNNs (and the people who were working on them) were not so popular among the computer vision community. However, the tables turned after this. Soon, most computer vision researchers started working on CNNs, and accuracy has improved significantly over the last 4-5 years.
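To make the top-5 numbers quoted above concrete, here is a minimal sketch of how top-k accuracy can be computed. The scores and labels are made-up values for three images and six classes, and the function name top_k_accuracy is my own; this is not the official ILSVRC evaluation code.

```python
def top_k_accuracy(scores, labels, k=5):
    """Fraction of examples whose true label is among the k highest scores."""
    hits = 0
    for row, label in zip(scores, labels):
        # Indices of the k largest scores for this example.
        top_k = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        hits += label in top_k
    return hits / len(labels)


# Hypothetical class scores for 3 images over 6 classes (made-up numbers).
scores = [
    [0.10, 0.50, 0.20, 0.05, 0.10, 0.05],
    [0.30, 0.10, 0.10, 0.25, 0.20, 0.05],
    [0.08, 0.02, 0.60, 0.10, 0.10, 0.10],
]
labels = [1, 5, 0]  # true class index of each image

top1 = top_k_accuracy(scores, labels, k=1)
top5 = top_k_accuracy(scores, labels, k=5)
print("top-1: %.2f, top-5: %.2f" % (top1, top5))
```

The third image shows why top-5 is more forgiving than top-1: its true class is not the best guess, but it is still among the five highest scores.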

Alexnet has 5 convolutional layers which are followed by 3 fully connected layers. Alex trained it on 2 GPUs as computational capacity was quite limited back in 2012.

Architecture of Alexnet which won 2012 Imagenet challenge.


AlexNet used an 11*11 filter in the first convolution layer, which later turned out to be too large and was replaced with smaller filters by the networks that followed in the coming years. Alex also used dropout, ReLU and a lot of data augmentation to improve the results.

Also, before I forget, AlexNet was very similar to LeNet, which was proposed by Yann LeCun in 1998 based on his understanding, but which didn’t have enough training data and computation capacity at the time (yes, he’s a genius! You can follow him on Twitter here).

ii) VGG:

VGG was proposed by a research group at Oxford in 2014. This network was once very popular due to its simplicity and some nice properties, like working well on both image classification and detection tasks. However, if you are looking to build production systems in 2017 and someone suggests VGG, run. This is because the VGG network is too large (the ImageNet pre-trained model is more than 500 MB) due to its large fully connected layers. There are more accurate and more computationally efficient networks available.
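The size claim can be checked with some arithmetic. The sketch below counts the weights and biases of VGG-16 (configuration D), assuming the commonly published layer shapes, a 224x224 RGB input and 4 bytes per parameter. Those assumptions are mine; the exact size of a checkpoint on disk also depends on the file format.

```python
# VGG-16 (configuration D) layer shapes, assuming a 224x224 RGB input.
# Each conv entry is (input channels, output channels); all kernels are 3x3.
CONV_CHANNELS = [
    (3, 64), (64, 64),
    (64, 128), (128, 128),
    (128, 256), (256, 256), (256, 256),
    (256, 512), (512, 512), (512, 512),
    (512, 512), (512, 512), (512, 512),
]
# Each fully connected entry is (inputs, outputs); after five 2x2 poolings
# the 224x224 input has shrunk to a 7x7 map with 512 channels.
FC_SHAPES = [(7 * 7 * 512, 4096), (4096, 4096), (4096, 1000)]

# Weights plus one bias per output channel / unit.
conv_params = sum(3 * 3 * cin * cout + cout for cin, cout in CONV_CHANNELS)
fc_params = sum(nin * nout + nout for nin, nout in FC_SHAPES)
total_params = conv_params + fc_params

size_mb = total_params * 4 / 1e6  # 4 bytes per 32-bit float parameter
fc_share = fc_params / total_params

print("conv: %d, fc: %d, total: %d" % (conv_params, fc_params, total_params))
print("approx. size: %.0f MB, fully connected share: %.0f%%"
      % (size_mb, 100 * fc_share))
```

Under these assumptions the three fully connected layers hold close to 90% of all parameters, which is why they dominate the model size.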

VGG uses 3*3 convolutions in place of the 11*11 convolution in AlexNet, which works better as 11*11 in the first layer leaves out a lot of original information. A slightly crude analogy for filter size is: think of it as breaking the image into chunks of size 11*11 and examining one chunk at a time. If you are “looking” at big chunks then you might miss finer details, while with smaller chunks you miss the context and spatial information. Don’t worry if it doesn’t make too much sense; network architecture design is much more complicated, and I shall cover it in more detail in a future post. VGG achieved 92.7% top-5 accuracy in ILSVRC 2014 but was not the winner.
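One way to see why small filters are attractive is this: a stack of 3*3 convolutions can cover the same region of the input as one large filter while using fewer weights. The sketch below is a simplification that assumes stride 1, no pooling, and the same (arbitrarily chosen) 64 channels in and out of every layer. It is not how AlexNet or VGG are actually laid out.

```python
def stacked_receptive_field(kernel, layers):
    """Input region seen by one output value after `layers` stride-1 convs."""
    return 1 + layers * (kernel - 1)


def conv_weights(kernel, channels, layers=1):
    """Weights (biases ignored) of `layers` convs with `channels` in and out."""
    return layers * kernel * kernel * channels * channels


channels = 64  # illustrative value, chosen only for this comparison

# Five stacked 3x3 layers see the same 11x11 region as a single 11x11 layer.
field_small = stacked_receptive_field(3, layers=5)
field_large = stacked_receptive_field(11, layers=1)

weights_small = conv_weights(3, channels, layers=5)
weights_large = conv_weights(11, channels, layers=1)

print("receptive field: %d vs %d" % (field_small, field_large))
print("weights: %d vs %d" % (weights_small, weights_large))
```

The stack also applies a non-linearity after every layer, so it can represent more complex functions than the single large filter.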

VGG and its variants: D and E were the most accurate and popular ones. They didn’t win the ImageNet challenge in 2014 but were widely adopted due to their simplicity.


iii) Inception: 

AlexNet was only an 8-layer deep network, while VGG and other more accurate networks that followed had more layers. This suggested that one needs to go deep to get higher accuracy, which is why this field got the name “Deep Learning”. However, Inception, proposed by a team at Google, was the first architecture which improved results by design and not by simply going deeper.

The idea of Inception was to apply filters of different sizes (i.e. 1*1, 3*3, 5*5) to the same input and then concatenate the resulting features to generate a more robust representation. So, this is how it was initially proposed.

Naive inception module which was replaced by a sophisticated one later

However, additional 1*1 convolutions were later placed before the 3*3 and 5*5 convolutions to reduce the number of input channels, and the inception module that worked was:

Refined Inception module

 You can imagine this better with this image below:

Inception module visualized: on the left is the input, which is “looked at” in different chunks, and the final output is a concatenation of all 3

The first version of the Inception network was a 22-layer network called GoogLeNet (to honor Yann LeCun’s LeNet), and it won the 2014 ImageNet challenge with 93.3% top-5 accuracy. Later versions are referred to as InceptionVN, where N is the version number, so InceptionV1, InceptionV2 etc. The size of an ImageNet pre-trained InceptionV3 model is 104 MB.

As mentioned earlier, all the network architectures before Inception simply stacked layers on top of each other. Inception was the first network that got creative with placement and proved that it’s possible to improve the accuracy and save on computation by doing that. You can think of an Inception module as a micro network inside another network. So, this also encouraged the whole community to be more creative with network designs.
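The saving in computation can be estimated by counting multiply-adds for a single 5*5 branch, with and without a 1*1 convolution in front of it. The feature map size and channel counts below (28x28 input with 192 channels, reduced to 16 channels, 32 output filters) are illustrative assumptions, and the sketch assumes stride 1 with padding that keeps the spatial size unchanged.

```python
def conv_mult_adds(height, width, kernel, cin, cout):
    """Multiply-adds of a stride-1 conv whose output is height x width."""
    return height * width * kernel * kernel * cin * cout


# Illustrative sizes (assumptions, not taken from the text above).
height = width = 28
cin, cout = 192, 32
reduced = 16  # channels left after the 1x1 "bottleneck" convolution

# 5x5 convolution applied directly to all 192 input channels.
direct = conv_mult_adds(height, width, 5, cin, cout)

# 1x1 convolution down to 16 channels, then the 5x5 convolution.
bottleneck = (conv_mult_adds(height, width, 1, cin, reduced)
              + conv_mult_adds(height, width, 5, reduced, cout))

print("direct 5x5:      %d multiply-adds" % direct)
print("1x1 then 5x5:    %d multiply-adds" % bottleneck)
print("reduction factor: %.1f" % (direct / bottleneck))
```

With these numbers the branch becomes roughly ten times cheaper, which is what makes it affordable to run several filter sizes side by side.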

iv) Resnet:

Microsoft won the ImageNet challenge in 2015 using a 152-layer ResNet network built from ResNet modules:

Resnet module proposed by Microsoft


They achieved 96.4% top-5 accuracy on ImageNet 2015. They also experimented with much deeper models, like 1000-layer networks, but accuracy dropped with models that were too deep, probably due to overfitting.
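The core of a ResNet module is a shortcut that adds the input of a block back to its output, so the layers in between only have to learn a correction F(x) rather than the full mapping. Here is a toy sketch of that idea using plain Python lists instead of feature maps. The two branch functions are hypothetical stand-ins for the convolutional layers of a real block.

```python
def residual_block(x, residual_fn):
    """Return residual_fn(x) + x, element by element (the skip connection)."""
    fx = residual_fn(x)
    return [f + xi for f, xi in zip(fx, x)]


def zero_branch(x):
    """Stand-in for layers that have learned nothing yet: outputs all zeros."""
    return [0.0 for _ in x]


def small_correction(x):
    """Stand-in for layers that learned a small adjustment to the input."""
    return [0.1 * xi for xi in x]


x = [1.0, -2.0, 3.0]

# With a zero branch the block is exactly the identity function.
identity_out = residual_block(x, zero_branch)

# Otherwise the block outputs the input plus the learned correction.
corrected_out = residual_block(x, small_correction)

print(identity_out)
print(corrected_out)
```

Because a block can fall back to the identity this easily, stacking many of them is less likely to hurt than stacking plain layers.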

v) SqueezeNet:

SqueezeNet is remarkable not for its accuracy but for how little computation it needs. SqueezeNet has accuracy levels close to that of AlexNet; however, the ImageNet pre-trained model has a size of less than 5 MB, which is great for using CNNs in a real-world application. SqueezeNet introduced a Fire module, which is made of a squeeze layer of 1*1 convolutions followed by an expand layer that mixes 1*1 and 3*3 convolutions.

SqueezeNet fire module
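To get a feel for why a Fire module is small, the sketch below counts its weights and compares them with a plain 3*3 convolution producing the same number of output channels. The channel counts (96 in, 16 squeeze filters, 64 + 64 expand filters) are illustrative assumptions, and biases are ignored.

```python
def fire_weights(cin, squeeze, expand_1x1, expand_3x3):
    """Weights of a Fire module: 1x1 squeeze, then 1x1 and 3x3 expand."""
    squeeze_w = 1 * 1 * cin * squeeze
    # Both expand branches read the small squeezed feature map.
    expand_w = 1 * 1 * squeeze * expand_1x1 + 3 * 3 * squeeze * expand_3x3
    return squeeze_w + expand_w


def conv3x3_weights(cin, cout):
    """Weights of a plain 3x3 convolution."""
    return 3 * 3 * cin * cout


# Illustrative channel counts (assumptions, not taken from the text above).
cin, squeeze, expand_1x1, expand_3x3 = 96, 16, 64, 64

fire = fire_weights(cin, squeeze, expand_1x1, expand_3x3)
plain = conv3x3_weights(cin, expand_1x1 + expand_3x3)  # same 128 outputs

print("fire module: %d weights" % fire)
print("plain 3x3:   %d weights" % plain)
```

The saving comes from two places: the 3*3 filters only see the 16 squeezed channels, and half of the expand filters are cheap 1*1 filters.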

With networks like SqueezeNet, it’s possible to reduce the amount of computation (and hence energy) required for a similar level of accuracy. “Given current power consumption by electronic computers, a computer with the storage and processing capability of the human mind would require in excess of 10 Terawatts of power, while human mind takes on 10 watts.” So, a lot of improvement is required in the efficiency of current neural networks, and SqueezeNet is one step in that direction.

Great! You are now familiar with most of the popular and useful convolutional neural networks. In the next two sections, you shall run them on your own images.


2. TensorFlow-slim tutorial: 

In order to build convolutional neural networks using the TensorFlow Python API, we need to write a lot of (boilerplate) code. So, there are multiple high-level APIs written on top of it to make this simpler. TF-Slim, written by Nathan Silberman and Sergio Guadarrama, is one of the most prominent such light-weight APIs, and it allows you to build, train and evaluate complex models easily.
In this TensorFlow tutorial, we shall learn the basics of TF-Slim and build the InceptionV1 network using it.
2.1) Installing TF-Slim

a) From Tensorflow version 1.0 and above:

TF-Slim is available with versions 1.0 and above as part of tf.contrib.slim. To check that it’s working properly, you can run the following line in the shell; it should not throw any errors:

b) For older versions:

If you happen to be running an older version of TensorFlow and can’t upgrade for whatever reason, install using the following steps:

Go to the Jenkins page for the TensorFlow project (where the status of all the latest TensorFlow builds is managed by the TensorFlow team) and take the latest nightly (if you are not familiar with Jenkins, just google continuous integration).

Jenkins for Tensorflow

Click on the relevant link; for example, if you run Ubuntu with a GPU, click on ‘nightly-matrix-linux-gpu’. On this page, you shall see a matrix of PIP and NO_PIP builds for various Python versions. Choose NO_PIP with the relevant Python version, which will look like:,TF_BUILD_IS_PIP=NO_PIP,TF_BUILD_PYTHON_VERSION=PYTHON2,label=gpu-linux/

Finally, append that with your relevant Tensorflow version .whl file. Now, install using:

Congratulations! This will install TF-Slim on your machine. To test your installation:

This should not throw any errors.

2.2) TensorFlow-Slim tutorial:

To import TF-Slim, just do this:

import tensorflow.contrib.slim as slim

These are the 3 most important features of TF-Slim, which we are going to cover now:

2.2.1. Quick Slim variables

2.2.2. Higher level layers

2.2.3. Arg scope and using it

Let’s get started. Shall we?

2.2.1. Quick Slim variables:

Let’s say you want to create a variable named ‘weight’ using TF-Slim, with a shape of [10, 10, 3, 3], with initial values drawn from a random distribution with a standard deviation of 0.1, with an l2_regularizer of scale 0.5, and placed on GPU 0. You can do all this with a single line of code using TF-Slim:


while the same will take multiple lines in vanilla Tensorflow code.

All the trainable variables in the model are called model variables in TF-Slim. Examples are the variables created by slim.conv2d and slim.fully_connected. Using slim, you can also create model variables directly with slim.model_variable.

To access model variables:

model_variables = slim.get_model_variables()

All the other variables are non-model variables, like the global_step variable. To access all the slim variables (model and non-model):

regular_variables_and_model_variables = slim.get_variables()

When you create a model variable via TF-Slim’s layers or directly via the slim.model_variable function, TF-Slim adds the variable to the tf.GraphKeys.MODEL_VARIABLES collection.

2.2.2. Higher level layers:

In order to create a convolutional layer using TF-slim which has 128 filters of size 3*3, this single line of code is sufficient. 

while the same with pure Tensorflow code will be very long: 


So, that’s simple. Now, if you have the same layer many times in a neural network, the code is cleaner, but can it get even simpler?
In this case, you can use repeat to do the same thing in a single line of code:

net = slim.repeat(net, 3, slim.conv2d, 256, [3, 3], scope='conv3')
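If you are wondering what repeat does under the hood, the plain-Python mimic below may help. It is not TF-Slim’s implementation: repeat here is my own toy function, and fake_conv2d is a hypothetical stand-in that records a description of each layer instead of building one. It does follow the scope naming that slim.repeat uses, where each call gets the scope name plus an underscore and the iteration number.

```python
def repeat(inputs, repetitions, layer, *args, **kwargs):
    """Apply `layer` to its own output `repetitions` times (toy mimic)."""
    scope = kwargs.pop("scope", "repeat")
    net = inputs
    for i in range(repetitions):
        # Every repetition gets its own numbered scope, e.g. conv3/conv3_1.
        layer_scope = "%s/%s_%d" % (scope, scope, i + 1)
        net = layer(net, *args, scope=layer_scope, **kwargs)
    return net


def fake_conv2d(net, num_outputs, kernel_size, scope):
    """Hypothetical stand-in for slim.conv2d: appends a layer description."""
    description = "%s: conv %dx%d, %d filters" % (
        scope, kernel_size[0], kernel_size[1], num_outputs)
    return net + [description]


# Same arguments as the slim.repeat call above, with the toy layer.
net = repeat([], 3, fake_conv2d, 256, [3, 3], scope="conv3")
for line in net:
    print(line)
```

So the single repeat call stands for three identical conv2d calls chained one after another, each with its own scope.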

What if your layers are almost the same except one or few parameters? 

Here, we had to build 3 fully connected layers with only 1 different parameter.  You can use stack like this: 

stack works the same way for conv layers, where each entry gives the number of filters and the kernel size of one layer.
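The behaviour of stack can be mimicked in plain Python as well. Again, this is a toy and not TF-Slim’s code: stack, fake_fully_connected and fake_conv2d are my own stand-ins, and the layer sizes are made-up values. The point is only to show that each entry of the list becomes the arguments of one layer call.

```python
def stack(inputs, layer, stack_args, scope="stack"):
    """Call `layer` once per entry of `stack_args`, chaining the outputs."""
    net = inputs
    for i, args in enumerate(stack_args, start=1):
        # A single value (e.g. 32) is treated as a one-element argument list.
        if not isinstance(args, (list, tuple)):
            args = (args,)
        net = layer(net, *args, scope="%s/%s_%d" % (scope, scope, i))
    return net


def fake_fully_connected(net, num_outputs, scope):
    """Hypothetical stand-in for slim.fully_connected."""
    return net + ["%s: fc, %d units" % (scope, num_outputs)]


def fake_conv2d(net, num_outputs, kernel_size, scope):
    """Hypothetical stand-in for slim.conv2d."""
    return net + ["%s: conv %dx%d, %d filters" % (
        scope, kernel_size[0], kernel_size[1], num_outputs)]


# Three fully connected layers that differ only in their number of units.
fc_net = stack([], fake_fully_connected, [32, 64, 128], scope="fc")

# Conv layers: each tuple is (number of filters, kernel size).
conv_net = stack([], fake_conv2d,
                 [(32, [3, 3]), (32, [1, 1]), (64, [3, 3])], scope="core")

print(fc_net)
print(conv_net)
```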

2.2.3. Arg scope and using it

In the following example, look at the part where it says “with tf.name_scope('conv1_1') as scope:”; this is TensorFlow using name_scope to keep all the variables/ops organized. Here, we are creating the 1st convolutional layer, so we have added ‘conv1_1’ as a prefix in front of all the variables. This helps us keep things organized, as all the weights in this layer are referred to as ‘conv1_1/weights’.
Building on top of this, TF-Slim uses arg_scope, which allows you to specify one or more operations and a set of arguments that will be passed to each of those operations defined within the arg_scope.
Look at the code below: we are creating 3 convolutional layers with different numbers of filters, i.e. 64, 128 and 256 respectively, with padding of SAME, VALID and SAME, and we use the same weights initializer and regularizer for all of them.
Let’s redefine these using TF-Slim arg_scope:
Note that the weights initializer, padding and regularizer have been defined within arg_scope for slim.conv2d. So, these arguments will be passed to every conv2d layer created using slim within the scope block. However, individual layers do have the option of overriding the defaults by declaring them locally; in this example, the second layer uses padding='VALID'.
Now we have covered all the important concepts of TF-Slim needed to build a neural network. Let’s use this learning to run some of the ImageNet pre-trained models in the next section.

3. Running Imagenet pretrained models:  

In this section of the TensorFlow tutorial, I shall demonstrate how easy it is to use trained models for prediction. Let’s take the inception_v1 and inception_v3 networks trained on the ImageNet dataset. You can find more ImageNet models here. Without changing anything in the network, we will run prediction on a few images, and you can find the code here. Both the pre-trained models are saved in the slim_pretrained folder:

The complete networks are kept in the nets folder, which contains the files that define the inception_v1 and inception_v3 networks, and we can build a network like this:

This will give the output of the last layer, which can be converted to probabilities using softmax:

probabilities = tf.nn.softmax(logits)
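If softmax is new to you, here is a small pure-Python sketch of what it computes and of how the most probable classes can then be picked out. The logits and class names are made-up values for a four-class toy problem, and softmax and top_predictions are my own helper functions, not part of TF-Slim.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(logits)  # subtracted for numerical stability
    exps = [math.exp(value - largest) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_predictions(probabilities, names, k=5):
    """Return the k most probable (class name, probability) pairs."""
    order = sorted(range(len(probabilities)),
                   key=lambda i: probabilities[i], reverse=True)
    return [(names[i], probabilities[i]) for i in order[:k]]


# Hypothetical network output and label mapping (made-up values).
logits = [2.0, 0.5, 4.0, 1.0]
names = {0: "cup", 1: "soup bowl", 2: "espresso", 3: "teapot"}

probabilities = softmax(logits)
best = top_predictions(probabilities, names, k=2)

for name, probability in best:
    print("%-10s %.3f" % (name, probability))
```

Softmax does not change the ranking of the classes; it only rescales the scores so that they can be read as confidences.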

Now that we have built our TensorFlow graph, the second step is to load the saved parameters into the network. For this, we shall use the assign_from_checkpoint_fn function, in which we shall specify the complete path of the saved pre-trained model and all the variables we want to restore. For the inception_v1 network, which has InceptionV1 as its scope, we shall get all the model variables using slim.get_model_variables('InceptionV1'). Once the graph is created and the network parameters are initialized from the saved model, we can run the network inside a session:

Finally, this will give out the indices of the labels, so we use the mapping inside the dataset folder to print out the labels with their probabilities and show the results. Let’s look at some of the results:

a) Inception_v1 result: Let’s run the prediction on this image of coffee:

Inception_v1 identifies this image as espresso; however, it is only 36% confident


Look at the other results; some of them are wrong but close, like soup bowl.

b) Inception_v3 result: Let’s run inception_v3, which is much larger in size and more accurate.


Prediction of inception_v3 on this image is spot-on and the model is very confident about it

This seems to be pretty good. Let’s run inception_v3 on the earlier image:

Look at the results now. The model is very confident this time, with a probability of 99%.