
Understanding Architecture Of Inception Network & Applying It To A Real-World Dataset

Fun fact: The inception model takes its name from a famous internet meme

Alifia Ghantiwala
5 min read · May 21, 2022


Index Of Contents
· Introduction
· The architecture of Inception V1
· How does this architecture reduce dimensionality?
· What is different in the Inception V3 network from the inception V1 network?
· Using transfer learning (pre-trained inception network) on an image classification problem.
· References

You need a basic understanding of CNNs before studying the inception network. If you need to revisit the concept of convolutional neural networks, you can check my article on the same topic.


One way to build a more powerful deep neural network is to increase the number of layers in the network. This approach has two problems: adding layers may lead to overfitting, especially if you have limited labeled training data, and it increases the computational requirement.

Inception networks were created with the idea of increasing the capability of a deep neural network while efficiently using computational resources.

I quote the paper (Going Deeper with Convolutions) below:

We propose a deep convolutional neural network architecture codenamed “Inception”, which was responsible for setting the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC 2014). The main hallmark of this architecture is the improved utilization of the computing resources inside the network. This was achieved by a carefully crafted design that allows for increasing the depth and width of the network while keeping the computational budget constant.

Inception networks are released in versions, each version having some improvement over the previous ones. Let’s start our discussion with Inception Version 1 aka Inception V1.

The architecture of Inception V1

Consider two images of peacocks in which the bird occupies very different fractions of the frame. Because the area occupied by the object varies from image to image, selecting the right kernel size becomes a difficult choice. A large kernel captures information that is distributed more globally across the image, while a small kernel captures more local information.


The inception network architecture makes it possible to use filters of multiple sizes without increasing the depth of the network. The different filters are applied in parallel to the same input and their outputs are concatenated, instead of being stacked one after the other.


This is known as the naive version of the inception module. The problem with this version is its huge number of parameters. To mitigate this, the authors came up with a second version of the module that places 1X1 convolutions before the larger convolutions.
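The module with dimension reduction can be sketched in Keras roughly as follows. This is a minimal illustration, not the authors' code: the function name inception_module and its parameter names are my own, and the filter counts and the 28X28X192 input are the ones I recall the paper listing for its first inception block (3a), so treat them as assumptions.

```python
from tensorflow.keras import Input, Model, layers


def inception_module(x, f1, f3_reduce, f3, f5_reduce, f5, pool_proj):
    """Sketch of an inception module with 1x1 dimension reduction."""
    # Branch 1: plain 1x1 convolution
    b1 = layers.Conv2D(f1, 1, padding="same", activation="relu")(x)

    # Branch 2: 1x1 convolution to shrink the channels, then 3x3 convolution
    b2 = layers.Conv2D(f3_reduce, 1, padding="same", activation="relu")(x)
    b2 = layers.Conv2D(f3, 3, padding="same", activation="relu")(b2)

    # Branch 3: 1x1 convolution to shrink the channels, then 5x5 convolution
    b3 = layers.Conv2D(f5_reduce, 1, padding="same", activation="relu")(x)
    b3 = layers.Conv2D(f5, 5, padding="same", activation="relu")(b3)

    # Branch 4: 3x3 max pooling (stride 1), then 1x1 projection
    b4 = layers.MaxPooling2D(3, strides=1, padding="same")(x)
    b4 = layers.Conv2D(pool_proj, 1, padding="same", activation="relu")(b4)

    # All branches keep the spatial size, so they can be joined along channels
    return layers.Concatenate(axis=-1)([b1, b2, b3, b4])


# Assumed example sizes: a 28x28 feature map with 192 channels
inputs = Input(shape=(28, 28, 192))
outputs = inception_module(inputs, 64, 96, 128, 16, 32, 32)
module = Model(inputs, outputs)
print(module.output_shape)
```

Every branch uses "same" padding, so the four outputs share the same height and width and differ only in the number of channels. The concatenated output therefore has 64 + 128 + 32 + 32 = 256 channels.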


How does this architecture reduce dimensionality?

Adding a 1X1 convolution before a 5X5 convolution reduces the number of channels of the feature map that is fed to the 5X5 convolution, which in turn reduces the number of parameters and the computational requirement.

Let me explain with an example.
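Here is one way to work through the arithmetic. The sizes are assumptions chosen for illustration: a 28X28 input with 192 channels, a 5X5 convolution with 32 filters, and a 1X1 bottleneck that first reduces the input to 16 channels. The helper conv_cost is a hypothetical function of my own, it ignores biases, and it assumes "same" padding with stride 1.

```python
def conv_cost(height, width, in_channels, out_channels, kernel):
    """Return (weights, multiplications) for one convolution layer.

    Assumes 'same' padding and stride 1, so the output keeps the
    input's height and width. Biases are ignored.
    """
    weights = kernel * kernel * in_channels * out_channels
    multiplications = height * width * weights
    return weights, multiplications


# Direct 5x5 convolution: 192 channels -> 32 channels
direct = conv_cost(28, 28, 192, 32, 5)

# Bottleneck: 1x1 convolution (192 -> 16), then 5x5 convolution (16 -> 32)
reduce_step = conv_cost(28, 28, 192, 16, 1)
conv_step = conv_cost(28, 28, 16, 32, 5)
reduced = (reduce_step[0] + conv_step[0], reduce_step[1] + conv_step[1])

print("direct 5x5      :", direct)
print("1x1 then 5x5    :", reduced)
print("weights ratio   :", reduced[0] / direct[0])
print("multiplies ratio:", reduced[1] / direct[1])
```

With these assumed sizes, the direct 5X5 convolution needs 153,600 weights and about 120 million multiplications, while the version with the 1X1 bottleneck needs 15,872 weights and about 12.4 million multiplications, roughly a tenth of the cost for the same output shape.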


What is different in the Inception V3 network from the inception V1 network?

Inception V3 is an extension of the V1 module. It uses techniques like factorizing larger convolutions into smaller convolutions (say, a 5X5 convolution is factorized into two 3X3 convolutions) and asymmetric factorizations (for example, factorizing a 3X3 filter into a 1X3 filter and a 3X1 filter applied one after the other).
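To get a feel for the savings, the weights can be counted by hand. The sketch below assumes 64 input and 64 output channels for every layer (an arbitrary choice for illustration) and ignores biases. The helper conv_weights is a hypothetical function of my own.

```python
def conv_weights(kernel_h, kernel_w, in_channels, out_channels):
    """Number of weights in one convolution layer, ignoring biases."""
    return kernel_h * kernel_w * in_channels * out_channels


C = 64  # assumed channel count, kept the same for inputs and outputs

# One 5x5 convolution versus two stacked 3x3 convolutions
five_by_five = conv_weights(5, 5, C, C)
two_three_by_three = 2 * conv_weights(3, 3, C, C)

# One 3x3 convolution versus a 1x3 convolution plus a 3x1 convolution
three_by_three = conv_weights(3, 3, C, C)
asymmetric = conv_weights(1, 3, C, C) + conv_weights(3, 1, C, C)

print("5x5:", five_by_five, "-> two 3x3:", two_three_by_three)
print("3x3:", three_by_three, "-> 1x3 + 3x1:", asymmetric)
```

Per pair of input and output channels, two 3X3 convolutions use 18 weights instead of 25 (a 28% saving), and the 1X3 plus 3X1 pair uses 6 weights instead of 9 (a 33% saving). These ratios do not depend on the channel count as long as it stays the same across the layers.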

These factorizations are done with the aim of reducing the number of parameters being used at every inception module. Below is an image of the inception V3 module.


Using transfer learning (pre-trained inception network) on an image classification problem.

I will solve the same problem we solved with CNNs in the last article (link at the start of this article), so that we can compare the performance of a vanilla CNN with that of a pre-trained inception network.

In case you have not read the previous article, we are trying to classify images into 6 different classes. The training data is fairly balanced, and with a convolutional neural network we were able to achieve a validation accuracy of 77%.

Let us now use a pre-trained inception model, freeze its layers, and train only the new classification layers added on top, as below.

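# Code reference:
from tensorflow.keras.applications.inception_v3 import InceptionV3
from tensorflow.keras.optimizers import RMSprop
from tensorflow.keras import layers
from tensorflow.keras import Model

# Pre-trained ImageNet weights for InceptionV3 without the top classifier
local_weights_file = '../input/inception-weights/inception_v3_weights_tf_dim_ordering_tf_kernels_notop.h5'

# Build the base network without its classification head, then load the weights
pre_trained_model = InceptionV3(input_shape=(150, 150, 3),
                                include_top=False,
                                weights=None)
pre_trained_model.load_weights(local_weights_file)

# Freeze the pre-trained layers so that only the new head is trained
for layer in pre_trained_model.layers:
    layer.trainable = False

# Use the output of the 'mixed7' block as the extracted features
last_layer = pre_trained_model.get_layer('mixed7')
print('last layer output shape: ', last_layer.output.shape)
last_output = last_layer.output

# New classification head for the 6 classes
x = layers.Flatten()(last_output)
x = layers.Dense(1024, activation='relu')(x)
x = layers.Dropout(0.2)(x)
x = layers.Dense(6, activation='softmax')(x)

model = Model(pre_trained_model.input, x)

model.compile(optimizer=RMSprop(learning_rate=0.0001),
              loss='categorical_crossentropy',
              metrics=['acc'])

# train_generator is an assumed name for the training data generator
# from the previous article; validation_generator comes from there too
history = model.fit(train_generator,
                    epochs=10,
                    validation_data=validation_generator)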

We were able to get a validation accuracy of 90% by using the above architecture!

That is a fair improvement over the convolutional neural network we built last time:)


Next, I want to try a ResNet model on the same problem to see if we can improve the accuracy even further.

Thanks for reading along!