Data Science

How Does A Deep Learning Network See?

A Beginner's Guide To Convolutional Neural Networks (CNNs)

Alifia Ghantiwala


Photo by Ion Fet on Unsplash
Index Of Contents
· Introduction
· What does the convolution layer do?
· What is a convolution operation?
· What are the hyperparameters in a convolution layer?
· What about the weights in the filter?
· Code in action!
· References

Introduction

Computer vision in today's day and time can provide machines with a powerful capacity for visual perception. It helps us automate multiple tasks, such as CCTV monitoring in a supermarket or analyzing images made available by a satellite.

Convolutional Neural Networks (CNNs) are a type of artificial neural network that aids in detecting patterns hidden within an image (thus providing a computer with vision!). A CNN differs from the basic multi-layer perceptron in that it has a hidden layer known as the convolution layer. What the convolution layer does is find patterns within your image. But how? Let's find out.

What does the convolution layer do?

Convolution layers are made up of multiple filters; the size of each filter and the number of filters in a layer are hyperparameters that can be tuned.

Filters detect patterns. Consider the 6X6 image below on the extreme left; to its right is a 3X3 filter that acts as a vertical edge detector. After the convolution operation between the filter and the input image, the output image does recognize a vertical edge at its center.

Source

Similarly, we have filters that detect horizontal edges, circles, squares, and so on.

Filters in the earlier layers of a convolutional neural network detect the basic structure of an image, like edges. Further into the network, the later layers detect patterns sophisticated enough to tell apart a male face from a female's.

What is a convolution operation?

To convolve means to combine. The operation is a simple dot product of the image with the filter: the filter slides over the input image, and at each position the dot product of the filter and the image patch beneath it becomes one cell of the output image.

Let me show you a visual representation of my explanation above. The demo below has been sourced from Deeplizard.

The filter used in the above demo was a 3X3 filter detecting top edges; if you notice, the orange pixels highlight exactly that, whereas the blue pixels are negative activations detecting the opposite, i.e. bottom edges.
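To make the operation concrete, here is a minimal NumPy sketch of the convolution (strictly speaking, cross-correlation, which is what deep learning frameworks implement). The helper name and the toy image are mine, mirroring the vertical-edge example above:

import numpy as np

def conv2d(image, kernel, stride=1):
    """Slide the kernel over the image; each dot product becomes one output cell."""
    k = kernel.shape[0]
    out = (image.shape[0] - k) // stride + 1
    result = np.zeros((out, out))
    for i in range(out):
        for j in range(out):
            patch = image[i*stride:i*stride+k, j*stride:j*stride+k]
            result[i, j] = np.sum(patch * kernel)
    return result

# A 6X6 image with a sharp vertical edge down the middle
image = np.array([[10, 10, 10, 0, 0, 0]] * 6, dtype=float)
# A 3X3 vertical edge detector
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]], dtype=float)
print(conv2d(image, kernel))  # strong activations in the center columns, where the edge lies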

What are the hyperparameters in a convolution layer?

As discussed in the previous sections, filter size and the number of filters are hyperparameters; another hyperparameter is the stride.

Stride determines the step size by which you slide your filter over the input. A stride of 1 means the filter moves one pixel at a time; a stride of 2 means it moves two pixels with every step.

Source

Increasing the stride reduces the dimensions of the feature map/output and also reduces the number of convolution operations performed (useful when computational power is a constraint).
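As a quick worked example using the standard output-size formula (we will meet it formally later in the article): convolving a 6X6 image with a 3X3 filter and no padding gives a 4X4 output at stride 1, but only a 2X2 output at stride 2.

# Output size = (W - K) // S + 1 for input width W, kernel size K, stride S, no padding
for stride in (1, 2):
    print(stride, (6 - 3) // stride + 1)  # stride 1 -> 4, stride 2 -> 2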

Changing dimensionality at each layer might be a problem while setting up the network. You can avoid this by using padding.

Padding is a technique to increase the dimensions of the feature map (hence it is often used to counteract the effect of stride). One type of padding is zero padding: for example, if you need to go from 3X3 to 5X5 using zero padding, one row of zeros is added at the top and bottom of the image and one column at the left and right.

Something like the below.

Image Source: Author
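Here is a minimal NumPy sketch of the same idea (np.pad is my choice for illustration; in Keras this is handled by ZeroPadding2D or padding='same'):

import numpy as np

a = np.arange(1, 10).reshape(3, 3)  # a 3X3 feature map
padded = np.pad(a, pad_width=1)     # one row/column of zeros on every side
print(padded.shape)                 # (5, 5)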

What about the weights in the filter?

With the advancement of deep learning, it is no longer necessary for a computer vision specialist to hand-set the values of the filters used in the convolution layer. To begin with, the filter is initialized with random values. With every epoch that the neural network trains through backpropagation, the weights are optimized so that the network can learn the patterns within the image.
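You can verify this in Keras: a freshly built Conv2D layer already holds a randomly initialized, trainable kernel (a small illustrative check of my own, not from the original article):

import tensorflow as tf

conv = tf.keras.layers.Conv2D(32, 3)
conv.build(input_shape=(None, 28, 28, 1))  # creates the weights
print(conv.kernel.shape)      # (3, 3, 1, 32): 32 random 3X3 filters, 1 input channel
print(conv.kernel.trainable)  # True: updated by backpropagation every epoch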

Code in action!

Let’s code a CNN in TensorFlow Keras to deepen our understanding of the topic.

We will be using the MNIST data available on Kaggle.
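Assuming the standard Kaggle digit-recognizer layout (one label column plus 784 pixel columns; the file path here is my assumption), loading the data looks like this:

import pandas as pd

data = pd.read_csv("train.csv")  # assumed path to the Kaggle digit-recognizer training file
print(data.shape)                # (42000, 785): the label column plus 28*28 pixel columns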

Let's do some basic EDA.

Checking the balance of target variables in the train data.

import seaborn as sns

sns.countplot(x=data["label"])  # distribution of the ten digit classes
Image Source: Author

Checking for null values in the dataset

data.isnull().any().describe()
Image Source: Author

Seems like we are good to proceed.

We will first normalize the images; normalization helps the CNN converge faster. I found a very good explanation of the reason, and I would suggest you go through this article on Quora.

X = data.drop("label", axis=1) / 255.0  # separate the pixels from the target, then scale to [0, 1]

Each image is of dimension 28X28, so we reshape the flat pixel values into the input dimensions the network expects using the line below.

X = X.values.reshape(-1,28,28,1)

Next, we convert the label values to a one-hot matrix using the to_categorical function available in tf.keras.utils.

Y_train = to_categorical(data["label"])
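For instance, to_categorical turns the label 3 into a one-hot row of length 10:

from tensorflow.keras.utils import to_categorical

print(to_categorical([3], num_classes=10))
# [[0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]]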

Now we are ready to define our model and use it for training and predictions.

#Code reference: https://www.kaggle.com/code/ayns123/digit-recognizer-with-cnn-acc-98
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPool2D, ZeroPadding2D, Flatten, Dense

model = Sequential([
    Conv2D(32, 3, activation='relu', input_shape=(28, 28, 1)),
    MaxPool2D(pool_size=2, strides=2),
    ZeroPadding2D(padding=(2, 2)),
    Conv2D(64, 3, activation='relu'),
    MaxPool2D(pool_size=2, strides=2),
    Flatten(),
    Dense(128, activation="relu"),
    Dense(10, activation="softmax")
])

I want to explain the output shapes after each layer in the model; let's check the model summary first.

model.summary()
Image Source: Author

We start with our input image of 28X28X1 (1 is the number of channels; since the MNIST data is grayscale we have only one channel, whereas for RGB images the number of channels is 3).

Next, we have the convolution layer with 32 filters of size 3X3, whose output is a 4D array (None, 26, 26, 32). The first element signifies the batch size, which we have not defined and hence is None; the second and third elements signify the height and width of the output.

The formula for its calculation (reference: https://stackoverflow.com/questions/53580088/calculate-the-output-size-in-convolution-layer) is:

[(W−K+2P)/S]+1

where:

  • W is the input volume — in our case 28
  • K is the Kernel size — in our case 3
  • P is the padding — in our case 0
  • S is the stride — not provided.

So, plugging into the formula:

Output_Size = (28−3+0)/1 + 1 = 26
Output_Shape = (26, 26, 32)

NOTE: Stride defaults to 1 if not provided and the 32 in (26,26,32) is the number of filters provided by the user.
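As a quick check, here is a tiny helper of my own that evaluates the formula; max pooling, which we discuss next, obeys the same arithmetic:

def conv_output_size(W, K, P=0, S=1):
    """Spatial output size of a convolution or pooling layer: [(W - K + 2P) / S] + 1."""
    return (W - K + 2 * P) // S + 1

print(conv_output_size(28, 3))       # 26: the first Conv2D layer
print(conv_output_size(26, 2, S=2))  # 13: the first MaxPool2D layer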

After that we have a MaxPooling layer, which halves the height and width since we are using a 2X2 window and a stride of 2: (26−2)/2 + 1 = 13, so the feature map becomes (None, 13, 13, 32). If you have not understood how MaxPooling shrinks the size, I would encourage you to work through it on a piece of paper; you should be able to solve it.

If you notice, we use a Flatten layer before the dense layers. Why is that so?

The dense layer expects a 2D input → (batch_size, number of units), whereas the Conv2D and pooling layers output a 4D array, which is why we need a Flatten layer before using a dense layer. In our model, the final (None, 7, 7, 64) pooling output is flattened to (None, 3136).
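The training call itself is not shown in the excerpt above; here is a minimal sketch of how such a model is typically compiled and fit (the optimizer, epoch count, batch size, and validation split are my assumptions, not necessarily the original notebook's settings):

model.compile(optimizer="adam",
              loss="categorical_crossentropy",  # matches the one-hot labels and softmax output
              metrics=["accuracy"])
model.fit(X, Y_train, validation_split=0.1, epochs=10, batch_size=64)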

Using the above model, we were able to achieve an accuracy of 98.67%, which is not bad but can be further improved with some tweaks.

Image Source: Author

References

If you like the article, I would appreciate it if you would leave a clap; in some silly way, it keeps me motivated to keep sharing my learnings!

Thanks for reading along.
