Building Blocks of Computer Vision and CNN

Pruthiraj jayasingh
10 min read · Aug 4, 2020

CHAPTER 1

Since we are going to talk about vision, let us start with the eye. The biggest questions we can ask are: how are our eyes able to see things? What is the mechanism behind it? What happens inside the brain to identify an image? To understand the building blocks of a CNN, we need to understand this first.

The Human Eye

Internally, our eyes see things through rods and cones, shown in the figure. Rods are sensitive to low light intensity and can detect shades of grey (black and white); they can recognise shapes but not colours. Cones, on the other hand, are sensitive to high light intensity and detect colour, but they do not operate in poor light. They are sensitive to red, green and blue light. The combination of all these channels (the output from rods and cones) helps our brain (where all the processing happens) to see things.

Whenever we see an image, it effectively gets printed inside our head. The image then breaks down into edges and gradients. When specific neurons in our brain see these edges and gradients, they fire. Those neurons combine their responses to build textures and patterns, which help us recognise the image. In short, four operations happen in our brain after an image is printed: extracting edges and gradients, then textures and patterns, then parts of objects, and finally objects.

Image Dissection

In general, images have 3 channels, i.e. Red, Green and Blue (an image can have more than 3 channels, but to keep things simple we will stick to 3; RGB are the primary colours). A coloured image of 100 pixels will have the shape (10x10x3), where 3 is the number of channels and 10x10 is the grid of pixels. If we zoom into any picture we can actually count the pixels. In a CNN, every pixel provides some information at layer 1. Each channel carries the intensity of features (edges, gradients, textures, patterns), and the combination of all the channels makes up the image. Each channel (10x10x1) is a collection of features and gradients of the image, where a feature is the information about gradients and edges (gradients are the direction of change in colour intensity; edges are sudden changes in colour intensity).
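As a quick illustration (a minimal NumPy sketch with a made-up 10x10 RGB array, not a real photo), the shape described above looks like this:

import numpy as np

# a toy 10x10 RGB image: 100 pixels, each with 3 channel intensities (0-255)
image = np.random.randint(0, 256, size=(10, 10, 3), dtype=np.uint8)

print(image.shape)   # (10, 10, 3) -> height x width x channels
print(image[0, 0])   # the R, G, B intensities of the top-left pixel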

Overview of CNN

When a picture enters a CNN, the network first extracts edges and gradients. From those it extracts textures and patterns, from textures and patterns it extracts parts of objects, and finally it extracts whole objects. To do so we use kernels and the concept of the receptive field.

What is a Kernel?

The kernel is the dude of our class. In a CNN, a kernel is a matrix that extracts features from the input image. The kernel moves over the image, performing a dot product with a sub-region of the image pixel values (multiplying element-wise and summing) to get an output value. This process is known as convolution (we will cover convolution in more detail below). The kernel matrix is initialised randomly; it should not start as all 0s or all 1s, and it can contain negative numbers as well. Then, with the help of backpropagation, the numbers get optimised to extract features such as vertical edges, horizontal edges and gradients in the image. Different kernels detect different edges and gradients. Kernels are also used for blurring, sharpening and edge detection at different stages. A kernel moves over the image by the stride value provided.

The following code defines a few kernels that work as edge extractors. You can try these kernels on an image and see what each one extracts; a small usage sketch follows the definitions.

import numpy as np

horizontal_edge  = np.float32([[-1, -1, -1], [0, 0, 0], [1, 1, 1]])   # simple horizontal-edge kernel
sobel_horizontal = np.float32([[-1, -2, -1], [0, 0, 0], [1, 2, 1]])   # Sobel-style horizontal-edge kernel
variant_1        = np.float32([[-4, -1, -4], [0, 0, 0], [4, 1, 4]])   # stronger response at the corners
variant_2        = np.float32([[-4, -4, -4], [0, 0, 0], [4, 4, 4]])   # scaled-up horizontal-edge kernel
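To get a feel for what such a kernel extracts, here is a minimal sketch assuming SciPy is available; the input is a synthetic black-and-white image with one horizontal edge (scipy.signal.convolve2d flips the kernel, but for this illustration the distinction does not matter):

import numpy as np
from scipy.signal import convolve2d

# a synthetic grayscale image: dark top half, bright bottom half (one horizontal edge)
img = np.zeros((10, 10), dtype=np.float32)
img[5:, :] = 255.0

kernel = np.float32([[-1, -1, -1], [0, 0, 0], [1, 1, 1]])
edges = convolve2d(img, kernel, mode='valid')   # 10x10 input -> 8x8 output
print(np.abs(edges))   # large values appear only where the intensity suddenly changes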

A kernel can be of any size, e.g. 3x3, 5x5, 7x7 and so on. Most of the time we prefer 3x3 kernels, because by stacking multiple 3x3 kernels we can reproduce the effect of convolving with larger kernels. If we have a high-end GPU we can use a 7x7 or an 11x11 kernel (ResNet uses 7x7 in its first layer). The choice of kernel size ultimately depends on memory, image size and the desired receptive field.

If we apply a 3x3 kernel twice to get one final value, we use (3x3 + 3x3) = 18 weights, compared with 25 weights for a single 5x5 kernel. So with smaller kernels we get fewer weights and more layers. This makes 3x3 kernels computationally efficient, and the larger number of layers helps extract complex, non-linear features with ease. For example:

The effect of a 5x5 convolution is replicated by convolving an image twice with two 3x3 kernels; similarly, the effect of a 7x7 kernel is replicated by three 3x3 kernels, and so on. The area the kernel is able to see is called the receptive field (covered in the receptive field section below). A quick sketch of this equivalence appears below.
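A small check of this, assuming SciPy and using random values purely for illustration:

import numpy as np
from scipy.signal import convolve2d

img = np.random.rand(28, 28)
k3 = np.random.rand(3, 3)
k5 = np.random.rand(5, 5)

once_5x5  = convolve2d(img, k5, mode='valid')                                 # 28x28 -> 24x24
twice_3x3 = convolve2d(convolve2d(img, k3, mode='valid'), k3, mode='valid')   # 28x28 -> 26x26 -> 24x24

print(once_5x5.shape, twice_3x3.shape)   # both (24, 24): same 5x5 receptive field
print(k5.size, 2 * k3.size)              # 25 weights vs 18 weights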

Because of these salient properties of 3x3 kernels, many algorithms and GPU computations are optimised for 3x3 kernels.

What is Receptive Field ?

Another important concept is the receptive field. We need to understand it in depth to build a good CNN architecture. To start with, you can think of the receptive field as the part of an object a layer can see. Looking at the image below, we can hardly tell which part of which object it is. This happens because the receptive field is small; as soon as we increase the receptive field, we get an idea of the specific object.

Let’s play a small game to understand the concept better. Look at the animation and try to work out the actual image from it.

The initial frames do not give our brain any input to imagine the actual image. The later frames (higher receptive field) gradually help us reach the final image. In the very first slide the receptive field is small. As the receptive field increases, we are able to pick out some parts like fingers, eyes and hair. Gradually we realise that the image is of the Indian aerospace scientist Dr. Avul Pakir Jainulabdeen Abdul Kalam.

In the same way, if a CNN architecture is not able to see the features, textures or parts of objects, it will not be able to recognise the image. In the starting layers, since the receptive field is small, we do not have much information about the input image apart from small features and textures. So it is important for the later layers to see the parts of the image that help recognise it as a whole. This problem is solved by adding more layers to the network.

Link to read more: https://distill.pub/2019/computing-receptive-fields/

The receptive field can be further divided into 2 parts:

1> Local Receptive field

The local receptive field is present in every layer and is simply the size of the kernel used in that layer. For example, if we have an image of size 19x19 and we apply a 3x3 kernel, the local receptive field in the first layer is 3x3.

2> Global Receptive field

The global receptive field at a layer is the part of the original image that the layer can see. For a 3x3 convolution, the global receptive field increases by 2 units per layer (there is a mathematical formula that we can cover in later chapters). It means, as the table below shows, with every convolution step our model is able to see 2 more pixels of the image, one more on each side.

Input image | Kernel shape | Output image | Local receptive field | Global receptive field
19x19 | 3x3 | 17x17 | 3x3 | 3x3
17x17 | 3x3 | 15x15 | 3x3 | 5x5
15x15 | 3x3 | 13x13 | 3x3 | 7x7
13x13 | 3x3 | 11x11 | 3x3 | 9x9
11x11 | 3x3 | 9x9 | 3x3 | 11x11
9x9 | 3x3 | 7x7 | 3x3 | 13x13
7x7 | 3x3 | 5x5 | 3x3 | 15x15
5x5 | 3x3 | 3x3 | 3x3 | 17x17
3x3 | 3x3 | 1x1 | 3x3 | 19x19
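The table can be reproduced with a few lines of plain Python (assuming stride 1, no padding and a 3x3 kernel throughout):

def trace_receptive_field(image_size=19, kernel=3, layers=9):
    """Print output size and global receptive field after each 3x3 convolution."""
    size, rf = image_size, 1
    for layer in range(1, layers + 1):
        size = size - kernel + 1   # output shrinks by (kernel - 1) each layer
        rf = rf + (kernel - 1)     # global receptive field grows by 2 per 3x3 conv
        print(f"layer {layer}: output {size}x{size}, global receptive field {rf}x{rf}")

trace_receptive_field()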

What is convolution ?

Convolution is a mathematical operation. To perform a convolution we need 2 matrices (in our case, a part of the image and the kernel). Let’s focus on the image below.

Math of Convolution

We have taken a 6x6 part of an image here. Consider each cell in the spreadsheet as a pixel value of the image; an image of size 10x10 would have 100 such pixels. We use a kernel of size 3x3, which is randomly initialised (the kernel values can also be negative). We multiply this kernel element-wise with the part of the image it sits on and sum the products. This process is known as convolution. You can see the kernel sliding around the image and convolving (the sliding gap is known as the stride; here stride = 1). Each convolution step gives one value, which represents a 3x3 part of the image (its receptive field). In the end we get a 4x4 matrix after convolving a 6x6 image with a 3x3 kernel, as the sketch below shows.
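Here is a minimal NumPy sketch of the sliding-and-summing described above (stride 1, no padding; the image and kernel values are made up for illustration):

import numpy as np

def convolve2d_naive(image, kernel, stride=1):
    """Slide the kernel over the image, take element-wise products and sum them."""
    kh, kw = kernel.shape
    oh = (image.shape[0] - kh) // stride + 1
    ow = (image.shape[1] - kw) // stride + 1
    out = np.zeros((oh, ow), dtype=np.float32)
    for i in range(oh):
        for j in range(ow):
            patch = image[i*stride:i*stride+kh, j*stride:j*stride+kw]
            out[i, j] = np.sum(patch * kernel)   # one output value per 3x3 receptive field
    return out

image  = np.arange(36, dtype=np.float32).reshape(6, 6)    # a toy 6x6 "image"
kernel = np.float32([[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]])
print(convolve2d_naive(image, kernel).shape)               # (4, 4)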

Example: a 5x5 image convolved with a 3x3 kernel

Each time we perform a convolution with a 3x3 kernel, the size of the image is reduced by 2 in each dimension. So if we start with a 5x5 matrix and want to reach 1x1, we go 5x5 -> (3x3 convolution) -> 3x3 -> (3x3 convolution) -> 1x1 (assuming we always use a 3x3 kernel).

Convolution 5x5 image using 3x3 kernel

Now think: how many 3x3 convolution operations do we need to go from a 199x199 image to 1x1?

Calculation (assuming the image is black and white, i.e. a single channel):

199 x 199 x 1 | 3x3x1| 197 x 197 x 1

197 x 197 x 1 | 3x3x1| 195 x 195 x 1

195 x 195 x 1 | 3x3x1| 193 x 193 x 1

5 x 5 x 1 | 3x3x1| 3 x 3 x 1 …… 98th convolution

3 x 3 x 1 | 3x3x1| 1 x 1 x 1. …… 99th convolution

Detailed steps (each => indicates a 3x3 kernel convolution):

199x199 => 197x197 => 195x195 => 193x193 => 191x191 => 189x189 => 187x187 => 185x185 => 183x183 => 181x181 => 179x179 => 177x177 => 175x175 => 173x173 => 171x171 => 169x169 => 167x167 => 165x165 => 163x163 => 161x161 => 159x159 => 157x157 => 155x155 => 153x153 => 151x151 => 149x149 => 147x147 => 145x145 => 143x143 => 141x141 => 139x139 => 137x137 => 135x135 => 133x133 => 131x131 => 129x129 => 127x127 => 125x125 => 123x123 => 121x121 => 119x119 => 117x117 => 115x115 => 113x113 => 111x111 => 109x109 => 107x107 => 105x105 => 103x103 => 101x101 => 99x99 => 97x97 => 95x95 => 93x93 => 91x91 => 89x89 => 87x87 => 85x85 => 83x83 => 81x81 => 79x79 => 77x77 => 75x75 => 73x73 => 71x71 => 69x69 => 67x67 => 65x65 => 63x63 => 61x61 => 59x59 => 57x57 => 55x55 => 53x53 => 51x51 => 49x49 => 47x47 => 45x45 => 43x43 => 41x41 => 39x39 => 37x37 => 35x35 => 33x33 => 31x31 => 29x29 => 27x27 => 25x25 => 23x23 => 21x21 => 19x19 => 17x17 => 15x15 => 13x13 => 11x11 => 9x9 => 7x7 => 5x5 => 3x3 => 1x1
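The whole chain can also be generated in a couple of lines (again assuming a 3x3 kernel, stride 1 and no padding):

size, steps = 199, 0
while size > 1:
    size -= 2      # each 3x3 convolution removes one pixel from every side
    steps += 1
print(steps)       # 99 convolutions to go from 199x199 to 1x1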

What happens during CNN training (in brief)?

The kernel is first initialised with normally distributed random numbers. It then slides over the input image and, at each position, multiplies its values with the pixels underneath and sums the result. With every layer the kernels learn new features: in the initial layers the CNN learns about edges and gradients, in the middle layers about textures and patterns, and in the later layers about parts of objects and whole objects. The CNN can see entire objects once it reaches the required global receptive field.
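As a rough illustration of this loop (using PyTorch purely as an example; the input image and the target feature map here are random tensors, not real data), a randomly initialised kernel gets nudged by backpropagation roughly like this:

import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3, bias=False)  # kernel starts as random numbers

image  = torch.randn(1, 1, 19, 19)    # toy input image
target = torch.randn(1, 1, 17, 17)    # toy target feature map (illustration only)

optimizer = torch.optim.SGD(conv.parameters(), lr=0.01)
for step in range(100):
    optimizer.zero_grad()
    output = conv(image)                    # kernel slides over the image
    loss = ((output - target) ** 2).mean()
    loss.backward()                         # backpropagation: gradients w.r.t. the kernel values
    optimizer.step()                        # kernel values are nudged to reduce the loss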

Image from cnn-explainer

Get a feel for CNNs and play with one here: https://poloclub.github.io/cnn-explainer/

For example, consider classifying handwritten digits. The layers of the CNN then look as follows.

Input layer:

We provide training images and labels to the first layer. From here we start the convolution operations on the input images. During convolution, the kernel moves over the image, multiplying its values with a sub-region of the image pixels and summing them to get the output. The kernel matrix is randomly initialised.

Hidden layers:

The initial layers extract edges and gradients, followed by textures and patterns in the middle layers, while the later layers learn parts of objects and complete objects. There is much more here, such as pooling, activations, 1x1 kernels, padding and strides; I will cover them in upcoming posts.

Output layer:

If we are classifying 10 classes, the output layer will be a vector of size 10. This vector is a one-hot vector. For example, in the MNIST dataset the label 4 is represented as [0,0,0,0,1,0,0,0,0,0]. There is much more to this part (activations in depth, epochs, batches, validation, etc.), but we will cover it in upcoming chapters.
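A tiny NumPy sketch of that one-hot label:

import numpy as np

def one_hot(label, num_classes=10):
    vec = np.zeros(num_classes, dtype=np.int64)
    vec[label] = 1     # set only the position of the class label to 1
    return vec

print(one_hot(4))      # [0 0 0 0 1 0 0 0 0 0]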

To get a feel for it, please visit this website, which helps visualise these concepts: https://www.cs.ryerson.ca/~aharley/vis/conv/

Reference:

https://distill.pub/2019/computing-receptive-fields/

https://www.cs.ryerson.ca/~aharley/vis/conv/

https://poloclub.github.io/cnn-explainer/

https://socratic.org/questions/how-would-you-compare-and-contrast-the-rod-cells-and-the-cone-cells-in-the-retin
