Complete CNN Forward Pass Notes¶

What is a CNN?¶

A Convolutional Neural Network (CNN) automatically learns features from images using a sequence of stages:

  • Convolution
  • Activation
  • Pooling
  • Flattening
  • Dense layers

1. Input¶

  • Grayscale image of size: $$ N \times N $$
  • Example: $$ \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{bmatrix} $$

2. Convolution¶

Process:¶

  • Use a filter (kernel) of size: $$ F \times F $$
  • Slide it over the image with stride S (typically 1).
  • For each position:
    • Take the element-wise product: $$ \text{Region} \times \text{Kernel} $$
    • Sum all the products to get a scalar for that position in the feature map.

Output Size:¶

$$ O = \left\lfloor \frac{N - F}{S} \right\rfloor + 1 $$

Example Calculation:¶

For a $3 \times 3$ image, a $2 \times 2$ filter, and stride 1: $$ O = \left\lfloor \frac{3 - 2}{1} \right\rfloor + 1 = 2 $$ The feature map will be $2 \times 2$.
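As a quick check, the formula can be wrapped in a small helper (conv_output_size is an illustrative name, not a library function):

In [ ]:
def conv_output_size(N, F, S=1):
    """Side length of a valid (no padding) convolution output."""
    return (N - F) // S + 1

print(conv_output_size(3, 2, 1))  # 2, matching the example above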


3. Activation (ReLU)¶

Apply: $$ \text{ReLU}(x) = \max(0, x) $$ to each element of the feature map to:

  • Introduce non-linearity
  • Zero out negative activations

4. Max Pooling¶

Purpose:¶

  • Downsamples the feature map while retaining the most important features.
  • Uses a pool size of $P \times P$ with stride $S$:
    • $S = P$ for non-overlapping pooling.
    • $S < P$ (e.g. $S = 1$) for overlapping pooling.

Operation:¶

  • For each $P \times P$ block, take: $$ \max \left( \text{block values} \right) $$

Output Size:¶

$$ O = \left\lfloor \frac{N - P}{S} \right\rfloor + 1 $$
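The stride choice changes the pooled size; a quick numerical check using the same formula (pool_output_size is an illustrative helper):

In [ ]:
def pool_output_size(N, P, S):
    return (N - P) // S + 1

# 3x3 feature map, 2x2 pool window
print(pool_output_size(3, 2, 2))  # S = P (non-overlapping): 1x1
print(pool_output_size(3, 2, 1))  # S = 1 (overlapping): 2x2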


5. Flattening¶

  • Converts a $d \times d$ pooled feature map into a vector in: $$ \mathbb{R}^{d^2} $$
  • Example: $$ \begin{bmatrix} 5 & 6 \\ 8 & 9 \end{bmatrix} \rightarrow [5, 6, 8, 9] $$
  • Prepares for Dense layers.

6. Dense Layer (Fully Connected)¶

Maps the flattened vector to output class scores:

  • Uses: $$ y = W x + b $$ where:
    • $x$: flattened input vector
    • $W$: weight matrix
    • $b$: bias vector

Follow with an activation (see the sketch after this list):¶

  • ReLU for hidden layers.
  • Softmax for multi-class output: $$ \text{Softmax}(z_i) = \frac{e^{z_i}}{\sum_j e^{z_j}} $$
  • Sigmoid for binary output: $$ \sigma(x) = \frac{1}{1 + e^{-x}} $$
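The Softmax and Sigmoid formulas above are easy to check numerically. A minimal sketch with made-up logits:

In [ ]:
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

z = np.array([2.0, 1.0, 0.1])  # illustrative logits
print(softmax(z))              # probabilities that sum to 1
print(sigmoid(0.0))            # 0.5 at the decision boundary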

Summary Flow:¶

Input Image
↓
Convolution (Feature Extraction)
↓
ReLU (Non-Linearity)
↓
Max Pooling (Downsampling)
↓
Flatten (1D Vector)
↓
Dense Layers (Classification)

Key Points:¶

  • Convolution: Local pattern detection.
  • ReLU: Adds non-linearity.
  • Pooling: Reduces spatial dimensions.
  • Flatten: Converts to 1D for Dense layers.
  • Dense: Maps features to outputs.

In [ ]:
import numpy as np

# 5x5 grayscale input image (values are illustrative pixel intensities)
image = np.array([
    [1, 2, 0, 2, 1],
    [0, 1, 3, 1, 0],
    [2, 2, 1, 0, 1],
    [1, 0, 1, 3, 2],
    [0, 1, 2, 2, 1]
])

print(" Original Image:\n", image)

# Define a kernel (filter) for edge detection (simple vertical filter)
kernel = np.array([
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1]
])
 Original Image:
 [[1 2 0 2 1]
 [0 1 3 1 0]
 [2 2 1 0 1]
 [1 0 1 3 2]
 [0 1 2 2 1]]
In [ ]:
# Output size with stride 1 and no padding: O = N - F + 1
output_shape = (image.shape[0] - kernel.shape[0] + 1, image.shape[1] - kernel.shape[1] + 1)
feature_map = np.zeros(output_shape)

print(output_shape, feature_map)
(3, 3) [[0. 0. 0.]
 [0. 0. 0.]
 [0. 0. 0.]]
In [ ]:
# Slide the kernel over the image; each output value is the sum of the
# element-wise product between the kernel and the region it covers
for i in range(output_shape[0]):
    for j in range(output_shape[1]):
        region = image[i:i+kernel.shape[0], j:j+kernel.shape[1]]
        feature_map[i, j] = np.sum(region * kernel)
print(feature_map)
[[ 1. -2. -2.]
 [ 2.  1. -2.]
 [ 1.  2.  0.]]
In [ ]:
# ReLU: replace negative activations with zero, element-wise
feature_map_relu = np.maximum(0, feature_map)
print(feature_map_relu)
[[1. 0. 0.]
 [2. 1. 0.]
 [1. 2. 0.]]
In [ ]:
# Pooling parameters
filter_size = 2
stride = 1

# Calculate output shape for stride = 1
pooled_shape = (
    (feature_map_relu.shape[0] - filter_size) // stride + 1,
    (feature_map_relu.shape[1] - filter_size) // stride + 1
)

pooled = np.zeros(pooled_shape)
In [ ]:
for i in range(pooled_shape[0]):
    for j in range(pooled_shape[1]):
        # Index by stride so the loop stays correct for strides other than 1
        region = feature_map_relu[i*stride:i*stride+filter_size,
                                  j*stride:j*stride+filter_size]
        pooled[i, j] = np.max(region)

print("\n After Max Pooling with stride=1:\n", pooled)
 After Max Pooling with stride=1:
 [[2. 1.]
 [2. 2.]]
In [ ]:
# Flatten the 2x2 pooled map into a length-4 vector for the Dense layer
flattened = pooled.flatten()
print(flattened)
[2. 1. 2. 2.]
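To complete the forward pass, a Dense layer with Softmax can map the flattened vector to class probabilities. The weights below are arbitrary illustrative values (a trained network would learn them), and the three output classes are hypothetical:

In [ ]:
# Dense layer: logits = W @ x + b, with made-up weights for illustration
rng = np.random.default_rng(0)
W = rng.standard_normal((3, flattened.size))  # 3 hypothetical classes x 4 features
b = np.zeros(3)

logits = W @ flattened + b

# Softmax turns logits into probabilities that sum to 1
probs = np.exp(logits - logits.max())
probs /= probs.sum()
print("Logits:", logits)
print("Class probabilities:", probs)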