Complete CNN Forward Pass Notes¶
What is a CNN?¶
A Convolutional Neural Network (CNN) automatically learns features from images using:
- Convolution
- Activation
- Pooling
- Flattening
- Dense layers
1. Input¶
- Grayscale image of size: $$ N \times N $$
- Example: $$ \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{bmatrix} $$
2. Convolution¶
Process:¶
- Use a filter (kernel) of size: $$ F \times F $$
- Slide it over the image with stride $S$ (typically 1).
- For each position:
- Take the element-wise product: $$ \text{Region} \times \text{Kernel} $$
- Sum all products to get a scalar for that position in the feature map (see the worked example below).
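As a quick worked example (the $2 \times 2$ kernel here is hypothetical, chosen only for easy arithmetic): sliding $\begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$ over the $3 \times 3$ example image from section 1, the top-left position covers the region $\begin{bmatrix} 1 & 2 \\ 4 & 5 \end{bmatrix}$:
$$ 1 \cdot 1 + 2 \cdot 0 + 4 \cdot 0 + 5 \cdot 1 = 6 $$
Repeating this for the remaining three positions gives the feature map $\begin{bmatrix} 6 & 8 \\ 12 & 14 \end{bmatrix}$.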
Output Size:¶
$$ O = \left\lfloor \frac{N - F}{S} \right\rfloor + 1 $$
Example Calculation:¶
For a $3 \times 3$ image, $2 \times 2$ filter, stride 1:
$$
O = \left\lfloor \frac{3 - 2}{1} \right\rfloor + 1 = 2
$$
The feature map will be $2 \times 2$.
3. Activation (ReLU)¶
Apply: $$ \text{ReLU}(x) = \max(0, x) $$ to each element of the feature map to:
- Introduce non-linearity
- Zero out negative activations
4. Max Pooling¶
Purpose:¶
- Downsamples while retaining important features.
- Uses a pool size $P \times P$ with stride $S$:
- $S = P$ for non-overlapping pooling.
- $S = 1$ for overlapping pooling.
Operation:¶
- For each $P \times P$ block, take: $$ \max \left( \text{block values} \right) $$
Output Size:¶
$$ O = \left\lfloor \frac{N - P}{S} \right\rfloor + 1 $$
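For example, pooling the $3 \times 3$ ReLU feature map with $P = 2$ and $S = 1$ gives $O = \lfloor (3 - 2)/1 \rfloor + 1 = 2$, i.e. a $2 \times 2$ pooled map, matching the NumPy walkthrough at the end of these notes.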
5. Flattening¶
- Converts a $d \times d$ pooled feature map to: $$ \mathbb{R}^{d^2} $$
- Example: $$ \begin{bmatrix} 5 & 6 \\ 8 & 9 \end{bmatrix} \rightarrow [5, 6, 8, 9] $$
- Prepares for Dense layers.
6. Dense Layer (Fully Connected)¶
Maps the flattened vector to output classes:
- Uses: $$ y = W x + b $$ where:
- $ x $: flattened input
- $ W $: weight matrix
- $ b $: bias vector
Follow with an activation:¶
- ReLU for hidden layers.
- Softmax for multi-class output: $$ \text{Softmax}(z_i) = \frac{e^{z_i}}{\sum_j e^{z_j}} $$
- Sigmoid for binary output: $$ \sigma(x) = \frac{1}{1 + e^{-x}} $$
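As a quick numeric check (the input vector $z = [2, 1, 0]$ is arbitrary, chosen for illustration):
$$ \text{Softmax}([2, 1, 0]) = \frac{1}{e^2 + e^1 + e^0} \left[ e^2,\; e^1,\; e^0 \right] \approx [0.665,\; 0.245,\; 0.090] $$
The outputs sum to 1, so they can be read as class probabilities.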
Summary Flow:¶
Input Image
↓ Convolution (Feature Extraction)
↓ ReLU (Non-Linearity)
↓ Max Pooling (Downsampling)
↓ Flatten (1D Vector)
↓ Dense Layers (Classification)
Key Points:¶
- Convolution: Local pattern detection.
- ReLU: Adds non-linearity.
- Pooling: Reduces spatial dimensions.
- Flatten: Converts to 1D for Dense layers.
- Dense: Maps features to outputs.
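The NumPy walkthrough below implements this pipeline step by step on a $5 \times 5$ example image; the dense-layer step is sketched separately at the very end.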
import numpy as np
image = np.array([
    [1, 2, 0, 2, 1],
    [0, 1, 3, 1, 0],
    [2, 2, 1, 0, 1],
    [1, 0, 1, 3, 2],
    [0, 1, 2, 2, 1]
])
print(" Original Image:\n", image)
# Define a kernel (filter) for edge detection (simple vertical filter)
kernel = np.array([
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1]
])
 Original Image:
 [[1 2 0 2 1]
 [0 1 3 1 0]
 [2 2 1 0 1]
 [1 0 1 3 2]
 [0 1 2 2 1]]
# Output size with stride 1: O = N - F + 1
output_shape = (image.shape[0] - kernel.shape[0] + 1, image.shape[1] - kernel.shape[1] + 1)
feature_map = np.zeros(output_shape)
print(output_shape, feature_map)
(3, 3) [[0. 0. 0.]
 [0. 0. 0.]
 [0. 0. 0.]]
# Slide the kernel across the image: element-wise multiply and sum at each position
for i in range(output_shape[0]):
    for j in range(output_shape[1]):
        region = image[i:i+kernel.shape[0], j:j+kernel.shape[1]]
        feature_map[i, j] = np.sum(region * kernel)
print(feature_map)
[[ 1. -2. -2.]
 [ 2.  1. -2.]
 [ 1.  2.  0.]]
# ReLU: zero out negative activations element-wise
feature_map_relu = np.maximum(0, feature_map)
print(feature_map_relu)
[[1. 0. 0.]
 [2. 1. 0.]
 [1. 2. 0.]]
# Pooling parameters
filter_size = 2
stride = 1
# Calculate output shape for stride = 1
pooled_shape = (
    (feature_map_relu.shape[0] - filter_size) // stride + 1,
    (feature_map_relu.shape[1] - filter_size) // stride + 1
)
pooled = np.zeros(pooled_shape)
# Take the max of each 2x2 window
for i in range(pooled_shape[0]):
    for j in range(pooled_shape[1]):
        region = feature_map_relu[i:i+filter_size, j:j+filter_size]
        pooled[i, j] = np.max(region)
print("\n After Max Pooling with stride=1:\n", pooled)
 After Max Pooling with stride=1:
 [[2. 1.]
 [2. 2.]]
# Flatten the 2x2 pooled map into a 1-D vector for the dense layer
flattened = pooled.flatten()
print(flattened)
[2. 1. 2. 2.]
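To complete the forward pass, a dense layer can map the 4-element flattened vector to class scores. The sketch below is illustrative only: the 3 output classes, the random weights, and the zero bias are assumptions, not values from the notes above.
# Hypothetical dense layer: 3 output classes, random weights (illustration only)
rng = np.random.default_rng(0)
W = rng.standard_normal((3, flattened.shape[0]))  # weight matrix, shape (3, 4)
b = np.zeros(3)                                   # bias vector
logits = W @ flattened + b                        # y = Wx + b
# Softmax turns the raw scores into class probabilities
exp_scores = np.exp(logits - np.max(logits))      # subtract max for numerical stability
probs = exp_scores / exp_scores.sum()
print("Class probabilities:", probs)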