What is Deep Learning?¶
Deep Learning:
- A subset of Machine Learning.
- Uses neural networks with multiple layers (deep architectures).
- Learns features and decision boundaries directly from data.
Why "Deep"?¶
- "Deep" = many hidden layers between input and output.
- More layers → ability to learn complex patterns and representations.
Key Components¶
Neural Networks:¶
- Composed of neurons (nodes) connected by weights.
- Layers:
- Input Layer: Takes features.
- Hidden Layers: Extract patterns.
- Output Layer: Provides predictions.
Activation Functions:¶
- Add non-linearity.
- Common:
  - `relu`
  - `sigmoid`
  - `tanh`
  - `softmax` (for multi-class classification)
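The activations listed above can be written in a few lines of NumPy. This is a minimal sketch for intuition, not any framework's actual implementation:

```python
import numpy as np

def relu(x):
    # max(0, x), elementwise: passes positives, zeroes out negatives
    return np.maximum(0.0, x)

def sigmoid(x):
    # squashes any real value into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # squashes any real value into (-1, 1)
    return np.tanh(x)

def softmax(x):
    # subtract the row max for numerical stability; each row sums to 1
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)
```

Note the max subtraction in `softmax`: it changes nothing mathematically but prevents `np.exp` from overflowing on large logits.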
Loss Functions:¶
- Measure prediction error.
- Examples:
  - `categorical_crossentropy` (multi-class)
  - `binary_crossentropy` (binary)
  - `mse` (regression)
Optimizers:¶
- Update weights to minimize loss.
- Examples:
  - `SGD`
  - `Adam`
  - `RMSprop`
Loss Functions¶
Loss functions measure how well your model predictions match true labels.
Classification:¶
- Binary Classification:
  - `binary_crossentropy`
  - Formula: $$ L = - [y \log(\hat{y}) + (1-y) \log(1-\hat{y})] $$
- Multi-class Classification:
  - `categorical_crossentropy` (one-hot labels)
  - `sparse_categorical_crossentropy` (integer labels)
Regression:¶
- Mean Squared Error (MSE):
- Measures average squared difference.
- $$ L = \frac{1}{n} \sum_{i} (y_i - \hat{y}_i)^2 $$
- Mean Absolute Error (MAE):
- Measures average absolute difference.
- $$ L = \frac{1}{n} \sum_{i} |y_i - \hat{y}_i| $$
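The losses above map directly to NumPy. A minimal sketch (the clipping constant `eps` is an implementation detail added here to avoid `log(0)`):

```python
import numpy as np

def binary_crossentropy(y, y_hat, eps=1e-12):
    # L = -[y log(y_hat) + (1-y) log(1-y_hat)], averaged over samples
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def mse(y, y_hat):
    # average squared difference
    return np.mean((y - y_hat) ** 2)

def mae(y, y_hat):
    # average absolute difference
    return np.mean(np.abs(y - y_hat))
```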
Gradient Descent (GD)¶
Gradient Descent:
- An optimization algorithm to minimize the loss.
- Computes:
$$
w := w - \eta \frac{\partial L}{\partial w}
$$
where:
- $ w $ = parameters/weights
- $ \eta $ = learning rate
- $ \frac{\partial L}{\partial w} $ = gradient of loss
Key Points:¶
- Takes steps in the negative gradient direction.
- Repeats until convergence (loss stops decreasing).
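The update rule above can be demonstrated on a toy one-dimensional loss. A sketch, assuming a fixed number of steps rather than a convergence check:

```python
import numpy as np

def gradient_descent(grad, w0, lr=0.1, steps=100):
    # w := w - eta * dL/dw, repeated for `steps` iterations
    w = w0
    for _ in range(steps):
        w = w - lr * grad(w)
    return w

# L(w) = (w - 3)^2  ->  dL/dw = 2 (w - 3); minimum at w = 3
w_star = gradient_descent(lambda w: 2 * (w - 3.0), w0=0.0)
```

Each step moves `w` a fraction of the way toward the minimum; the learning rate `eta` (here `lr=0.1`) controls that fraction.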
Stochastic Gradient Descent (SGD)¶
SGD:
- Variant of GD.
- Updates weights using one data sample at a time (or small batches).
Difference from GD:¶
| GD | SGD |
|---|---|
| Uses all data per update | Uses one sample per update |
| Stable but slow | Faster updates, more noise |
| Needs large memory | Memory efficient |
Mini-batch SGD:¶
Uses small batches (e.g., 32 or 64 samples) per update: a balance between the stability of GD and the speed of SGD.
Why SGD is used in Deep Learning?¶
- Faster convergence on large datasets.
- Adds noise to help escape local minima.
- Scalable for large models and data.
Negative Log-Likelihood (NLL)¶
What is it?¶
Negative Log-Likelihood (NLL) is a loss function that measures how well predicted probabilities match true labels.
Formula¶
For one-hot labels: $$ L = - \sum_{i} y_i \cdot \log(\hat{y}_i) $$
where:
- $ y_i $ = true label (1 for correct class, 0 otherwise),
- $ \hat{y}_i $ = predicted probability for class $ i $.
Same as `categorical_crossentropy` in multi-class classification.
Why use NLL?¶
- Penalizes low probabilities for correct labels heavily.
- Encourages high confidence for correct predictions.
- Smooth, differentiable, ideal for gradient-based optimization.
Where used?¶
- Classification tasks
- Language models
- Any probabilistic deep learning task requiring likelihood maximization
Minimize NLL → Maximize your model's confidence on correct classes.
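The NLL formula above, applied to a batch of predicted probabilities (a minimal sketch; `eps` is an added guard against `log(0)`):

```python
import numpy as np

def nll(y_onehot, probs, eps=1e-12):
    # L = -sum_i y_i * log(p_i), averaged over the batch
    return -np.mean(np.sum(y_onehot * np.log(probs + eps), axis=1))

probs = np.array([[0.7, 0.2, 0.1],    # sample 1: correct class gets p = 0.7
                  [0.1, 0.8, 0.1]])   # sample 2: correct class gets p = 0.8
y = np.array([[1, 0, 0],
              [0, 1, 0]])
loss = nll(y, probs)                  # -(log 0.7 + log 0.8) / 2
```

Because the labels are one-hot, only the probability assigned to the correct class contributes, and small probabilities there produce a large penalty via the log.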
Learning Rate, Momentum, Dropout, and Regularization¶
Learning Rate (LR)¶
- Controls step size during optimization.
- Too high → diverges.
- Too low → slow convergence.
- Typical values:
  `0.1`, `0.01`, `0.001`.
Tune using learning rate schedules, or use optimizers like Adam that adapt the learning rate automatically.
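The "too high → diverges, too low → slow" behavior can be seen directly on a quadratic loss. A toy sketch (thresholds chosen for illustration):

```python
import numpy as np

def run_gd(lr, w0=1.0, steps=50):
    # minimize L(w) = w^2 (gradient 2w) with a fixed learning rate
    w = w0
    for _ in range(steps):
        w = w - lr * 2 * w
    return w

w_low  = run_gd(lr=0.1)   # |1 - 2*lr| = 0.8 < 1  -> converges toward 0
w_high = run_gd(lr=1.1)   # |1 - 2*lr| = 1.2 > 1  -> oscillates and diverges
```

Each step multiplies `w` by `(1 - 2*lr)`, so whether that factor has magnitude below or above 1 decides convergence versus divergence.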
Momentum¶
- Helps accelerate gradients in relevant directions and dampens oscillations.
- Adds a fraction of previous update to the current update: $$ v_t = \beta v_{t-1} + (1 - \beta) \nabla L $$
- Typical `momentum` values: `0.9`, `0.99`.
Useful with SGD to speed up convergence.
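The momentum update above, using the same exponential-moving-average form as the formula (a sketch; the toy loss and hyperparameters are arbitrary):

```python
import numpy as np

def momentum_gd(grad, w0, lr=0.1, beta=0.9, steps=200):
    # v_t = beta * v_{t-1} + (1 - beta) * grad;  w := w - lr * v_t
    w, v = w0, 0.0
    for _ in range(steps):
        v = beta * v + (1 - beta) * grad(w)
        w = w - lr * v
    return w

# L(w) = (w - 3)^2, minimum at w = 3
w_star = momentum_gd(lambda w: 2 * (w - 3.0), w0=0.0)
```

The velocity `v` accumulates gradients across steps, so consistent gradient directions build up speed while oscillating components cancel out.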
Dropout¶
- Regularization technique to prevent overfitting.
- Randomly “drops” a fraction of neurons during training.
- Forces the network to learn redundant, robust representations.
✅ Typical dropout rates: 0.2, 0.5.
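Dropout is commonly implemented in its "inverted" form, sketched below: surviving activations are rescaled during training so nothing needs to change at test time (a minimal sketch, not a specific framework's implementation):

```python
import numpy as np

def dropout(x, rate, rng, training=True):
    # Inverted dropout: zero a fraction `rate` of units, scale survivors
    # by 1/(1 - rate) so the expected activation is unchanged.
    if not training or rate == 0.0:
        return x
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)

rng = np.random.default_rng(0)
x = np.ones((1000,))
out = dropout(x, rate=0.5, rng=rng)   # ~half zeros, survivors scaled to 2.0
```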
Regularization¶
Adds a penalty to the loss function to prevent overfitting.
Types:¶
- L1 Regularization (Lasso): adds $$ \lambda \sum |w_i| $$ Encourages sparsity (many weights → 0).
- L2 Regularization (Ridge): adds $$ \lambda \sum w_i^2 $$ Encourages smaller weights, smooths the model.
- Elastic Net: combines L1 + L2.
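The three penalties above as NumPy functions; in training, the chosen penalty is added to the data loss before computing gradients (a sketch; the `lam` values are arbitrary):

```python
import numpy as np

def l1_penalty(w, lam=0.01):
    # lambda * sum |w_i|  (Lasso): drives many weights exactly to 0
    return lam * np.sum(np.abs(w))

def l2_penalty(w, lam=0.01):
    # lambda * sum w_i^2  (Ridge): shrinks all weights toward 0
    return lam * np.sum(w ** 2)

def elastic_net_penalty(w, lam1=0.01, lam2=0.01):
    # combines both penalties
    return l1_penalty(w, lam1) + l2_penalty(w, lam2)

w = np.array([1.0, -2.0, 0.5])
total_loss = mse_data_loss = 0.42          # placeholder data loss for illustration
total_loss = mse_data_loss + l2_penalty(w) # penalty is added to the data loss
```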