What is Deep Learning?¶
Deep Learning:
- A subset of Machine Learning.
- Uses neural networks with multiple layers (deep architectures).
- Learns features and decision boundaries directly from data.
Why "Deep"?¶
- "Deep" = many hidden layers between input and output.
- More layers → ability to learn complex patterns and representations.
Key Components¶
Neural Networks:¶
- Composed of neurons (nodes) connected by weights.
- Layers:
- Input Layer: Takes features.
- Hidden Layers: Extract patterns.
- Output Layer: Provides predictions.
Activation Functions:¶
- Add non-linearity.
- Common:
  - `relu`
  - `sigmoid`
  - `tanh`
  - `softmax` (for multi-class classification)
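The activations listed above can be written in a few lines of NumPy. This is a minimal sketch for intuition, not any framework's actual implementation:

```python
import numpy as np

def relu(x):
    # max(0, x), elementwise: passes positives, zeroes out negatives
    return np.maximum(0.0, x)

def sigmoid(x):
    # squashes any real value into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # squashes any real value into (-1, 1)
    return np.tanh(x)

def softmax(x):
    # subtract the row max for numerical stability; each row sums to 1
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)
```

Note the max subtraction in `softmax`: it changes nothing mathematically but prevents `np.exp` from overflowing on large logits.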
Loss Functions:¶
- Measure prediction error.
- Examples:
  - `categorical_crossentropy` (multi-class)
  - `binary_crossentropy` (binary)
  - `mse` (regression)
Optimizers:¶
- Update weights to minimize loss.
- Examples:
  - `SGD`
  - `Adam`
  - `RMSprop`
Loss Functions¶
Loss functions measure how well your model predictions match true labels.
Classification:¶
- Binary Classification:
  - `binary_crossentropy`
  - Formula: $$ L = - [y \log(\hat{y}) + (1-y) \log(1-\hat{y})] $$
- Multi-class Classification:
  - `categorical_crossentropy` (one-hot labels)
  - `sparse_categorical_crossentropy` (integer labels)
Regression:¶
- Mean Squared Error (MSE):
- Measures average squared difference.
- $$ L = \frac{1}{n} \sum_{i} (y_i - \hat{y}_i)^2 $$
- Mean Absolute Error (MAE):
- Measures average absolute difference.
- $$ L = \frac{1}{n} \sum_{i} |y_i - \hat{y}_i| $$
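The losses above map directly to NumPy. A minimal sketch (the clipping constant `eps` is an implementation detail added here to avoid `log(0)`):

```python
import numpy as np

def binary_crossentropy(y, y_hat, eps=1e-12):
    # L = -[y log(y_hat) + (1-y) log(1-y_hat)], averaged over samples
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def mse(y, y_hat):
    # average squared difference
    return np.mean((y - y_hat) ** 2)

def mae(y, y_hat):
    # average absolute difference
    return np.mean(np.abs(y - y_hat))
```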
Gradient Descent (GD)¶
Gradient Descent:
- An optimization algorithm to minimize the loss.
- Computes:
$$
w := w - \eta \frac{\partial L}{\partial w}
$$
where:
- $ w $ = parameters/weights
- $ \eta $ = learning rate
- $ \frac{\partial L}{\partial w} $ = gradient of loss
Key Points:¶
- Takes steps in the negative gradient direction.
- Repeats until convergence (loss stops decreasing).
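The update rule above can be demonstrated on a toy one-dimensional loss. A sketch, assuming a fixed number of steps rather than a convergence check:

```python
import numpy as np

def gradient_descent(grad, w0, lr=0.1, steps=100):
    # w := w - eta * dL/dw, repeated for `steps` iterations
    w = w0
    for _ in range(steps):
        w = w - lr * grad(w)
    return w

# L(w) = (w - 3)^2  ->  dL/dw = 2 (w - 3); minimum at w = 3
w_star = gradient_descent(lambda w: 2 * (w - 3.0), w0=0.0)
```

Each step moves `w` a fraction of the way toward the minimum; the learning rate `eta` (here `lr=0.1`) controls that fraction.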
Stochastic Gradient Descent (SGD)¶
SGD:
- Variant of GD.
- Updates weights using one data sample at a time (or small batches).
Difference from GD:¶
| GD | SGD |
|---|---|
| Uses all data per update | Uses one sample per update |
| Stable but slow | Faster updates, more noise |
| Needs large memory | Memory efficient |
Mini-batch SGD:¶
Uses small batches (e.g., 32 or 64 samples) per update: a balance between the stability of GD and the speed of SGD.
Why SGD is used in Deep Learning?¶
- Faster convergence on large datasets.
- Adds noise to help escape local minima.
- Scalable for large models and data.
Negative Log-Likelihood (NLL)¶
What is it?¶
Negative Log-Likelihood (NLL) is a loss function that measures how well predicted probabilities match true labels.
Formula¶
For one-hot labels: $$ L = - \sum_{i} y_i \cdot \log(\hat{y}_i) $$
where:
- $ y_i $ = true label (1 for correct class, 0 otherwise),
- $ \hat{y}_i $ = predicted probability for class $ i $.
Same as `categorical_crossentropy` in multi-class classification.
Why use NLL?¶
- Penalizes low probabilities for correct labels heavily.
- Encourages high confidence for correct predictions.
- Smooth, differentiable, ideal for gradient-based optimization.
Where used?¶
- Classification tasks
- Language models
- Any probabilistic deep learning task requiring likelihood maximization
Minimize NLL → Maximize your model's confidence on correct classes.
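The NLL formula above, applied to a batch of predicted probabilities (a minimal sketch; `eps` is an added guard against `log(0)`):

```python
import numpy as np

def nll(y_onehot, probs, eps=1e-12):
    # L = -sum_i y_i * log(p_i), averaged over the batch
    return -np.mean(np.sum(y_onehot * np.log(probs + eps), axis=1))

probs = np.array([[0.7, 0.2, 0.1],    # sample 1: correct class gets p = 0.7
                  [0.1, 0.8, 0.1]])   # sample 2: correct class gets p = 0.8
y = np.array([[1, 0, 0],
              [0, 1, 0]])
loss = nll(y, probs)                  # -(log 0.7 + log 0.8) / 2
```

Because the labels are one-hot, only the probability assigned to the correct class contributes, and small probabilities there produce a large penalty via the log.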
Learning Rate, Momentum, Dropout, and Regularization¶
Learning Rate (LR)¶
- Controls step size during optimization.
- Too high → diverges.
- Too low → slow convergence.
- Typical values:
  `0.1`, `0.01`, `0.001`.
Tune using learning rate schedules, or use optimizers like Adam that adapt the learning rate automatically.
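The "too high → diverges, too low → slow" behavior can be seen directly on a quadratic loss. A toy sketch (thresholds chosen for illustration):

```python
import numpy as np

def run_gd(lr, w0=1.0, steps=50):
    # minimize L(w) = w^2 (gradient 2w) with a fixed learning rate
    w = w0
    for _ in range(steps):
        w = w - lr * 2 * w
    return w

w_low  = run_gd(lr=0.1)   # |1 - 2*lr| = 0.8 < 1  -> converges toward 0
w_high = run_gd(lr=1.1)   # |1 - 2*lr| = 1.2 > 1  -> oscillates and diverges
```

Each step multiplies `w` by `(1 - 2*lr)`, so whether that factor has magnitude below or above 1 decides convergence versus divergence.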
Momentum¶
- Helps accelerate gradients in relevant directions and dampens oscillations.
- Adds a fraction of previous update to the current update: $$ v_t = \beta v_{t-1} + (1 - \beta) \nabla L $$
- Typical `momentum` values: `0.9`, `0.99`.
Useful with SGD to speed up convergence.
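The momentum update above, using the same exponential-moving-average form as the formula (a sketch; the toy loss and hyperparameters are arbitrary):

```python
import numpy as np

def momentum_gd(grad, w0, lr=0.1, beta=0.9, steps=200):
    # v_t = beta * v_{t-1} + (1 - beta) * grad;  w := w - lr * v_t
    w, v = w0, 0.0
    for _ in range(steps):
        v = beta * v + (1 - beta) * grad(w)
        w = w - lr * v
    return w

# L(w) = (w - 3)^2, minimum at w = 3
w_star = momentum_gd(lambda w: 2 * (w - 3.0), w0=0.0)
```

The velocity `v` accumulates gradients across steps, so consistent gradient directions build up speed while oscillating components cancel out.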
Dropout¶
- Regularization technique to prevent overfitting.
- Randomly “drops” a fraction of neurons during training.
- Forces the network to learn redundant, robust representations.
✅ Typical dropout rates: 0.2, 0.5.
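Dropout is commonly implemented in its "inverted" form, sketched below: surviving activations are rescaled during training so nothing needs to change at test time (a minimal sketch, not a specific framework's implementation):

```python
import numpy as np

def dropout(x, rate, rng, training=True):
    # Inverted dropout: zero a fraction `rate` of units, scale survivors
    # by 1/(1 - rate) so the expected activation is unchanged.
    if not training or rate == 0.0:
        return x
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)

rng = np.random.default_rng(0)
x = np.ones((1000,))
out = dropout(x, rate=0.5, rng=rng)   # ~half zeros, survivors scaled to 2.0
```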
Regularization¶
Adds a penalty to the loss function to prevent overfitting.
Types:¶
- L1 Regularization (Lasso): adds $$ \lambda \sum |w_i| $$ Encourages sparsity (many weights → 0).
- L2 Regularization (Ridge): adds $$ \lambda \sum w_i^2 $$ Encourages smaller weights, smooths the model.
- Elastic Net: combines L1 + L2.
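The three penalties above as NumPy functions; in training, the chosen penalty is added to the data loss before computing gradients (a sketch; the `lam` values are arbitrary):

```python
import numpy as np

def l1_penalty(w, lam=0.01):
    # lambda * sum |w_i|  (Lasso): drives many weights exactly to 0
    return lam * np.sum(np.abs(w))

def l2_penalty(w, lam=0.01):
    # lambda * sum w_i^2  (Ridge): shrinks all weights toward 0
    return lam * np.sum(w ** 2)

def elastic_net_penalty(w, lam1=0.01, lam2=0.01):
    # combines both penalties
    return l1_penalty(w, lam1) + l2_penalty(w, lam2)

w = np.array([1.0, -2.0, 0.5])
total_loss = mse_data_loss = 0.42          # placeholder data loss for illustration
total_loss = mse_data_loss + l2_penalty(w) # penalty is added to the data loss
```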