MNIST Dataset
The MNIST dataset (Modified National Institute of Standards and Technology) is a benchmark dataset of handwritten digits used extensively in machine learning and computer vision tasks.
Key Features
- Total Samples: 70,000 images
  - 60,000 for training
  - 10,000 for testing
- Image Size: 28 × 28 pixels (grayscale)
- Input Shape: 784-dimensional vectors (28×28 flattened)
- Labels: Digits from 0 to 9
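The relationship between a 28 × 28 image and its 784-dimensional input vector can be illustrated with a short sketch. It uses a synthetic NumPy array as a stand-in for a real digit; the names `image`, `vector`, and `restored` and the array values are assumptions for illustration, not MNIST data.

```python
import numpy as np

# Synthetic stand-in for one grayscale digit (not real MNIST data).
image = np.arange(28 * 28).reshape(28, 28)

# Flatten row by row into a single 784-dimensional feature vector.
vector = image.reshape(-1)

# Reshaping back recovers the original 2-D layout exactly.
restored = vector.reshape(28, 28)

print(vector.shape)
print(np.array_equal(image, restored))
```

Because flattening is row-major, element 28 of the vector is the first pixel of the second image row.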
Why MNIST?
- Ideal for learning classification, dimensionality reduction, and image processing
- Small and fast for training simple ML models
- Commonly used for benchmarking CNNs, PCA, SVM, etc.
In [ ]:
from sklearn.datasets import fetch_openml
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load MNIST as NumPy arrays (70,000 rows x 784 pixel columns)
mnist_data = fetch_openml('mnist_784', version=1, as_frame=False)
features = mnist_data.data
targets = mnist_data.target

# Hold out 15% of the samples for testing
train_img, test_img, train_lbl, test_lbl = train_test_split(
    features, targets, test_size=0.15, random_state=42
)

# Fit the scaler on the training split only, then reuse its statistics on the test split
normalizer = StandardScaler()
train_img = normalizer.fit_transform(train_img)
test_img = normalizer.transform(test_img)
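A common convention when standardizing is to learn the mean and standard deviation from the training split only and to reuse those statistics on the test split, so that both splits are scaled identically. The sketch below illustrates this with small synthetic arrays; the names `X_train`, `X_test`, and `scaler` and the random data are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the train/test splits (not MNIST).
rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
X_test = rng.normal(loc=5.0, scale=2.0, size=(50, 3))

scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)  # learns mean_ and scale_ from the training data
X_test_std = scaler.transform(X_test)        # reuses the training statistics, no refit

# Training columns are centered at zero; test columns are only approximately centered.
print(X_train_std.mean(axis=0))
print(X_test_std.mean(axis=0))
```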
In [ ]:
import matplotlib.pyplot as plt
import numpy as np

# Use the raw (unnormalized) pixels for visualization and reshape 784 → 28x28
some_images = features[:12].reshape(-1, 28, 28)
some_labels = targets[:12]

# Plot 12 samples in a 2 x 6 grid
plt.figure(figsize=(10, 4))
for i in range(12):
    plt.subplot(2, 6, i + 1)
    plt.imshow(some_images[i], cmap='gray')
    plt.title(f"Label: {some_labels[i]}")
    plt.axis("off")
plt.suptitle("Sample MNIST Digits", fontsize=16)
plt.tight_layout()
plt.show()
In [ ]:
# Keep the smallest number of principal components that explains at least 95% of the variance
estimator = PCA(0.95)
In [ ]:
train_image_pca = estimator.fit_transform(train_img)
In [ ]:
print('The number of features is now down from', train_img.shape[1], 'to', train_image_pca.shape[1])
The number of features is now down from 784 to 330
In [ ]:
print(np.sum(estimator.explained_variance_ratio_))
print("Number of components selected:", estimator.n_components_)
0.9501533209153584
Number of components selected: 330
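The output above reflects how a fractional `n_components` works: scikit-learn keeps the smallest number of components whose cumulative explained variance ratio exceeds the requested fraction. The following sketch reproduces that selection by hand on synthetic data; the data, the names `X`, `full_pca`, `k_manual`, and `pca_95`, and the explicit `svd_solver="full"` setting are assumptions for illustration, not part of the notebook above.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data with steadily decreasing variance per column (not MNIST).
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 20)) * np.linspace(10.0, 1.0, 20)

# Fit all components and accumulate the explained variance ratios.
full_pca = PCA(svd_solver="full").fit(X)
cumulative = np.cumsum(full_pca.explained_variance_ratio_)

# Smallest k whose cumulative ratio exceeds the 0.95 threshold.
k_manual = int(np.argmax(cumulative > 0.95)) + 1

# Let scikit-learn pick the number of components from the same threshold.
pca_95 = PCA(n_components=0.95, svd_solver="full").fit(X)

print("Manual choice:", k_manual)
print("PCA(0.95) choice:", pca_95.n_components_)
```

Both counts should agree, which is why the summed ratio reported for MNIST lands just above 0.95 rather than exactly on it.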