MNIST Dataset

The MNIST dataset (Modified National Institute of Standards and Technology) is a benchmark dataset of handwritten digits used extensively in machine learning and computer vision tasks.


Key Features

  • Total Samples: 70,000 images
    • 60,000 for training
    • 10,000 for testing
  • Image Size: 28 × 28 pixels (grayscale)
  • Input Shape: 784-dimensional vectors (28×28 flattened)
  • Labels: Digits from 0 to 9
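
The relationship between the 28 × 28 pixel grid and the 784-dimensional input vector can be illustrated with a small NumPy sketch. The array below is synthetic stand-in data rather than real MNIST pixels, and the variable names are illustrative assumptions.

```python
import numpy as np

# Synthetic stand-in for one grayscale image: 28 x 28 pixel intensities in [0, 255]
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(28, 28), dtype=np.uint8)

# Flatten row by row into a single 784-dimensional vector
flat = image.reshape(-1)

# Reshaping back recovers the original grid exactly
restored = flat.reshape(28, 28)

print(flat.shape)      # (784,)
print(restored.shape)  # (28, 28)
```

Because flattening is row-major, element 28 of the vector is the first pixel of the second image row.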

Why MNIST?

  • Ideal for learning classification, dimensionality reduction, and image processing
  • Small and fast for training simple ML models
  • Commonly used for benchmarking CNNs, PCA, SVM, etc.

In [ ]:
from sklearn.datasets import fetch_openml
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Download MNIST from OpenML as NumPy arrays (70,000 samples x 784 features)
mnist_data = fetch_openml('mnist_784', version=1, as_frame=False)

features = mnist_data.data
targets = mnist_data.target

# Random 85/15 split (not the canonical 60,000/10,000 split)
train_img, test_img, train_lbl, test_lbl = train_test_split(
    features, targets, test_size=0.15, random_state=42
)

# Fit the scaler on the training set only, then reuse its statistics on the test set
normalizer = StandardScaler()
train_img = normalizer.fit_transform(train_img)
test_img = normalizer.transform(test_img)
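
Standardization statistics are normally learned from the training split only. The test split is then transformed with those same statistics, so both sets share one scale and nothing about the test data leaks into preprocessing. A minimal sketch on synthetic data (the arrays and names here are illustrative assumptions, not part of the notebook):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic train/test data drawn from the same distribution
rng = np.random.default_rng(42)
X_train = rng.normal(loc=5.0, scale=2.0, size=(1000, 3))
X_test = rng.normal(loc=5.0, scale=2.0, size=(200, 3))

scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)  # learns mean_ and scale_ from the training data
X_test_std = scaler.transform(X_test)        # reuses the training statistics

# The same result computed by hand from the stored training statistics
manual = (X_test - scaler.mean_) / scaler.scale_
print(np.allclose(X_test_std, manual))  # True
```

Calling `fit_transform` on the test split instead would re-estimate the mean and standard deviation from the test data, so the two splits would be scaled inconsistently.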
In [ ]:
import matplotlib.pyplot as plt
import numpy as np

# Use the raw (unnormalized) pixels for visualization; train_img has already been standardized
# Reshape each 784-dimensional vector back to 28 x 28
some_images = features[:12].reshape(-1, 28, 28)
some_labels = targets[:12]

# Plot 12 samples in a grid
plt.figure(figsize=(10, 4))
for i in range(12):
    plt.subplot(2, 6, i + 1)
    plt.imshow(some_images[i], cmap='gray')
    plt.title(f"Label: {some_labels[i]}")
    plt.axis("off")

plt.suptitle("Sample MNIST Digits", fontsize=16)
plt.tight_layout()
plt.show()
In [ ]:
# A float n_components keeps just enough components to explain 95% of the variance
estimator = PCA(n_components=0.95)
In [ ]:
train_image_pca = estimator.fit_transform(train_img)
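
When `n_components` is a float between 0 and 1, scikit-learn keeps the smallest number of leading components whose cumulative explained variance exceeds that fraction. The sketch below checks this by hand. To stay small and offline it uses scikit-learn's bundled 8 × 8 `load_digits` data as a stand-in for MNIST, which is an assumption made for illustration, so the component count will differ from the MNIST result.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Small bundled 8 x 8 digits dataset, used here only as a stand-in for MNIST
X = StandardScaler().fit_transform(load_digits().data)

# Fit all components, then find the smallest k whose cumulative variance exceeds 95%
full = PCA().fit(X)
cumulative = np.cumsum(full.explained_variance_ratio_)
k_manual = int(np.searchsorted(cumulative, 0.95, side="right") + 1)

# Let scikit-learn choose the number of components from the variance threshold
auto = PCA(n_components=0.95).fit(X)

print("Manual count:", k_manual)
print("Selected by PCA:", auto.n_components_)
```

The two counts should agree, which shows that the threshold form is a shortcut for inspecting the cumulative variance curve yourself.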
In [ ]:
print(f"The number of features dropped from {train_img.shape[1]} to {train_image_pca.shape[1]}")
The number of features dropped from 784 to 330
In [ ]:
print(np.sum(estimator.explained_variance_ratio_))
print("Number of components selected:", estimator.n_components_)
0.9501533209153584
Number of components selected: 330
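
A natural follow-up is to project the held-out split with the components learned on the training split, and to check how much information the compression discards. The sketch below shows one way to do that, again using the bundled `load_digits` data as a stand-in so that it runs without a download. The variable names and the choice of mean squared error as the reconstruction metric are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Stand-in data: 8 x 8 digits flattened to 64 features
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, random_state=42
)

# Scale with training statistics only
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Learn the projection on the training split, then apply it to the test split
pca = PCA(n_components=0.95)
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)

# Map the compressed test vectors back to the original feature space
X_test_restored = pca.inverse_transform(X_test_pca)
mse = np.mean((X_test - X_test_restored) ** 2)

print(X_test.shape, "->", X_test_pca.shape)
print("Reconstruction MSE on the test set:", mse)
```

In the notebook itself the equivalent call would be `estimator.transform(test_img)`, which reuses the fitted components rather than fitting a second PCA on the test data.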