LSTM + Skip-gram Summarizer and Quiz Generator: Essential Knowledge


What is this project?

Build a system that:

  • Uses LSTM with Skip-gram embeddings.
  • Performs text summarization.
  • Generates fill-in-the-blank MCQs from the summaries.

Why Skip-gram Embeddings?

  • Converts words into dense, meaningful vectors.
  • Words with similar meanings have similar vectors.
  • Helps LSTM learn context efficiently.

Why LSTM?

  • Handles sequential data (text) with memory.
  • Useful for sequence-to-sequence tasks like summarization.
  • Learns long-range dependencies in text.

Why Encoder-Decoder LSTM?

  • Summarization needs variable-length outputs.
  • Encoder:
    • Reads the entire input sentence.
    • Converts it into hidden and cell states (context).
  • Decoder:
    • Uses encoder’s context to generate the summary word-by-word.
    • Output length is independent of the input length.

Analogy:

  • Encoder: Reading and understanding a paragraph.
  • Decoder: Explaining it in your own words.

Pipeline Recap

  1. Preprocess text: clean, tokenize, pad.
  2. Train/load Skip-gram Word2Vec embeddings.
  3. Build embedding matrix for your tokenizer vocabulary.
  4. Encoder-Decoder LSTM:
    • Encoder: Embedding + LSTM → states.
    • Decoder: Embedding + LSTM (with encoder states) → Dense softmax.
  5. Train with teacher forcing for sequence generation.
  6. Inference:
    • Use encoder to get states.
    • Use decoder to generate summaries one word at a time.
  7. Generate MCQs:
    • Extract key terms (nouns/entities) from summaries.
    • Replace with blanks for fill-in-the-blank questions.
  8. Export as JSON for quiz use.

Key Terms

  • Tokenization: Splitting text into words/tokens.
  • Embedding: Converting words into numeric vectors.
  • Cosine Similarity: Measures how similar two vectors are.
  • Teacher Forcing: Using true previous word during training.
  • Inference: Generating output using model’s own predictions.
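Of these terms, cosine similarity is the only one not exercised directly in the code below; a minimal NumPy sketch:

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors:
    # 1 = same direction, 0 = orthogonal, -1 = opposite.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])  # same direction, twice the length
print(cosine_similarity(a, b))  # ≈ 1.0 — only direction matters, not magnitude
```

This is the measure gensim's `wv.similarity` uses to compare word vectors.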

Why this project is valuable

  • Reinforces practical NLP (tokenization, embeddings).
  • Gives experience with sequence models (LSTM).
  • Builds a real, educational tool you can demo or deploy.
  • Covers the full pipeline, from data preparation to generation.

In [1]:
import numpy as np
import tensorflow as tf

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense
from tensorflow.keras.models import Model

%pip install gensim
from gensim.models import Word2Vec
Requirement already satisfied: gensim in /usr/local/lib/python3.12/dist-packages (4.3.3)
Requirement already satisfied: numpy<2.0,>=1.18.5 in /usr/local/lib/python3.12/dist-packages (from gensim) (1.26.4)
Requirement already satisfied: scipy<1.14.0,>=1.7.0 in /usr/local/lib/python3.12/dist-packages (from gensim) (1.13.1)
Requirement already satisfied: smart-open>=1.8.1 in /usr/local/lib/python3.12/dist-packages (from gensim) (7.3.0.post1)
Requirement already satisfied: wrapt in /usr/local/lib/python3.12/dist-packages (from smart-open>=1.8.1->gensim) (1.17.3)
In [2]:
import nltk
nltk.download('punkt')

Data Preparation

In [3]:
texts = [
    "Photosynthesis is the process by which plants make their food using sunlight.",
    "Mitochondria are the powerhouse of the cell and produce energy.",
    "Water boils at 100 degrees Celsius under normal conditions."
]

summaries = [
    "Plants make food from sunlight.",
    "Mitochondria produce energy.",
    "Water boils at 100 degrees."
]
In [4]:
tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts+summaries)

input_sequences = tokenizer.texts_to_sequences(texts)
target_sequences = tokenizer.texts_to_sequences(summaries)


max_input_len = max(len(seq) for seq in input_sequences)
max_target_len = max(len(seq) for seq in target_sequences)

encoder_input = pad_sequences(input_sequences, maxlen=max_input_len, padding='post')
decoder_input = pad_sequences(target_sequences, maxlen=max_target_len, padding='post')
In [5]:
print(tokenizer.word_index)
{'the': 1, 'plants': 2, 'make': 3, 'food': 4, 'sunlight': 5, 'mitochondria': 6, 'produce': 7, 'energy': 8, 'water': 9, 'boils': 10, 'at': 11, '100': 12, 'degrees': 13, 'photosynthesis': 14, 'is': 15, 'process': 16, 'by': 17, 'which': 18, 'their': 19, 'using': 20, 'are': 21, 'powerhouse': 22, 'of': 23, 'cell': 24, 'and': 25, 'celsius': 26, 'under': 27, 'normal': 28, 'conditions': 29, 'from': 30}
  • The Tokenizer assigns each word an integer index ordered by frequency in the fitted texts.
  • More frequent words get smaller indices (e.g. 'the' → 1).
  • Less frequent words get larger indices.
In [6]:
print(input_sequences)
print(target_sequences)
[[14, 15, 1, 16, 17, 18, 2, 3, 19, 4, 20, 5], [6, 21, 1, 22, 23, 1, 24, 25, 7, 8], [9, 10, 11, 12, 13, 26, 27, 28, 29]]
[[2, 3, 4, 30, 5], [6, 7, 8], [9, 10, 11, 12, 13]]
In [7]:
print(max_input_len,max_target_len)
12 5
In [8]:
print(encoder_input)
[[14 15  1 16 17 18  2  3 19  4 20  5]
 [ 6 21  1 22 23  1 24 25  7  8  0  0]
 [ 9 10 11 12 13 26 27 28 29  0  0  0]]
In [9]:
print(decoder_input)
[[ 2  3  4 30  5]
 [ 6  7  8  0  0]
 [ 9 10 11 12 13]]
  • Adds zeros to the end (padding='post') of each sequence so all are the same length.
  • Prepares batch-consistent arrays for your model.

Teacher forcing = giving the correct previous word to the decoder during training to help it learn sequence generation effectively.

In [10]:
decoder_target = np.zeros_like(decoder_input)
In [11]:
decoder_target[:, :-1] = decoder_input[:, 1:]
In [12]:
decoder_target[:, -1] = 0
In [13]:
print(decoder_target)
[[ 3  4 30  5  0]
 [ 7  8  0  0  0]
 [10 11 12 13  0]]
In [14]:
from nltk.tokenize import word_tokenize
In [15]:
nltk.download('punkt_tab')
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
Out[15]:
True
In [16]:
tokenized_texts = [nltk.word_tokenize(text.lower()) for text in texts + summaries]
w2v_model = Word2Vec(sentences=tokenized_texts, vector_size=50, window=2, min_count=1, sg=1, epochs=200)
In [17]:
print('cat' in w2v_model.wv)  # check whether a word exists in the vocabulary
False
In [18]:
vocab_size = len(tokenizer.word_index) + 1
embedding_dim = 50

embedding_matrix = np.zeros((vocab_size, embedding_dim))
for word, i in tokenizer.word_index.items():
    if word in w2v_model.wv:
        embedding_matrix[i] = w2v_model.wv[word]
    else:
        # Random init for out-of-vocabulary words: not strictly required, but better
        # than leaving them as zeros, which the model would treat like padding.
        embedding_matrix[i] = np.random.normal(scale=0.6, size=(embedding_dim,))
In [19]:
print(embedding_matrix.shape)
(31, 50)
In [20]:
encoder_inputs = Input(shape=(max_input_len,))
In [21]:
enc_emb = Embedding(vocab_size,embedding_dim,weights=[embedding_matrix],trainable=False)(encoder_inputs)
In [22]:
encoder_outputs, state_h, state_c = LSTM(128, return_state=True)(enc_emb)
In [23]:
encoder_states = [state_h, state_c]
In [24]:
print(encoder_states)
[<KerasTensor shape=(None, 128), dtype=float32, sparse=False, ragged=False, name=keras_tensor_3>, <KerasTensor shape=(None, 128), dtype=float32, sparse=False, ragged=False, name=keras_tensor_4>]
In [25]:
decoder_inputs = Input(shape=(max_target_len,))
In [26]:
dec_emb = Embedding(vocab_size, embedding_dim, weights=[embedding_matrix], trainable=False)(decoder_inputs)
In [27]:
decoder_lstm = LSTM(128, return_sequences=True, return_state=True)
In [28]:
decoder_outputs, _, _ = decoder_lstm(dec_emb, initial_state=encoder_states)
# dont need state_h and state_c
In [29]:
decoder_dense = Dense(vocab_size, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)
In [30]:
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
In [31]:
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
In [32]:
model.summary()
Model: "functional"
┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓
┃ Layer (type)        ┃ Output Shape      ┃    Param # ┃ Connected to      ┃
┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩
│ input_layer         │ (None, 12)        │          0 │ -                 │
│ (InputLayer)        │                   │            │                   │
├─────────────────────┼───────────────────┼────────────┼───────────────────┤
│ input_layer_1       │ (None, 5)         │          0 │ -                 │
│ (InputLayer)        │                   │            │                   │
├─────────────────────┼───────────────────┼────────────┼───────────────────┤
│ embedding           │ (None, 12, 50)    │      1,550 │ input_layer[0][0] │
│ (Embedding)         │                   │            │                   │
├─────────────────────┼───────────────────┼────────────┼───────────────────┤
│ embedding_1         │ (None, 5, 50)     │      1,550 │ input_layer_1[0]… │
│ (Embedding)         │                   │            │                   │
├─────────────────────┼───────────────────┼────────────┼───────────────────┤
│ lstm (LSTM)         │ [(None, 128),     │     91,648 │ embedding[0][0]   │
│                     │ (None, 128),      │            │                   │
│                     │ (None, 128)]      │            │                   │
├─────────────────────┼───────────────────┼────────────┼───────────────────┤
│ lstm_1 (LSTM)       │ [(None, 5, 128),  │     91,648 │ embedding_1[0][0… │
│                     │ (None, 128),      │            │ lstm[0][1],       │
│                     │ (None, 128)]      │            │ lstm[0][2]        │
├─────────────────────┼───────────────────┼────────────┼───────────────────┤
│ dense (Dense)       │ (None, 5, 31)     │      3,999 │ lstm_1[0][0]      │
└─────────────────────┴───────────────────┴────────────┴───────────────────┘

 Total params: 190,395 (743.73 KB)
 Trainable params: 187,295 (731.62 KB)
 Non-trainable params: 3,100 (12.11 KB)
In [33]:
model.fit([encoder_input, decoder_input], decoder_target[..., np.newaxis], epochs=200, batch_size=2)
Epoch 1/200
2/2 ━━━━━━━━━━━━━━━━━━━━ 4s 65ms/step - loss: 3.4336
Epoch 2/200
2/2 ━━━━━━━━━━━━━━━━━━━━ 0s 57ms/step - loss: 3.4154
...
Epoch 100/200
2/2 ━━━━━━━━━━━━━━━━━━━━ 0s 38ms/step - loss: 0.5683
...
Out[33]:
<keras.src.callbacks.history.History at 0x7d9bb4a4bf50>
In [34]:
# ==================== INFERENCE SETUP FOR SUMMARY GENERATION ====================

from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input

#  Build Encoder Inference Model
encoder_model = Model(encoder_inputs, encoder_states)

#  Build Decoder Inference Model
decoder_state_input_h = Input(shape=(128,))
decoder_state_input_c = Input(shape=(128,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]

# A fresh Embedding layer is fine here: it is initialized from the same frozen embedding_matrix.
dec_emb2 = Embedding(vocab_size, embedding_dim, weights=[embedding_matrix], trainable=False)(decoder_inputs)
decoder_outputs2, state_h2, state_c2 = decoder_lstm(dec_emb2, initial_state=decoder_states_inputs)
decoder_states2 = [state_h2, state_c2]
decoder_outputs2 = decoder_dense(decoder_outputs2)

decoder_model = Model([decoder_inputs] + decoder_states_inputs, [decoder_outputs2] + decoder_states2)

#  Define decode_sequence function
def decode_sequence(input_seq):
    states_value = encoder_model.predict(input_seq)

    target_seq = np.zeros((1, 1))
    target_seq[0, 0] = tokenizer.word_index.get('start', 1)  # this vocab has no 'start' token, so index 1 is used

    stop_condition = False
    decoded_sentence = ''

    while not stop_condition:
        output_tokens, h, c = decoder_model.predict([target_seq] + states_value)

        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        sampled_word = tokenizer.index_word.get(sampled_token_index)

        if sampled_word == 'end' or sampled_word is None or len(decoded_sentence.split()) >= max_target_len:
            stop_condition = True
        else:
            decoded_sentence += ' ' + sampled_word

            target_seq = np.zeros((1, 1))
            target_seq[0, 0] = sampled_token_index

            states_value = [h, c]

    return decoded_sentence.strip()

#  Test Cell: Generate Summary
test_text = """Photosynthesis allows plants to convert sunlight into food, producing oxygen as a byproduct and supporting life on Earth."""

test_seq = tokenizer.texts_to_sequences([test_text])
test_seq = pad_sequences(test_seq, maxlen=max_input_len, padding='post')

generated_summary = decode_sequence(test_seq)

print("Input Text:")
print(test_text)
print("\nGenerated Summary:")
print(generated_summary)

# ==================== END OF INFERENCE SETUP ====================
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 211ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 242ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 45ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 47ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 44ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 44ms/step
Input Text:
Photosynthesis allows plants to convert sunlight into food, producing oxygen as a byproduct and supporting life on Earth.

Generated Summary:
make food from sunlight
In [ ]:
# ==================== MCQ GENERATION BLOCK ====================

!pip install nltk transformers --quiet

import nltk
nltk.download('punkt')
from nltk.tokenize import sent_tokenize, word_tokenize
from transformers import pipeline
import random

#  Extract keywords and generate blanks
def generate_mcq_from_text(text, num_questions=5):
    sentences = sent_tokenize(text)
    words = word_tokenize(text)
    words = [w for w in words if w.isalpha() and len(w) > 4]
    words = list(set(words))

    if len(words) < num_questions:
        num_questions = len(words)

    selected_words = random.sample(words, num_questions)
    mcqs = []

    for word in selected_words:
        for sent in sentences:
            if word in word_tokenize(sent):  # match whole tokens, not substrings
                question = sent.replace(word, '______')
                mcqs.append({
                    'question': question,
                    'answer': word
                })
                break
    return mcqs

#  Distractor Generation using Masked Language Modeling (MLM)
fill_mask = pipeline('fill-mask', model='bert-base-uncased')

def add_distractors(mcqs, num_distractors=3):
    for mcq in mcqs:
        masked_sent = mcq['question'].replace('______', '[MASK]')
        predictions = fill_mask(masked_sent)
        distractors = []
        for pred in predictions:
            token = pred['token_str']
            if token.lower() != mcq['answer'].lower() and token.isalpha() and token not in distractors:
                distractors.append(token)
            if len(distractors) >= num_distractors:
                break
        mcq['options'] = distractors + [mcq['answer']]
        random.shuffle(mcq['options'])
    return mcqs

#  Usage Example
text = """
Photosynthesis allows plants to convert sunlight into food, producing oxygen as a byproduct and supporting life on Earth.
The mitochondria is the powerhouse of the cell, producing ATP for cellular activities.
Water boils at 100 degrees Celsius under normal atmospheric pressure.
"""

mcqs = generate_mcq_from_text(text, num_questions=3)
mcqs = add_distractors(mcqs)

#  Display MCQs
for idx, mcq in enumerate(mcqs, 1):
    print(f"\nQ{idx}: {mcq['question']}")
    for opt_idx, option in enumerate(mcq['options'], ord('A')):
        print(f"   {chr(opt_idx)}. {option}")
    print(f"Answer: {mcq['answer']}")

# ==================== END OF MCQ GENERATION BLOCK ====================
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu
Q1: Water boils at 100 degrees ______ under normal atmospheric pressure.
   A. c
   B. Celsius
   C. altitude
   D. elevation
Answer: Celsius

Q2: 
Photosynthesis allows ______ to convert sunlight into food, producing oxygen as a byproduct and supporting life on Earth.
   A. plants
   B. bacteria
   C. humans
   D. organisms
Answer: plants

Q3: 
Photosynthesis allows plants to convert ______ into food, producing oxygen as a byproduct and supporting life on Earth.
   A. energy
   B. water
   C. sunlight
   D. oxygen
Answer: sunlight
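Step 8 of the pipeline (export as JSON) is not implemented above; a minimal sketch, assuming the `mcqs` list-of-dicts shape that the MCQ block produces and a hypothetical output path `quiz.json`:

```python
import json

# Illustrative MCQs in the same shape the MCQ generation block produces.
mcqs = [
    {
        "question": "Water boils at 100 degrees ______ under normal atmospheric pressure.",
        "answer": "Celsius",
        "options": ["Celsius", "Fahrenheit", "Kelvin", "altitude"],
    }
]

# Write the quiz to disk so a front end can consume it.
with open("quiz.json", "w") as f:
    json.dump(mcqs, f, indent=2)

# Round-trip check: the quiz loads back unchanged.
with open("quiz.json") as f:
    loaded = json.load(f)
print(loaded[0]["answer"])  # → Celsius
```

Because the MCQs are plain dicts of strings and lists, they serialize to JSON directly with no custom encoder.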