LSTM + Skip-gram Summarizer and Quiz Generator: Essential Knowledge¶
What is this project?¶
Build a system that:
- Uses LSTM with Skip-gram embeddings.
- Performs text summarization.
- Generates fill-in-the-blank MCQs from the summaries.
Why Skip-gram Embeddings?¶
- Converts words into dense, meaningful vectors.
- Words with similar meanings have similar vectors.
- Helps LSTM learn context efficiently.
Why LSTM?¶
- Handles sequential data (text) with memory.
- Useful for sequence-to-sequence tasks like summarization.
- Learns long-range dependencies in text.
Why Encoder-Decoder LSTM?¶
- Summarization needs variable-length outputs.
- Encoder:
- Reads the entire input sentence.
- Converts it into hidden and cell states (context).
- Decoder:
- Uses encoder’s context to generate the summary word-by-word.
- Outputs sequence independent of input length.
Analogy:
- Encoder: Reading and understanding a paragraph.
- Decoder: Explaining it in your own words.
Pipeline Recap¶
- Preprocess text: clean, tokenize, pad.
- Train/load Skip-gram Word2Vec embeddings.
- Build embedding matrix for your tokenizer vocabulary.
- Encoder-Decoder LSTM:
- Encoder: Embedding + LSTM → states.
- Decoder: Embedding + LSTM (with encoder states) → Dense softmax.
- Train with teacher forcing for sequence generation.
- Inference:
- Use encoder to get states.
- Use decoder to generate summaries one word at a time.
- Generate MCQs:
- Extract key terms (nouns/entities) from summaries.
- Replace with blanks for fill-in-the-blank questions.
- Export as JSON for quiz use.
Key Terms¶
- Tokenization: Splitting text into words/tokens.
- Embedding: Converting words into numeric vectors.
- Cosine Similarity: Measures how similar two vectors are.
- Teacher Forcing: Using true previous word during training.
- Inference: Generating output using model’s own predictions.
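Cosine similarity, for instance, is just the dot product of two vectors divided by the product of their lengths; a minimal numpy sketch (the vectors here are made up for illustration):

```python
import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|); 1.0 = same direction, 0.0 = orthogonal.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])   # same direction as a, different magnitude
c = np.array([-3.0, 0.0, 1.0])  # exactly orthogonal to a

print(round(cosine_similarity(a, b), 3))  # 1.0
print(round(cosine_similarity(a, c), 3))  # 0.0
```

Because the measure ignores magnitude, a word vector scaled up or down keeps the same similarity to its neighbours.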
Why this project is valuable¶
- Reinforces practical NLP (tokenization, embeddings).
- Gives experience with sequence models (LSTM).
- Builds a real, educational tool you can demo or deploy.
- Covers the full pipeline, from data preparation through generation.
In [1]:
import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense
from tensorflow.keras.models import Model
%pip install gensim
from gensim.models import Word2Vec
Requirement already satisfied: gensim in /usr/local/lib/python3.12/dist-packages (4.3.3)
Requirement already satisfied: numpy<2.0,>=1.18.5 in /usr/local/lib/python3.12/dist-packages (from gensim) (1.26.4)
Requirement already satisfied: scipy<1.14.0,>=1.7.0 in /usr/local/lib/python3.12/dist-packages (from gensim) (1.13.1)
Requirement already satisfied: smart-open>=1.8.1 in /usr/local/lib/python3.12/dist-packages (from gensim) (7.3.0.post1)
Requirement already satisfied: wrapt in /usr/local/lib/python3.12/dist-packages (from smart-open>=1.8.1->gensim) (1.17.3)
In [2]:
import nltk
nltk.download('punkt')  # note the spelling: 'punkt', not 'puntk'
Data Preparation¶
In [3]:
texts = [
    "Photosynthesis is the process by which plants make their food using sunlight.",
    "Mitochondria are the powerhouse of the cell and produce energy.",
    "Water boils at 100 degrees Celsius under normal conditions."
]
summaries = [
    "Plants make food from sunlight.",
    "Mitochondria produce energy.",
    "Water boils at 100 degrees."
]
In [4]:
tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts + summaries)
input_sequences = tokenizer.texts_to_sequences(texts)
target_sequences = tokenizer.texts_to_sequences(summaries)
max_input_len = max(len(seq) for seq in input_sequences)
max_target_len = max(len(seq) for seq in target_sequences)
encoder_input = pad_sequences(input_sequences, maxlen=max_input_len, padding='post')
decoder_input = pad_sequences(target_sequences, maxlen=max_target_len, padding='post')
In [5]:
print(tokenizer.word_index)
{'the': 1, 'plants': 2, 'make': 3, 'food': 4, 'sunlight': 5, 'mitochondria': 6, 'produce': 7, 'energy': 8, 'water': 9, 'boils': 10, 'at': 11, '100': 12, 'degrees': 13, 'photosynthesis': 14, 'is': 15, 'process': 16, 'by': 17, 'which': 18, 'their': 19, 'using': 20, 'are': 21, 'powerhouse': 22, 'of': 23, 'cell': 24, 'and': 25, 'celsius': 26, 'under': 27, 'normal': 28, 'conditions': 29, 'from': 30}
- The Tokenizer counts how often each word appears in the dataset.
- More frequent words receive smaller indices; less frequent words receive larger ones.
- Index 0 is reserved for padding.
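This frequency-ranked indexing can be reproduced in plain Python; the sketch below mimics what `Tokenizer` does internally (it is not the actual Keras implementation, and Keras's tie-breaking for equal counts may differ):

```python
from collections import Counter

docs = ["the cat sat", "the cat ran", "a dog ran"]
counts = Counter(w for doc in docs for w in doc.split())

# Most frequent word gets index 1, next gets 2, and so on
# (index 0 is left free for padding, as in Keras).
word_index = {w: i for i, (w, _) in enumerate(counts.most_common(), start=1)}
print(word_index)
```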
In [6]:
print(input_sequences)
print(target_sequences)
[[14, 15, 1, 16, 17, 18, 2, 3, 19, 4, 20, 5], [6, 21, 1, 22, 23, 1, 24, 25, 7, 8], [9, 10, 11, 12, 13, 26, 27, 28, 29]]
[[2, 3, 4, 30, 5], [6, 7, 8], [9, 10, 11, 12, 13]]
In [7]:
print(max_input_len, max_target_len)
12 5
In [8]:
print(encoder_input)
[[14 15  1 16 17 18  2  3 19  4 20  5]
 [ 6 21  1 22 23  1 24 25  7  8  0  0]
 [ 9 10 11 12 13 26 27 28 29  0  0  0]]
In [9]:
print(decoder_input)
[[ 2  3  4 30  5]
 [ 6  7  8  0  0]
 [ 9 10 11 12 13]]
- `pad_sequences` with `padding='post'` appends zeros to the end of each sequence so all are the same length.
- This produces batch-consistent arrays for the model.
Teacher forcing = giving the correct previous word to the decoder during training to help it learn sequence generation effectively.
In [10]:
decoder_target = np.zeros_like(decoder_input)
In [11]:
decoder_target[:, :-1] = decoder_input[:, 1:]
In [12]:
decoder_target[:, -1] = 0
In [13]:
print(decoder_target)
[[ 3  4 30  5  0]
 [ 7  8  0  0  0]
 [10 11 12 13  0]]
In [14]:
from nltk.tokenize import word_tokenize
In [15]:
nltk.download('punkt_tab')
[nltk_data] Downloading package punkt_tab to /root/nltk_data... [nltk_data] Unzipping tokenizers/punkt_tab.zip.
Out[15]:
True
In [16]:
tokenized_texts = [nltk.word_tokenize(text.lower()) for text in texts + summaries]
w2v_model = Word2Vec(sentences=tokenized_texts, vector_size=50, window=2, min_count=1, sg=1, epochs=200)
In [17]:
print('cat' in w2v_model.wv)  # check whether a word exists in the vocabulary
False
In [18]:
vocab_size = len(tokenizer.word_index) + 1
embedding_dim = 50
embedding_matrix = np.zeros((vocab_size, embedding_dim))
for word, i in tokenizer.word_index.items():
    if word in w2v_model.wv:
        embedding_matrix[i] = w2v_model.wv[word]
    else:
        # Random init is not strictly required, but it lets the model treat
        # unknown words as distinct vectors instead of as padding.
        embedding_matrix[i] = np.random.normal(scale=0.6, size=(embedding_dim,))
In [19]:
print(embedding_matrix.shape)
(31, 50)
In [20]:
encoder_inputs = Input(shape=(max_input_len,))
In [21]:
enc_emb = Embedding(vocab_size, embedding_dim, weights=[embedding_matrix], trainable=False)(encoder_inputs)
In [22]:
encoder_lstm, state_h, state_c = LSTM(128, return_state=True)(enc_emb)
In [23]:
encoder_states = [state_h, state_c]
In [24]:
print(encoder_states)
[<KerasTensor shape=(None, 128), dtype=float32, sparse=False, ragged=False, name=keras_tensor_3>, <KerasTensor shape=(None, 128), dtype=float32, sparse=False, ragged=False, name=keras_tensor_4>]
In [25]:
decoder_inputs = Input(shape=(max_target_len,))
In [26]:
dec_emb = Embedding(vocab_size, embedding_dim, weights=[embedding_matrix], trainable=False)(decoder_inputs)
In [27]:
decoder_lstm = LSTM(128, return_sequences=True, return_state=True)
In [28]:
decoder_outputs, _, _ = decoder_lstm(dec_emb, initial_state=encoder_states)
# the returned state_h and state_c aren't needed during training
In [29]:
decoder_dense = Dense(vocab_size, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)
In [30]:
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
In [31]:
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
In [32]:
model.summary()
Model: "functional"
Layer (type)                Output Shape                                Param #   Connected to
input_layer (InputLayer)    (None, 12)                                  0         -
input_layer_1 (InputLayer)  (None, 5)                                   0         -
embedding (Embedding)       (None, 12, 50)                              1,550     input_layer[0][0]
embedding_1 (Embedding)     (None, 5, 50)                               1,550     input_layer_1[0][0]
lstm (LSTM)                 [(None, 128), (None, 128), (None, 128)]     91,648    embedding[0][0]
lstm_1 (LSTM)               [(None, 5, 128), (None, 128), (None, 128)]  91,648    embedding_1[0][0], lstm[0][1], lstm[0][2]
dense (Dense)               (None, 5, 31)                               3,999     lstm_1[0][0]

Total params: 190,395 (743.73 KB)
Trainable params: 187,295 (731.62 KB)
Non-trainable params: 3,100 (12.11 KB)
In [33]:
model.fit([encoder_input, decoder_input], decoder_target[..., np.newaxis], epochs=200, batch_size=2)
Epoch 1/200   2/2 ━━━━━━━━━━━━━━━━━━━━ 4s 65ms/step - loss: 3.4336
Epoch 50/200  2/2 ━━━━━━━━━━━━━━━━━━━━ 0s 37ms/step - loss: 1.2027
Epoch 100/200 2/2 ━━━━━━━━━━━━━━━━━━━━ 0s 38ms/step - loss: 0.5683
[intermediate epochs elided; loss falls steadily from 3.43 to 0.57 over the logged run]
Out[33]:
<keras.src.callbacks.history.History at 0x7d9bb4a4bf50>
In [34]:
# ==================== INFERENCE SETUP FOR SUMMARY GENERATION ====================
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input
# Build Encoder Inference Model
encoder_model = Model(encoder_inputs, encoder_states)
# Build Decoder Inference Model
decoder_state_input_h = Input(shape=(128,))
decoder_state_input_c = Input(shape=(128,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]
# A fresh Embedding layer, frozen with the same matrix used during training
dec_emb2 = Embedding(vocab_size, embedding_dim, weights=[embedding_matrix], trainable=False)(decoder_inputs)
decoder_outputs2, state_h2, state_c2 = decoder_lstm(dec_emb2, initial_state=decoder_states_inputs)
decoder_states2 = [state_h2, state_c2]
decoder_outputs2 = decoder_dense(decoder_outputs2)
decoder_model = Model([decoder_inputs] + decoder_states_inputs, [decoder_outputs2] + decoder_states2)
# Define decode_sequence function
def decode_sequence(input_seq):
    # Encode the input once to obtain the initial decoder states.
    states_value = encoder_model.predict(input_seq)
    # Seed the decoder. Note: the training data has no explicit 'start' token,
    # so this falls back to index 1; a production setup should add real
    # start/end tokens to the target sequences.
    target_seq = np.zeros((1, 1))
    target_seq[0, 0] = tokenizer.word_index.get('start', 1)
    stop_condition = False
    decoded_sentence = ''
    while not stop_condition:
        output_tokens, h, c = decoder_model.predict([target_seq] + states_value)
        # Greedy decoding: take the highest-probability word at each step.
        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        sampled_word = None
        for word, index in tokenizer.word_index.items():
            if index == sampled_token_index:
                sampled_word = word
                break
        if sampled_word == 'end' or sampled_word is None or len(decoded_sentence.split()) >= max_target_len:
            stop_condition = True
        else:
            decoded_sentence += ' ' + sampled_word
            # Feed the sampled word back in as the next decoder input.
            target_seq = np.zeros((1, 1))
            target_seq[0, 0] = sampled_token_index
            states_value = [h, c]
    return decoded_sentence.strip()
# Test Cell: Generate Summary
test_text = """Photosynthesis allows plants to convert sunlight into food, producing oxygen as a byproduct and supporting life on Earth."""
test_seq = tokenizer.texts_to_sequences([test_text])
test_seq = pad_sequences(test_seq, maxlen=max_input_len, padding='post')
generated_summary = decode_sequence(test_seq)
print("Input Text:")
print(test_text)
print("\nGenerated Summary:")
print(generated_summary)
# ==================== END OF INFERENCE SETUP ====================
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 211ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 242ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 45ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 47ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 44ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 44ms/step

Input Text:
Photosynthesis allows plants to convert sunlight into food, producing oxygen as a byproduct and supporting life on Earth.

Generated Summary:
make food from sunlight
In [ ]:
# ==================== MCQ GENERATION BLOCK ====================
!pip install nltk transformers --quiet
import nltk
nltk.download('punkt')
from nltk.tokenize import sent_tokenize, word_tokenize
from transformers import pipeline
import random
# Extract keywords and generate blanks
def generate_mcq_from_text(text, num_questions=5):
    sentences = sent_tokenize(text)
    words = word_tokenize(text)
    # Keep alphabetic words longer than 4 characters as candidate blanks.
    words = [w for w in words if w.isalpha() and len(w) > 4]
    words = list(set(words))
    if len(words) < num_questions:
        num_questions = len(words)
    selected_words = random.sample(words, num_questions)
    mcqs = []
    for word in selected_words:
        for sent in sentences:
            if word in sent:
                # Note: substring match; a stricter version would match whole tokens only.
                question = sent.replace(word, '______')
                mcqs.append({
                    'question': question,
                    'answer': word
                })
                break
    return mcqs
# Distractor Generation using Masked Language Modeling (MLM)
fill_mask = pipeline('fill-mask', model='bert-base-uncased')
def add_distractors(mcqs, num_distractors=3):
    for mcq in mcqs:
        masked_sent = mcq['question'].replace('______', '[MASK]')
        predictions = fill_mask(masked_sent)
        distractors = []
        for pred in predictions:
            token = pred['token_str']
            # Skip the correct answer, non-words, and duplicates.
            if token.lower() != mcq['answer'].lower() and token.isalpha() and token not in distractors:
                distractors.append(token)
            if len(distractors) >= num_distractors:
                break
        mcq['options'] = distractors + [mcq['answer']]
        random.shuffle(mcq['options'])
    return mcqs
# Usage Example
text = """
Photosynthesis allows plants to convert sunlight into food, producing oxygen as a byproduct and supporting life on Earth.
The mitochondria is the powerhouse of the cell, producing ATP for cellular activities.
Water boils at 100 degrees Celsius under normal atmospheric pressure.
"""
mcqs = generate_mcq_from_text(text, num_questions=3)
mcqs = add_distractors(mcqs)
# Display MCQs
for idx, mcq in enumerate(mcqs, 1):
    print(f"\nQ{idx}: {mcq['question']}")
    for opt_idx, option in enumerate(mcq['options'], ord('A')):
        print(f"  {chr(opt_idx)}. {option}")
    print(f"Answer: {mcq['answer']}")
# ==================== END OF MCQ GENERATION BLOCK ====================
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data] Unzipping tokenizers/punkt.zip.
Device set to use cpu
Q1: Water boils at 100 degrees ______ under normal atmospheric pressure.
  A. c  B. Celsius  C. altitude  D. elevation
Answer: Celsius

Q2: Photosynthesis allows ______ to convert sunlight into food, producing oxygen as a byproduct and supporting life on Earth.
  A. plants  B. bacteria  C. humans  D. organisms
Answer: plants

Q3: Photosynthesis allows plants to convert ______ into food, producing oxygen as a byproduct and supporting life on Earth.
  A. energy  B. water  C. sunlight  D. oxygen
Answer: sunlight
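The pipeline's final step, exporting the MCQs as JSON for quiz use, is not shown above. A minimal sketch, assuming each MCQ is a dict with `question`, `answer`, and `options` keys as produced by the cells above (the sample question and the `quiz.json` filename are illustrative):

```python
import json

mcqs = [
    {
        "question": "Water boils at 100 degrees ______ under normal atmospheric pressure.",
        "answer": "Celsius",
        "options": ["c", "Celsius", "altitude", "elevation"],
    },
]

# Write the quiz to disk; ensure_ascii=False keeps any non-ASCII terms readable.
with open("quiz.json", "w", encoding="utf-8") as f:
    json.dump(mcqs, f, indent=2, ensure_ascii=False)

# Round-trip check: reload and verify the structure survived.
with open("quiz.json", encoding="utf-8") as f:
    loaded = json.load(f)
print(loaded[0]["answer"])  # Celsius
```

A quiz front end can then load this file directly, since JSON lists of objects map cleanly onto most UI frameworks' data models.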