Whitepaper · Research Buddy v1.2

Summary Adherence Scoring
via AlignScore

A new feature integrating factual consistency evaluation into Research Buddy's dual-pipeline summarization workflow

Nur A Jaman  ·  February 2026  ·  turjo-jaman.com
Acknowledgement — Dr. Shibbir Ahmed, Assistant Professor, Texas State University — for suggesting this research work

Introduction

Research Buddy was originally conceived as a personal productivity tool to streamline academic paper discovery and understanding. Version 1.1 delivered abstract extraction, multi-model classification, dual-pipeline keyword extraction and summarization (Transformer + Gemini), and a personal paper library — all in one integrated, cost-conscious platform.

However, a critical gap remained: while v1.1 could generate summaries, it could not validate them. Large language models and local Transformers are known to hallucinate — producing fluent but factually inconsistent content.

New in v1.2 — Summary Adherence Scoring: Automatically evaluates the factual consistency of any generated summary against the full paper text, returning an interpretable score, confidence level, and per-sentence evidence breakdown.

This transforms Research Buddy from a generation tool into a reliability-focused research companion, aligned with principles of Explainable AI (XAI) and trustworthy NLP.

XAI & Trustworthiness Alignment

By delivering granular per-sentence alignment breakdowns with evidence chunks, the feature provides interpretable reasoning behind factual consistency scores, enabling users to audit LLM summaries transparently. The binary is_reliable flag and calibrated thresholds foster warranted trust, mitigating hallucination risks in Transformer/Gemini outputs while correlating strongly with human judgments on established benchmarks.

AlignScore: Architecture, Training & Scoring

2.1 Motivation

Prior metrics — n-gram based (ROUGE, BLEU), NLI-based (SummaC), or QA-based (QAFactEval) — rely on narrow, task-specific training data and fail to generalize. AlignScore (Zha et al., 2023) proposes a unified alignment function trained on 4.7M examples from 7 well-established NLP tasks.

2.2 Architecture

AlignScore is built on RoBERTa with three output heads trained jointly:

  3-Way (primary head):  ALIGNED / CONTRADICT / NEUTRAL
  Binary:                ALIGNED / NOT-ALIGNED
  Regression:            score in [0, 1]

2.3 Internal Model Architecture

The text pair (chunk, sentence) is fed into RoBERTa. The pooler_output — the pooled [CLS] token embedding (768-dim for RoBERTa-base, 1024-dim for RoBERTa-large) — is passed through the three linear heads. [CLS] marks the start of the sequence; [SEP] separates the two texts.

Input:  [CLS] chunk_text [SEP] summary_sentence [SEP]
        ↓ RoBERTa Transformer Blocks (12–24 layers)
        ↓ pooler_output = CLS token embedding (768-dim)
        ↓
       ├─ tri_layer  Linear(768→3)  → ALIGNED / CONTRADICT / NEUTRAL  ⭐
       ├─ bin_layer  Linear(768→2)  → ALIGNED / NOT-ALIGNED
       └─ reg_layer  Linear(768→1)  → score ∈ [0, 1]
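
A minimal PyTorch sketch of this head layout (module names follow the diagram above; the released AlignScore checkpoint may organize its layers differently, so treat this as an illustration rather than the reference implementation):

import torch
import torch.nn as nn
from transformers import RobertaModel, RobertaTokenizer

class AlignmentHeads(nn.Module):
    """RoBERTa encoder feeding the three output heads shown above."""
    def __init__(self, backbone: str = "roberta-base"):
        super().__init__()
        self.encoder = RobertaModel.from_pretrained(backbone)
        hidden = self.encoder.config.hidden_size      # 768 for roberta-base
        self.tri_layer = nn.Linear(hidden, 3)         # ALIGNED / CONTRADICT / NEUTRAL
        self.bin_layer = nn.Linear(hidden, 2)         # ALIGNED / NOT-ALIGNED
        self.reg_layer = nn.Linear(hidden, 1)         # score in [0, 1]

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.pooler_output                       # pooled [CLS] embedding
        return {
            "tri": self.tri_layer(cls).softmax(dim=-1),
            "bin": self.bin_layer(cls).softmax(dim=-1),
            "reg": torch.sigmoid(self.reg_layer(cls)),
        }

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
pair = tokenizer("chunk_text", "summary_sentence", return_tensors="pt")
probs = AlignmentHeads()(pair["input_ids"], pair["attention_mask"])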


2.4 MLM in Synthetic Data Augmentation

The MLM (Masked Language Model) component is used only during training, not inference. Negative (NOT-ALIGNED) samples are synthetically generated by masking 25% of tokens in aligned pairs and using MLM to infill them — producing plausible but factually inconsistent pairs.

Key distinction: CLS pooling is the inference-time scoring mechanism. MLM is a training-time augmentation tool only.
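
A word-level sketch of that augmentation using the Hugging Face fill-mask pipeline (the 25% rate comes from the text above; masking whole words and using roberta-base as the infiller are simplifications for illustration):

import random
from transformers import pipeline

# MLM used only at training time to corrupt aligned pairs into negatives.
infill = pipeline("fill-mask", model="roberta-base")

def make_negative(sentence: str, mask_rate: float = 0.25) -> str:
    """Mask ~25% of the words and let the MLM infill them, producing a fluent
    but potentially factually inconsistent (NOT-ALIGNED) variant."""
    words = sentence.split()
    out = []
    for i, word in enumerate(words):
        if random.random() < mask_rate:
            masked = " ".join(out + ["<mask>"] + words[i + 1:])
            out.append(infill(masked)[0]["token_str"].strip())  # top MLM prediction
        else:
            out.append(word)
    return " ".join(out)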

2.5 Joint Loss Function

\[ L_{total} = \lambda_1 L_{3\text{-way}} + \lambda_2 L_{\text{bin}} + \lambda_3 L_{\text{reg}} \quad \text{where} \quad \lambda_1 = \lambda_2 = \lambda_3 = 1 \]
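
In training code this is simply a weighted sum of per-head losses. A sketch (cross-entropy for the two classification heads and MSE for the regression head are standard choices assumed here; since each source task in Section 2.6 supplies only one label type, in practice only the matching term is active for a given batch):

import torch.nn.functional as F

def joint_loss(tri_logits, bin_logits, reg_pred, tri_y, bin_y, reg_y,
               lambdas=(1.0, 1.0, 1.0)):
    """L_total = lambda1*L_3-way + lambda2*L_bin + lambda3*L_reg, all lambdas = 1."""
    l_tri = F.cross_entropy(tri_logits, tri_y)          # 3-way head
    l_bin = F.cross_entropy(bin_logits, bin_y)          # binary head
    l_reg = F.mse_loss(reg_pred.squeeze(-1), reg_y)     # regression head
    return lambdas[0] * l_tri + lambdas[1] * l_bin + lambdas[2] * l_reg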

2.6 Training Dataset

NLP Task             Datasets                                   Label Type       Samples
NLI                  SNLI, MultiNLI, Adversarial NLI, DocNLI    3-way / binary   ~2M
Fact Verification    FEVER, VitaminC                            3-way            ~579k
Paraphrase           QQP, PAWS, WikiText-103                    binary           ~9M
STS                  SICK, STS Benchmark                        regression       ~10k
Question Answering   SQuAD v2, RACE                             binary           ~481k
Info Retrieval       MS MARCO                                   binary           ~5M
Summarization        WikiHow                                    binary           ~157k
Total                15 datasets, 7 tasks                                        4.7M

2.7 Scoring Formula

  1. Split paper into ~350-token chunks at sentence boundaries
  2. Split summary into individual sentences
  3. Score all chunk × sentence pairs
  4. Take max score per summary sentence (best matching chunk)
  5. Mean of all max scores = final AlignScore (see the sketch below)
\[ \text{ALIGNSCORE}(o,l) = \frac{1}{|l|} \sum_{j} \max_{i} \; \text{alignment}(o_i, l_j) \]

\(o_i\) = context chunks  ·  \(l_j\) = claim sentences  ·  alignment = P(ALIGNED) from tri_layer
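
A sketch of steps 1–5, assuming an alignment(chunk, sentence) callable that returns P(ALIGNED) from the tri_layer, and using NLTK for sentence splitting with a whitespace word count as a stand-in for the ~350-token budget:

import nltk  # one-time setup: nltk.download("punkt")

def chunk_context(paper_text: str, max_tokens: int = 350) -> list[str]:
    """Step 1: greedily pack whole sentences into ~350-token chunks."""
    chunks, current, count = [], [], 0
    for sent in nltk.sent_tokenize(paper_text):
        n = len(sent.split())
        if current and count + n > max_tokens:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sent)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks

def align_score(paper_text: str, summary_text: str, alignment) -> float:
    chunks = chunk_context(paper_text)                      # step 1
    sentences = nltk.sent_tokenize(summary_text)            # step 2
    best_per_sentence = [
        max(alignment(chunk, sent) for chunk in chunks)     # steps 3-4
        for sent in sentences
    ]
    return sum(best_per_sentence) / len(best_per_sentence)  # step 5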

2.8 Benchmark Performance

Benchmark                 AlignScore-large    Previous SOTA
SummaC (AUC-ROC avg)      88.6%               84.6% (UniEval)
TRUE (AUC-ROC avg)        87.4%               86.3% (AlignScore-base)
Pearson Correlation       54.1%               48.6% (QAFactEval)

Integration in Research Buddy

3.1 Design Philosophy

AlignScore runs locally via Hugging Face — no external API calls, consistent with the free-tier stack. It is exposed through two FastAPI endpoints under the /adherence router.
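
One way to run it locally is the reference AlignScore package from Zha et al. (2023); a sketch under that assumption (checkpoint path and device are placeholders, and Research Buddy's actual wrapper may differ):

# pip install alignscore  (https://github.com/yuh-zha/AlignScore)
from alignscore import AlignScore

scorer = AlignScore(
    model="roberta-large",
    batch_size=16,
    device="cpu",
    ckpt_path="checkpoints/AlignScore-large.ckpt",   # locally downloaded checkpoint
    evaluation_mode="nli_sp",                        # chunk/sentence splitting + tri_layer P(ALIGNED)
)

paper_text = open("paper.txt").read()
summary_text = open("summary.txt").read()
score = scorer.score(contexts=[paper_text], claims=[summary_text])[0]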

3.2 API Routes

POST /adherence/score · Simple Score

Returns overall score, confidence, and reliability flag.

// Input
{ "paper_text": "...", "summary_text": "..." }

// Output
{ "align_score": 0.682, "confidence": "Medium", "is_reliable": false }
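
A hedged sketch of the handler behind this route (only the path and JSON field names come from the spec above; the Pydantic models, the adherence_model import, and the confidence/reliability thresholds are illustrative assumptions chosen to be consistent with the case-study output in Section 5, not the shipped values):

from fastapi import APIRouter
from pydantic import BaseModel

from adherence_model import scorer   # hypothetical module wrapping the local AlignScore model

router = APIRouter(prefix="/adherence")

class ScoreRequest(BaseModel):
    paper_text: str
    summary_text: str

class ScoreResponse(BaseModel):
    align_score: float
    confidence: str
    is_reliable: bool

@router.post("/score", response_model=ScoreResponse)
def score_summary(req: ScoreRequest) -> ScoreResponse:
    s = scorer.score(contexts=[req.paper_text], claims=[req.summary_text])[0]
    confidence = "High" if s >= 0.8 else "Medium" if s >= 0.5 else "Low"
    return ScoreResponse(align_score=round(s, 3),
                         confidence=confidence,
                         is_reliable=s >= 0.8)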

POST /adherence/breakdown · Detailed

Returns per-sentence breakdown with top-k supporting chunks.

// Input
{ "paper_text": "...", "summary_text": "...", "top_k": 3 }

// Output
{
  "align_score": 0.682, "num_context_chunks": 39,
  "num_claim_sentences": 4,
  "per_sentence": [
    { "sentence_index": 0, "best_chunk_score": 0.906,
      "top_k_chunks": [{"chunk_index":1,"score":0.906}, ...] }
  ]
}
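
Both routes can be exercised with any HTTP client; a sketch using requests against a locally running instance (the base URL and file names are assumptions):

import requests

BASE = "http://localhost:8000"
payload = {
    "paper_text": open("paper.txt").read(),
    "summary_text": open("summary.txt").read(),
}

simple = requests.post(f"{BASE}/adherence/score", json=payload).json()
print(simple["align_score"], simple["confidence"], simple["is_reliable"])

detail = requests.post(f"{BASE}/adherence/breakdown", json={**payload, "top_k": 3}).json()
for item in detail["per_sentence"]:
    print(item["sentence_index"], item["best_chunk_score"], item["top_k_chunks"])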

3.3 Validation

3.4 User Workflow

  1. Upload PDF — abstract and full text extracted automatically
  2. Dual Summaries Generated — Transformer + Gemini pipelines run in parallel
  3. AlignScore Evaluation — each summary scored via /adherence/score
  4. Breakdown on Demand — drill into /adherence/breakdown for per-sentence evidence
  5. Library Save — select the more reliable summary to attach to the paper collection (a sketch of steps 3–5 follows)
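
A sketch of steps 3–5, assuming a score_summary(paper_text, summary_text) helper that wraps POST /adherence/score and returns its JSON (the helper and pipeline names are illustrative):

def pick_reliable_summary(paper_text, summaries, score_summary):
    """summaries maps a pipeline name ("transformer", "gemini") to its summary text."""
    results = {name: score_summary(paper_text, text) for name, text in summaries.items()}
    best = max(results, key=lambda name: results[name]["align_score"])
    # Attach summaries[best] and its score to the paper's library entry.
    return best, results[best]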

Pseudocode Walkthrough

Each step is traced with DeepInfer-flavored examples showing exactly how data flows through the pipeline.
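
As a compact stand-in for that walkthrough, here is the arithmetic of Section 2.7 applied to the DeepInfer run reported in Section 5 (the per-sentence maxima are the rounded values from Section 5.5):

# 39 context chunks x 4 summary sentences = 156 (chunk, sentence) pairs scored
best_per_sentence = [0.906, 0.740, 0.532, 0.553]        # max alignment per summary sentence
align_score = sum(best_per_sentence) / len(best_per_sentence)
print(round(align_score, 3))    # 0.683, matching the reported 0.682 up to rounding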

Case Study: DeepInfer Paper Evaluation

5.1 Paper Overview

"Inferring Data Preconditions from Deep Learning Models for Trustworthy Prediction in Deployment" (Ahmed, Gao & Rajan, ICSE 2024) proposes DeepInfer — a technique to infer data preconditions from trained DNNs using weakest precondition calculus.

  29 DNN models  ·  4 real-world datasets  ·  0.98 PCC with correct predictions  ·  3.27x faster than SOTA

5.2 Demo Video


5.3 Test Setup

5.4 Aggregate Results

  AlignScore: 68.2%  ·  Confidence: Medium  ·  Is Reliable: False  ·  Chunks / Sentences: 39 / 4

5.5 Per-Sentence Breakdown

  Sentence 0 · score 90.6%
  "DeepInfer proposes a novel technique to infer data preconditions from trained DNNs to assess trustworthiness of predictions on unseen data."
  Best: Chunk 1 (Abstract) · Top-3: Chunk 3 (74.4%), Chunk 5 (70.7%)

  Sentence 1 · score 74.0%
  "The approach utilizes a new DNN abstraction and weakest precondition calculus, deriving rules to compute layer-wise preconditions backward from model output."
  Best: Chunk 3 (Motivation) · Top-3: Chunk 1 (67.4%), Chunk 8 (57.2%)

  Sentence 2 · score 53.2%
  "Data precondition violations are highly correlated (pcc=0.88) with incorrect predictions; satisfaction strongly correlates (pcc=0.98) with correct ones."
  Best: Chunk 14 (Results) · Top-3: Chunk 12 (24.5%), Chunk 15 (23.3%)

  Sentence 3 · score 55.3%
  "DeepInfer effectively identifies correct/incorrect predictions (recall 0.98, F-1 0.84) and demonstrates a 3.27x speed improvement over state-of-the-art."
  Best: Chunk 20 (Conclusion) · Top-3: Chunk 0 (33.6%), Chunk 28 (27.2%)

5.6 Analysis

Sentences describing core contributions (0–1) score strongly (74–90%), closely mirroring the abstract. Sentences summarizing quantitative results (2–3) score lower (53–55%), suggesting the Gemini summary paraphrased numerical claims at a level of abstraction that weakened chunk alignment.

The is_reliable: false flag correctly signals that while the summary captures the gist, it may not be faithful enough for citation or critical analysis.