Whitepaper · Research Buddy v1.2

Summary Adherence Scoring
via AlignScore

A new feature integrating factual consistency evaluation into Research Buddy's dual-pipeline summarization workflow

Nur A Jaman  ·  February 2026  ·  turjo-jaman.com
Acknowledgement — Dr. Shibbir Ahmed, Assistant Professor, Texas State University — for suggesting this research work

Introduction

Research Buddy was originally conceived as a personal productivity tool to streamline academic paper discovery and understanding. Version 1.1 delivered abstract extraction, multi-model classification, dual-pipeline keyword extraction and summarization (Transformer + Gemini), and a personal paper library — all in one integrated, cost-conscious platform.

However, a critical gap remained: while v1.1 could generate summaries, it could not validate them. Large language models and local Transformers are known to hallucinate — producing fluent but factually inconsistent content.

New in v1.2 — Summary Adherence Scoring: Automatically evaluates the factual consistency of any generated summary against the full paper text, returning an interpretable score, confidence level, and per-sentence evidence breakdown.

This transforms Research Buddy from a generation tool into a reliability-focused research companion, aligned with principles of Explainable AI (XAI) and trustworthy NLP.

XAI & Trustworthiness Alignment

By delivering granular per-sentence alignment breakdowns with evidence chunks, the feature provides interpretable reasoning behind factual consistency scores, enabling users to audit LLM summaries transparently. The binary is_reliable flag and calibrated thresholds foster warranted trust, mitigating hallucination risks in Transformer/Gemini outputs while correlating strongly with human judgments on established benchmarks.

AlignScore: Architecture, Training & Scoring

2.1 Motivation

Prior metrics — n-gram based (ROUGE, BLEU), NLI-based (SummaC), or QA-based (QAFactEval) — rely on narrow, task-specific training data and fail to generalize. AlignScore (Zha et al., 2023) proposes a unified alignment function trained on 4.7M examples from 7 well-established NLP tasks.

2.2 Architecture

AlignScore is built on RoBERTa with three output heads trained jointly:

  3-Way (primary head):  ALIGNED / CONTRADICT / NEUTRAL
  Binary:                ALIGNED / NOT-ALIGNED
  Regression:            score in [0, 1]

2.3 Internal Model Architecture

The text pair (chunk, sentence) is fed into RoBERTa. The pooler_output — the pooled [CLS] token embedding (768-dim for RoBERTa-base, 1024-dim for RoBERTa-large) — is passed through the three linear heads. [CLS] marks the start of the sequence; [SEP] separates the two texts.

Input:  [CLS] chunk_text [SEP] summary_sentence [SEP]
        ↓ RoBERTa Transformer Blocks (12–24 layers)
        ↓ pooler_output = CLS token embedding (768-dim)
        ↓
       ├─ tri_layer  Linear(768→3)  → ALIGNED / CONTRADICT / NEUTRAL  ⭐
       ├─ bin_layer  Linear(768→2)  → ALIGNED / NOT-ALIGNED
       └─ reg_layer  Linear(768→1)  → score ∈ [0, 1]
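
A minimal PyTorch sketch of this head layout (module names follow the diagram above; the released AlignScore checkpoint may organize its layers differently, so treat this as an illustration rather than the reference implementation):

import torch
import torch.nn as nn
from transformers import RobertaModel, RobertaTokenizer

class AlignmentHeads(nn.Module):
    """RoBERTa encoder feeding the three output heads shown above."""
    def __init__(self, backbone: str = "roberta-base"):
        super().__init__()
        self.encoder = RobertaModel.from_pretrained(backbone)
        hidden = self.encoder.config.hidden_size      # 768 for roberta-base
        self.tri_layer = nn.Linear(hidden, 3)         # ALIGNED / CONTRADICT / NEUTRAL
        self.bin_layer = nn.Linear(hidden, 2)         # ALIGNED / NOT-ALIGNED
        self.reg_layer = nn.Linear(hidden, 1)         # score in [0, 1]

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.pooler_output                       # pooled [CLS] embedding
        return {
            "tri": self.tri_layer(cls).softmax(dim=-1),
            "bin": self.bin_layer(cls).softmax(dim=-1),
            "reg": torch.sigmoid(self.reg_layer(cls)),
        }

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
pair = tokenizer("chunk_text", "summary_sentence", return_tensors="pt")
probs = AlignmentHeads()(pair["input_ids"], pair["attention_mask"])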


2.4 MLM in Synthetic Data Augmentation

The MLM (Masked Language Model) component is used only during training, not inference. Negative (NOT-ALIGNED) samples are synthetically generated by masking 25% of tokens in aligned pairs and using MLM to infill them — producing plausible but factually inconsistent pairs.

Key distinction: CLS pooling is the inference-time scoring mechanism. MLM is a training-time augmentation tool only.
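
A word-level sketch of that augmentation using the Hugging Face fill-mask pipeline (the 25% rate comes from the text above; masking whole words and using roberta-base as the infiller are simplifications for illustration):

import random
from transformers import pipeline

# MLM used only at training time to corrupt aligned pairs into negatives.
infill = pipeline("fill-mask", model="roberta-base")

def make_negative(sentence: str, mask_rate: float = 0.25) -> str:
    """Mask ~25% of the words and let the MLM infill them, producing a fluent
    but potentially factually inconsistent (NOT-ALIGNED) variant."""
    words = sentence.split()
    out = []
    for i, word in enumerate(words):
        if random.random() < mask_rate:
            masked = " ".join(out + ["<mask>"] + words[i + 1:])
            out.append(infill(masked)[0]["token_str"].strip())  # top MLM prediction
        else:
            out.append(word)
    return " ".join(out)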

2.5 Joint Loss Function

\[ L_{total} = \lambda_1 L_{3\text{-way}} + \lambda_2 L_{\text{bin}} + \lambda_3 L_{\text{reg}} \quad \text{where} \quad \lambda_1 = \lambda_2 = \lambda_3 = 1 \]
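
In training code this is simply a weighted sum of per-head losses. A sketch (cross-entropy for the two classification heads and MSE for the regression head are standard choices assumed here; since each source task in Section 2.6 supplies only one label type, in practice only the matching term is active for a given batch):

import torch.nn.functional as F

def joint_loss(tri_logits, bin_logits, reg_pred, tri_y, bin_y, reg_y,
               lambdas=(1.0, 1.0, 1.0)):
    """L_total = lambda1*L_3-way + lambda2*L_bin + lambda3*L_reg, all lambdas = 1."""
    l_tri = F.cross_entropy(tri_logits, tri_y)          # 3-way head
    l_bin = F.cross_entropy(bin_logits, bin_y)          # binary head
    l_reg = F.mse_loss(reg_pred.squeeze(-1), reg_y)     # regression head
    return lambdas[0] * l_tri + lambdas[1] * l_bin + lambdas[2] * l_reg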

2.6 Training Dataset

NLP Task             Datasets                                   Label Type       Samples
NLI                  SNLI, MultiNLI, Adversarial NLI, DocNLI    3-way / binary   ~2M
Fact Verification    FEVER, VitaminC                            3-way            ~579k
Paraphrase           QQP, PAWS, WikiText-103                    binary           ~9M
STS                  SICK, STS Benchmark                        regression       ~10k
Question Answering   SQuAD v2, RACE                             binary           ~481k
Info Retrieval       MS MARCO                                   binary           ~5M
Summarization        WikiHow                                    binary           ~157k
Total                15 datasets, 7 tasks                                        4.7M

2.7 Scoring Formula

  1. Split paper into ~350-token chunks at sentence boundaries
  2. Split summary into individual sentences
  3. Score all chunk × sentence pairs
  4. Take max score per summary sentence (best matching chunk)
  5. Mean of all max scores = final AlignScore (see the sketch below)
\[ \text{ALIGNSCORE}(o,l) = \frac{1}{|l|} \sum_{j} \max_{i} \; \text{alignment}(o_i, l_j) \]

\(o_i\) = context chunks  ·  \(l_j\) = claim sentences  ·  alignment = P(ALIGNED) from tri_layer
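
A sketch of steps 1–5, assuming an alignment(chunk, sentence) callable that returns P(ALIGNED) from the tri_layer, and using NLTK for sentence splitting with a whitespace word count as a stand-in for the ~350-token budget:

import nltk  # one-time setup: nltk.download("punkt")

def chunk_context(paper_text: str, max_tokens: int = 350) -> list[str]:
    """Step 1: greedily pack whole sentences into ~350-token chunks."""
    chunks, current, count = [], [], 0
    for sent in nltk.sent_tokenize(paper_text):
        n = len(sent.split())
        if current and count + n > max_tokens:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sent)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks

def align_score(paper_text: str, summary_text: str, alignment) -> float:
    chunks = chunk_context(paper_text)                      # step 1
    sentences = nltk.sent_tokenize(summary_text)            # step 2
    best_per_sentence = [
        max(alignment(chunk, sent) for chunk in chunks)     # steps 3-4
        for sent in sentences
    ]
    return sum(best_per_sentence) / len(best_per_sentence)  # step 5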

2.8 Benchmark Performance

Benchmark                 AlignScore-large    Previous SOTA
SummaC (AUC-ROC avg)      88.6%               84.6% (UniEval)
TRUE (AUC-ROC avg)        87.4%               86.3% (AlignScore-base)
Pearson Correlation       54.1%               48.6% (QAFactEval)

Integration in Research Buddy

3.1 Design Philosophy

AlignScore runs locally via Hugging Face — no external API calls, consistent with the free-tier stack. It is exposed through two FastAPI endpoints under the /adherence router.
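
One way to run it locally is the reference AlignScore package from Zha et al. (2023); a sketch under that assumption (checkpoint path and device are placeholders, and Research Buddy's actual wrapper may differ):

# pip install alignscore  (https://github.com/yuh-zha/AlignScore)
from alignscore import AlignScore

scorer = AlignScore(
    model="roberta-large",
    batch_size=16,
    device="cpu",
    ckpt_path="checkpoints/AlignScore-large.ckpt",   # locally downloaded checkpoint
    evaluation_mode="nli_sp",                        # chunk/sentence splitting + tri_layer P(ALIGNED)
)

paper_text = open("paper.txt").read()
summary_text = open("summary.txt").read()
score = scorer.score(contexts=[paper_text], claims=[summary_text])[0]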

3.2 API Routes

POST /adherence/score · Simple Score

Returns overall score, confidence, and reliability flag.

// Input
{ "paper_text": "...", "summary_text": "..." }

// Output
{ "align_score": 0.682, "confidence": "Medium", "is_reliable": false }
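
A hedged sketch of the handler behind this route (only the path and JSON field names come from the spec above; the Pydantic models, the adherence_model import, and the confidence/reliability thresholds are illustrative assumptions chosen to be consistent with the case-study output in Section 5, not the shipped values):

from fastapi import APIRouter
from pydantic import BaseModel

from adherence_model import scorer   # hypothetical module wrapping the local AlignScore model

router = APIRouter(prefix="/adherence")

class ScoreRequest(BaseModel):
    paper_text: str
    summary_text: str

class ScoreResponse(BaseModel):
    align_score: float
    confidence: str
    is_reliable: bool

@router.post("/score", response_model=ScoreResponse)
def score_summary(req: ScoreRequest) -> ScoreResponse:
    s = scorer.score(contexts=[req.paper_text], claims=[req.summary_text])[0]
    confidence = "High" if s >= 0.8 else "Medium" if s >= 0.5 else "Low"
    return ScoreResponse(align_score=round(s, 3),
                         confidence=confidence,
                         is_reliable=s >= 0.8)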

POST /adherence/breakdown · Detailed

Returns per-sentence breakdown with top-k supporting chunks.

// Input
{ "paper_text": "...", "summary_text": "...", "top_k": 3 }

// Output
{
  "align_score": 0.682, "num_context_chunks": 39,
  "num_claim_sentences": 4,
  "per_sentence": [
    { "sentence_index": 0, "best_chunk_score": 0.906,
      "top_k_chunks": [{"chunk_index":1,"score":0.906}, ...] }
  ]
}
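
Both routes can be exercised with any HTTP client; a sketch using requests against a locally running instance (the base URL and file names are assumptions):

import requests

BASE = "http://localhost:8000"
payload = {
    "paper_text": open("paper.txt").read(),
    "summary_text": open("summary.txt").read(),
}

simple = requests.post(f"{BASE}/adherence/score", json=payload).json()
print(simple["align_score"], simple["confidence"], simple["is_reliable"])

detail = requests.post(f"{BASE}/adherence/breakdown", json={**payload, "top_k": 3}).json()
for item in detail["per_sentence"]:
    print(item["sentence_index"], item["best_chunk_score"], item["top_k_chunks"])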

3.3 Validation

3.4 User Workflow

  1. Upload PDF — abstract and full text extracted automatically
  2. Dual Summaries Generated — Transformer + Gemini pipelines run in parallel
  3. AlignScore Evaluation — each summary scored via /adherence/score
  4. Breakdown on Demand — drill into /adherence/breakdown for per-sentence evidence
  5. Library Save — select the more reliable summary to attach to the paper collection (a sketch of steps 3–5 follows)
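
A sketch of steps 3–5, assuming a score_summary(paper_text, summary_text) helper that wraps POST /adherence/score and returns its JSON (the helper and pipeline names are illustrative):

def pick_reliable_summary(paper_text, summaries, score_summary):
    """summaries maps a pipeline name ("transformer", "gemini") to its summary text."""
    results = {name: score_summary(paper_text, text) for name, text in summaries.items()}
    best = max(results, key=lambda name: results[name]["align_score"])
    # Attach summaries[best] and its score to the paper's library entry.
    return best, results[best]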

Pseudocode Walkthrough

Each step is traced with DeepInfer-flavored examples showing exactly how data flows through the pipeline.
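
As a compact stand-in for that walkthrough, here is the arithmetic of Section 2.7 applied to the DeepInfer run reported in Section 5 (the per-sentence maxima are the rounded values from Section 5.5):

# 39 context chunks x 4 summary sentences = 156 (chunk, sentence) pairs scored
best_per_sentence = [0.906, 0.740, 0.532, 0.553]        # max alignment per summary sentence
align_score = sum(best_per_sentence) / len(best_per_sentence)
print(round(align_score, 3))    # 0.683, matching the reported 0.682 up to rounding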

Case Study: DeepInfer Paper Evaluation

5.1 Paper Overview

"Inferring Data Preconditions from Deep Learning Models for Trustworthy Prediction in Deployment" (Ahmed, Gao & Rajan, ICSE 2024) proposes DeepInfer — a technique to infer data preconditions from trained DNNs using weakest precondition calculus.

  29 DNN models  ·  4 real-world datasets  ·  0.98 PCC with correct predictions  ·  3.27x faster than SOTA

5.2 Demo Video


5.3 Test Setup

5.4 Aggregate Results

  AlignScore: 68.2%  ·  Confidence: Medium  ·  Is Reliable: False  ·  Chunks / Sentences: 39 / 4

5.5 Per-Sentence Breakdown

  Sentence 0 · score 90.6%
  "DeepInfer proposes a novel technique to infer data preconditions from trained DNNs to assess trustworthiness of predictions on unseen data."
  Best: Chunk 1 (Abstract) · Top-3: Chunk 3 (74.4%), Chunk 5 (70.7%)

  Sentence 1 · score 74.0%
  "The approach utilizes a new DNN abstraction and weakest precondition calculus, deriving rules to compute layer-wise preconditions backward from model output."
  Best: Chunk 3 (Motivation) · Top-3: Chunk 1 (67.4%), Chunk 8 (57.2%)

  Sentence 2 · score 53.2%
  "Data precondition violations are highly correlated (pcc=0.88) with incorrect predictions; satisfaction strongly correlates (pcc=0.98) with correct ones."
  Best: Chunk 14 (Results) · Top-3: Chunk 12 (24.5%), Chunk 15 (23.3%)

  Sentence 3 · score 55.3%
  "DeepInfer effectively identifies correct/incorrect predictions (recall 0.98, F-1 0.84) and demonstrates a 3.27x speed improvement over state-of-the-art."
  Best: Chunk 20 (Conclusion) · Top-3: Chunk 0 (33.6%), Chunk 28 (27.2%)

5.6 Analysis

Sentences describing core contributions (0–1) score strongly (74–90%), closely mirroring the abstract. Sentences summarizing quantitative results (2–3) score lower (53–55%), suggesting the Gemini summary paraphrased numerical claims at a level of abstraction that weakened chunk alignment.

The is_reliable: false flag correctly signals that while the summary captures the gist, it may not be faithful enough for citation or critical analysis.