Introduction
Research Buddy was originally conceived as a personal productivity tool to streamline academic paper discovery and understanding. Version 1.1 delivered abstract extraction, multi-model classification, dual-pipeline keyword extraction and summarization (Transformer + Gemini), and a personal paper library — all in one integrated, cost-conscious platform.
However, a critical gap remained: while v1.1 could generate summaries, it could not validate them. Large language models and local Transformers are known to hallucinate — producing fluent but factually inconsistent content.
The new adherence-checking feature closes this gap by validating each generated summary against the source paper with AlignScore, a unified factual-consistency metric. This transforms Research Buddy from a generation tool into a reliability-focused research companion, aligned with principles of Explainable AI (XAI) and trustworthy NLP.
XAI & Trustworthiness Alignment
By delivering granular per-sentence alignment breakdowns with evidence chunks, the feature provides
interpretable reasoning behind factual consistency scores, enabling users to audit LLM summaries
transparently. The binary is_reliable flag and calibrated thresholds foster warranted
trust, mitigating hallucination risks in Transformer/Gemini outputs while correlating strongly with
human judgments on established benchmarks.
AlignScore: Architecture, Training & Scoring
2.1 Motivation
Prior metrics — n-gram based (ROUGE, BLEU), NLI-based (SummaC), or QA-based (QAFactEval) — rely on narrow, task-specific training data and fail to generalize. AlignScore (Zha et al., 2023) proposes a unified alignment function trained on 4.7M examples from 7 well-established NLP tasks.
2.2 Architecture
AlignScore is built on RoBERTa with three output heads trained jointly:
- tri_layer (3-way classification): ALIGNED / CONTRADICT / NEUTRAL (primary head, used for scoring)
- bin_layer (binary classification): ALIGNED / NOT-ALIGNED
- reg_layer (regression): continuous score \(\in [0, 1]\)
2.3 Internal Model Architecture
The text pair (chunk, sentence) is fed into RoBERTa. The pooler_output —
the [CLS] token embedding (768-dim) — is passed through the three linear heads.
[CLS] = start token; [SEP] = separator.
Input: [CLS] context chunk [SEP] claim sentence [SEP]
 ↓ RoBERTa Transformer blocks (12–24 layers)
 ↓ pooler_output = [CLS] token embedding (768-dim)
 ↓
 ├─ tri_layer Linear(768→3) → ALIGNED / CONTRADICT / NEUTRAL ⭐ (primary head)
 ├─ bin_layer Linear(768→2) → ALIGNED / NOT-ALIGNED
 └─ reg_layer Linear(768→1) → score ∈ [0, 1]
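A minimal PyTorch sketch of this head layout, assuming the Hugging Face transformers RoBERTa backbone; the head names follow the diagram above, while the class name, label ordering, and the rest of the code are illustrative rather than the authors' implementation:

```python
import torch
import torch.nn as nn
from transformers import RobertaModel, RobertaTokenizer

class AlignmentHeads(nn.Module):
    """Illustrative sketch: RoBERTa backbone with three jointly trained heads."""

    def __init__(self, backbone: str = "roberta-base"):
        super().__init__()
        self.encoder = RobertaModel.from_pretrained(backbone)
        hidden = self.encoder.config.hidden_size      # 768 for roberta-base
        self.tri_layer = nn.Linear(hidden, 3)         # ALIGNED / CONTRADICT / NEUTRAL
        self.bin_layer = nn.Linear(hidden, 2)         # ALIGNED / NOT-ALIGNED
        self.reg_layer = nn.Linear(hidden, 1)         # continuous score in [0, 1]

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        pooled = out.pooler_output                    # [CLS]-derived embedding
        return {
            "tri": self.tri_layer(pooled),
            "bin": self.bin_layer(pooled),
            "reg": torch.sigmoid(self.reg_layer(pooled)),
        }

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = AlignmentHeads()

# Encode a (context chunk, claim sentence) pair as one sequence:
# [CLS] chunk [SEP] sentence [SEP]
enc = tokenizer("context chunk text ...", "claim sentence text ...",
                return_tensors="pt", truncation=True)
with torch.no_grad():
    heads = model(enc["input_ids"], enc["attention_mask"])
# Assuming label index 0 = ALIGNED (ordering is illustrative):
p_aligned = torch.softmax(heads["tri"], dim=-1)[0, 0]
```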
2.4 MLM in Synthetic Data Augmentation
The MLM (Masked Language Model) component is used only during training, not inference. Negative (NOT-ALIGNED) samples are synthetically generated by masking 25% of tokens in aligned pairs and using MLM to infill them — producing plausible but factually inconsistent pairs.
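A hedged sketch of that augmentation idea using the transformers fill-mask pipeline; the 25% masking rate comes from the description above, while the model choice, helper name, and top-1 infill strategy are illustrative:

```python
import random
from transformers import pipeline

# Use an MLM head to infill masked tokens, turning an ALIGNED claim into a
# fluent but potentially inconsistent NOT-ALIGNED variant (training-time only).
fill_mask = pipeline("fill-mask", model="roberta-base")

def make_negative(claim: str, mask_ratio: float = 0.25) -> str:
    tokens = claim.split()
    n_mask = max(1, int(len(tokens) * mask_ratio))
    for pos in random.sample(range(len(tokens)), n_mask):
        tokens[pos] = fill_mask.tokenizer.mask_token       # "<mask>" for RoBERTa
        candidates = fill_mask(" ".join(tokens))            # ranked infills for this slot
        tokens[pos] = candidates[0]["token_str"].strip()    # keep the highest-scoring one
    return " ".join(tokens)

aligned_claim = "DeepInfer infers data preconditions from trained DNNs."
print(make_negative(aligned_claim))   # plausible variant to label NOT-ALIGNED
```

In practice one would sample an infill that differs from the original token so the resulting pair is genuinely inconsistent rather than an accidental reconstruction.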
2.5 Joint Loss Function
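A plausible form of the joint objective, assuming standard cross-entropy for the two classification heads and mean-squared error for the regression head (consistent with the head types in 2.2–2.3); the exact formulation and weighting in the original paper may differ:

\[
\mathcal{L} = \mathrm{CE}\big(\mathrm{tri\_layer}(x),\, y_{3\text{-way}}\big)
            + \mathrm{CE}\big(\mathrm{bin\_layer}(x),\, y_{\text{binary}}\big)
            + \mathrm{MSE}\big(\mathrm{reg\_layer}(x),\, y_{\text{reg}}\big),
\]

where each training example contributes only the term matching its label type in the 2.6 table (3-way, binary, or regression).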
2.6 Training Dataset
| NLP Task | Datasets | Label Type | Samples |
|---|---|---|---|
| NLI | SNLI, MultiNLI, Adversarial NLI, DocNLI | 3-way / binary | ~2M |
| Fact Verification | FEVER, VitaminC | 3-way | ~579k |
| Paraphrase | QQP, PAWS, WikiText-103 | binary | ~9M |
| STS | SICK, STS Benchmark | regression | ~10k |
| Question Answering | SQuAD v2, RACE | binary | ~481k |
| Info Retrieval | MS MARCO | binary | ~5M |
| Summarization | WikiHow | binary | ~157k |
| Total | 15 datasets, 7 tasks | | 4.7M |

Per-task sample counts are raw dataset sizes; large datasets are subsampled during training, which is why the rows sum to more than the 4.7M examples actually used.
2.7 Scoring Formula
- Split paper into ~350-token chunks at sentence boundaries
- Split summary into individual sentences
- Score all chunk × sentence pairs
- Take max score per summary sentence (best matching chunk)
- Mean of all max scores = final AlignScore
Formally:

\[ \text{AlignScore}(o, l) = \frac{1}{M} \sum_{j=1}^{M} \max_{i} \, \text{alignment}(o_i, l_j) \]

where \(o_i\) are the context chunks, \(l_j\) are the \(M\) claim (summary) sentences, and alignment(·, ·) = P(ALIGNED) from the tri_layer head.
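A minimal sketch of this chunk-by-sentence aggregation, assuming a hypothetical align_prob(chunk, sentence) helper that returns P(ALIGNED) from the tri_layer head:

```python
from typing import Callable, List

def align_score(chunks: List[str], sentences: List[str],
                align_prob: Callable[[str, str], float]) -> float:
    """Mean over summary sentences of the best-matching chunk's P(ALIGNED)."""
    per_sentence_best = []
    for sentence in sentences:
        # Score this claim sentence against every ~350-token context chunk
        # and keep the best (max) alignment probability.
        per_sentence_best.append(max(align_prob(chunk, sentence) for chunk in chunks))
    # Final AlignScore = mean of the per-sentence maxima.
    return sum(per_sentence_best) / len(per_sentence_best)
```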
2.8 Benchmark Performance
| Benchmark | AlignScore-large | Previous SOTA |
|---|---|---|
| SummaC (AUC-ROC avg) | 88.6% | 84.6% (UniEval) |
| TRUE (AUC-ROC avg) | 87.4% | 86.3% (AlignScore-base) |
| Pearson Correlation | 54.1% | 48.6% (QAFactEval) |
Integration in Research Buddy
3.1 Design Philosophy
AlignScore runs locally via Hugging Face — no external API, consistent with the free-tier stack.
It is exposed via two FastAPI endpoints under the /adherence router.
3.2 API Routes
/adherence/score: returns the overall score, confidence level, and reliability flag.
// Input
{ "paper_text": "...", "summary_text": "..." }
// Output
{ "align_score": 0.682, "confidence": "Medium", "is_reliable": false }
/adherence/breakdown: returns a per-sentence breakdown with the top-k supporting chunks.
// Input
{ "paper_text": "...", "summary_text": "...", "top_k": 3 }
// Output
{
"align_score": 0.682, "num_context_chunks": 39,
"num_claim_sentences": 4,
"per_sentence": [
{ "sentence_index": 0, "best_chunk_score": 0.906,
"top_k_chunks": [{"chunk_index":1,"score":0.906}, ...] }
]
}
3.3 Validation
- Empty paper_text or summary_text → HTTP 400
- top_k outside [1, 10] → HTTP 400
- Internal exception → HTTP 500
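A minimal FastAPI sketch of how these checks could be wired into the /adherence/score route; the request and response field names follow 3.2, while the import path, the helper signatures (get_align_score and get_confidence from the pseudocode walkthrough below), and the 0.75 reliability cutoff are assumptions:

```python
from fastapi import APIRouter, HTTPException
from pydantic import BaseModel

# Assumed module path; helper signatures are also assumed.
from app.services.alignscore import get_align_score, get_confidence

router = APIRouter(prefix="/adherence")

class ScoreRequest(BaseModel):
    paper_text: str
    summary_text: str

class ScoreResponse(BaseModel):
    align_score: float
    confidence: str
    is_reliable: bool

RELIABILITY_THRESHOLD = 0.75   # assumed cutoff; 0.682 would be flagged unreliable

@router.post("/score", response_model=ScoreResponse)
def score(req: ScoreRequest) -> ScoreResponse:
    # Empty inputs -> HTTP 400, per the validation rules above.
    if not req.paper_text.strip() or not req.summary_text.strip():
        raise HTTPException(status_code=400,
                            detail="paper_text and summary_text must be non-empty")
    try:
        align = get_align_score(req.paper_text, req.summary_text)
    except Exception as exc:
        # Any internal failure surfaces as HTTP 500.
        raise HTTPException(status_code=500, detail=str(exc))
    return ScoreResponse(
        align_score=round(align, 3),
        confidence=get_confidence(align),
        is_reliable=align >= RELIABILITY_THRESHOLD,
    )
```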
3.4 User Workflow
- Generate a summary with the existing Transformer or Gemini pipeline.
- Call /adherence/score for the overall score, confidence level, and is_reliable flag.
- Call /adherence/breakdown for per-sentence evidence when a deeper audit is needed (a client-side sketch follows this list).
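A client-side sketch of this workflow; the POST method is inferred from the JSON request bodies in 3.2, and the base URL, file names, and timeouts are assumptions:

```python
import requests

BASE_URL = "http://localhost:8000/adherence"   # hypothetical local deployment

paper_text = open("deepinfer_paper.txt", encoding="utf-8").read()    # full paper text
summary_text = open("gemini_summary.txt", encoding="utf-8").read()   # generated summary

payload = {"paper_text": paper_text, "summary_text": summary_text}

# Step 1: overall adherence check.
score = requests.post(f"{BASE_URL}/score", json=payload, timeout=120).json()
print(score["align_score"], score["confidence"], score["is_reliable"])

# Step 2: per-sentence evidence for a deeper audit.
breakdown = requests.post(f"{BASE_URL}/breakdown",
                          json={**payload, "top_k": 3}, timeout=300).json()
for sent in breakdown["per_sentence"]:
    print(sent["sentence_index"], sent["best_chunk_score"], sent["top_k_chunks"])
```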
Pseudocode Walkthrough
The following helpers are traced step-by-step with DeepInfer-flavored examples showing exactly how data flows through the pipeline: chunk_text, get_confidence, inference_core, inference_per_example, and get_align_score.
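An illustrative sketch of two of these helpers, consistent with the ~350-token chunking and the confidence labels seen elsewhere in this document; the whitespace tokenization, sentence splitter, and confidence band boundaries are assumptions:

```python
import re
from typing import List

def chunk_text(paper_text: str, max_tokens: int = 350) -> List[str]:
    """Split the paper into ~max_tokens-token chunks at sentence boundaries."""
    # Naive sentence split and whitespace token counting, for illustration only.
    sentences = re.split(r"(?<=[.!?])\s+", paper_text.strip())
    chunks, current, current_len = [], [], 0
    for sentence in sentences:
        n_tokens = len(sentence.split())
        if current and current_len + n_tokens > max_tokens:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n_tokens
    if current:
        chunks.append(" ".join(current))
    return chunks

def get_confidence(align_score: float) -> str:
    """Map a score in [0, 1] to a coarse confidence label (assumed band edges)."""
    if align_score >= 0.80:
        return "High"
    if align_score >= 0.60:
        return "Medium"      # e.g. the 0.682 DeepInfer run reports "Medium"
    return "Low"
```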
Case Study: DeepInfer Paper Evaluation
5.1 Paper Overview
"Inferring Data Preconditions from Deep Learning Models for Trustworthy Prediction in Deployment" (Ahmed, Gao & Rajan, ICSE 2024) proposes DeepInfer — a technique to infer data preconditions from trained DNNs using weakest precondition calculus.
5.2 Demo Video
5.3 Test Setup
- Input: Full paper text → 39 context chunks
- Summary: 4-sentence Gemini-generated summary
- Parameters: top_k=3
5.4 Aggregate Results

| Metric | Value |
|---|---|
| align_score | 0.682 |
| confidence | Medium |
| is_reliable | false |
| Context chunks | 39 |
| Claim sentences | 4 |
5.5 Per-Sentence Breakdown
5.6 Analysis
Sentences describing core contributions (0–1) score strongly (74–90%), closely mirroring the abstract. Sentences summarizing quantitative results (2–3) score lower (53–55%), suggesting the Gemini summary paraphrased numerical claims at a level of abstraction that weakened chunk alignment.
The is_reliable: false flag correctly signals that while the summary captures the gist, it
may not be faithful enough for citation or critical analysis.