William A. Ingram

Learning from LLM Disagreement in Retrieval Evaluation

2025-12-19T17:00:00+00:00

Our JCDL 2025 paper with Bipasha Banerjee and Edward A. Fox examines how model disagreement changes retrieval evaluation when large language models filter scholarly records before ranking. “Learning from LLM Disagreement in Retrieval Evaluation” shows that disagreement between relevance labelers can identify cases near the boundary of an information need. In thematic retrieval tasks, particularly ones involving Sustainable Development Goals (SDGs), those boundary cases determine which records remain available to a dashboard, bibliography, or downstream synthesis.¹

Universities and research organizations use bibliographic databases and digital library systems to describe how research contributes to strategic priorities. SDG mapping illustrates the retrieval problem. A university may want to know which publications support affordable clean energy, good health and well-being, or poverty reduction. Existing workflows often begin with Boolean search queries, including the SDG query sets used in Scopus and related bibliometric tools.

Boolean queries retrieve documents that contain the selected terms, but they do not determine whether a document makes a substantive contribution to the goal. A paper may mention “energy,” “poverty,” or “health” without advancing an SDG target. LLMs therefore enter these workflows as post-retrieval filters, reading abstracts and assigning semantic relevance labels after keyword retrieval has produced a candidate set.

Semantic relevance judgment is not a stable measuring instrument. In our study, two locally hosted open-weight models, LLaMA 3.1-8B and Qwen 2.5-7B, labeled the same abstract-SDG pairs as relevant or non-relevant using the same structured prompt. The models often agreed, but their disagreements occurred in the ambiguous middle, where the relation between a publication and an SDG was plausible but not obvious.

We built a corpus from Elsevier’s 2023 SDG-aligned Boolean queries, retrieved up to 20,000 Scopus records for each of the 17 SDGs, and cleaned the resulting metadata. The experiments focused on three goals that occupy different regions of the SDG co-occurrence structure. SDG 1 (No Poverty) clusters with social and governance goals, SDG 3 (Good Health and Well-Being) represents the health domain, and SDG 7 (Affordable and Clean Energy) anchors the technical and environmental cluster. After deduplication and cleaning, the working set for those three goals contained 46,755 labeled rows representing 46,573 unique abstracts.

Each model evaluated whether an abstract made a substantive contribution to the indicated SDG. We then isolated four kinds of cases: documents both models labeled relevant, documents both labeled non-relevant, documents only LLaMA labeled relevant, and documents only Qwen labeled relevant. This partition made disagreement the object of analysis rather than a residual error category.
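The four-way partition can be sketched in a few lines. This is a minimal illustration, not the project code: the function name `partition_cases`, the cell names, and the toy labels are all assumptions made up for the example.

```python
from collections import Counter

def partition_cases(llama_labels, qwen_labels):
    """Assign each abstract-SDG pair to one of four agreement cells."""
    if len(llama_labels) != len(qwen_labels):
        raise ValueError("both models must label the same pairs")
    cells = []
    for llama, qwen in zip(llama_labels, qwen_labels):
        if llama and qwen:
            cells.append("both_relevant")
        elif not llama and not qwen:
            cells.append("both_non_relevant")
        elif llama:
            cells.append("llama_only")
        else:
            cells.append("qwen_only")
    return cells

# Toy labels invented for illustration (True = relevant); not study data.
llama = [True, True, False, True, False, False]
qwen = [True, False, False, True, True, False]

cells = partition_cases(llama, qwen)
counts = Counter(cells)

# Indices of the pairs where exactly one model assigned relevance.
disagreement_pool = [i for i, c in enumerate(cells)
                     if c in ("llama_only", "qwen_only")]
```

Keeping the two one-sided cells separate, rather than merging them into a single "disagree" bucket, is what allows the later comparisons between LLaMA-relevant and Qwen-relevant documents.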

The analysis measured agreement and Cohen’s kappa, compared lexical patterns in the disagreement subsets with TF-IDF and permutation tests, simulated ranked retrieval over the ambiguous cases, and trained logistic regression classifiers to test whether lexical features predicted which model assigned relevance.

Across the three SDGs, the models assigned the same label in 83.6% of cases, but Cohen’s kappa was only 0.467. Raw agreement overstated reliability because both models labeled many abstracts as relevant. Kappa showed weaker reliability once chance agreement and class imbalance were taken into account. The disagreement region, roughly 15–20% of decisions per SDG, was a structured set of borderline cases rather than a random residue.
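The gap between raw agreement and kappa is easy to reproduce on toy data. The sketch below applies the standard two-rater Cohen's kappa formula to binary labels. The function name and the label counts are invented for illustration and are not the study's figures.

```python
def agreement_and_kappa(a, b):
    """Return (raw agreement, Cohen's kappa) for two binary label lists."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n  # observed agreement
    p_a = sum(a) / n  # share labeled relevant by the first rater
    p_b = sum(b) / n  # share labeled relevant by the second rater
    # Agreement expected by chance if the raters labeled independently.
    p_e = p_a * p_b + (1 - p_a) * (1 - p_b)
    if p_e == 1:
        return p_o, 1.0  # degenerate case: both raters use a single label
    return p_o, (p_o - p_e) / (1 - p_e)

# Hypothetical imbalanced sample of 100 pairs (1 = relevant):
# 80 both relevant, 5 both non-relevant, 8 first-only, 7 second-only.
labels_a = [1] * 80 + [0] * 5 + [1] * 8 + [0] * 7
labels_b = [1] * 80 + [0] * 5 + [0] * 8 + [1] * 7

raw, kappa = agreement_and_kappa(labels_a, labels_b)
# Raw agreement is 0.85, but kappa is only about 0.31, because two raters
# who both say "relevant" most of the time would often agree by chance.
```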

Agreement between LLaMA and Qwen was concentrated in shared relevant labels, which explains why raw agreement and Cohen's kappa diverged across SDGs.

We ran a negative control to test whether the shared relevant labels reflected a general tendency to include documents. Applying the SDG 7 energy prompt to abstracts retrieved by the SDG 1 poverty query produced 91% agreement, with Cohen’s kappa of 0.59. The models jointly assigned non-relevance to 84% of cases, jointly assigned relevance to 7%, and disagreed on 8%. The main experiment therefore cannot be reduced to affirmative-label bias; the models converged on rejection when the prompt and candidate set described conceptually distant SDGs.

Lexical analysis identified model-specific relevance criteria. For SDG 1, LLaMA assigned relevance more often to documents using healthcare access terms such as health, care, insurance, and coverage, while Qwen assigned relevance more often to documents using terms associated with structural inequality, wealth, income, and taxation. For SDG 3, LLaMA assigned relevance more often to clinical and procedural terms, while Qwen assigned relevance more often to molecular and cellular terms. For SDG 7, LLaMA assigned relevance more often to systems and infrastructure terms, while Qwen assigned relevance more often to electrochemistry and battery terms. The FDR-adjusted p-values for the reported terms were below 0.001.

Top differentiating terms between LLaMA-relevant and Qwen-relevant documents in the disagreement subsets. Positive values indicate terms with higher mean TF-IDF in LLaMA-relevant documents; negative values indicate terms with higher mean TF-IDF in Qwen-relevant documents.

SDG	LLaMA-relevant terms	Qwen-relevant terms
SDG 1 No Poverty	health (+0.019), care (+0.014), insurance (+0.013), covid (+0.010), coverage (+0.010)	inequality (-0.020), wealth (-0.012), income (-0.009), tax (-0.008), political (-0.005)
SDG 3 Health	patients (+0.023), risk (+0.007), tavr (+0.007), stroke (+0.006), coronary (+0.006)	cells (-0.019), cancer (-0.018), cell (-0.017), tumor (-0.015), human (-0.007)
SDG 7 Energy	fuel (+0.006), computing (+0.006), neural (+0.004), plasma (+0.004), network (+0.005)	lithium (-0.018), capacity (-0.018), ion (-0.016), batteries (-0.016), anode (-0.015)
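For readers who want to see how a value like those in the table could be tested, here is a minimal sketch of a two-sided permutation test on the difference in mean TF-IDF for one term. It assumes the per-document TF-IDF weights are already computed, uses invented scores, and leaves out the FDR adjustment across terms. The function name and parameters are hypothetical and do not come from the project code.

```python
import random

def permutation_test(group_a, group_b, n_perm=2000, seed=0):
    """Two-sided permutation test on the difference of group means."""
    rng = random.Random(seed)
    observed = sum(group_a) / len(group_a) - sum(group_b) / len(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    n_b = len(pooled) - n_a
    extreme = 0
    for _ in range(n_perm):
        # Shuffle the pooled scores to break any link to the model label.
        rng.shuffle(pooled)
        diff = sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    p_value = (extreme + 1) / (n_perm + 1)
    return observed, p_value

# Invented TF-IDF weights for a single term in each disagreement subset.
llama_scores = [0.05, 0.04, 0.06, 0.05, 0.07, 0.04, 0.06, 0.05]
qwen_scores = [0.00, 0.01, 0.00, 0.02, 0.01, 0.00, 0.01, 0.00]

mean_diff, p_value = permutation_test(llama_scores, qwen_scores)
# A positive mean_diff follows the table's sign convention:
# the term carries more weight in LLaMA-relevant documents.
```

Running one such test per vocabulary term produces many p-values at once, which is why a multiple-comparison correction such as the FDR adjustment reported above is needed before interpreting individual terms.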

The retrieval experiments show a direct consequence for ranked output. Under a fixed scoring function applied to the same disagreement pool, the top-ranked documents changed according to the model that filtered the candidate set. In SDG 7, the LLaMA-relevant subset contained 19 of the top 20 centroid-ranked disagreement documents, while the Qwen-relevant subset contained one. The ranking logic was held constant, so the difference came from the earlier relevance filter.
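A small sketch makes the mechanism concrete: score every document in the disagreement pool against one fixed centroid, then count which model's subset supplies the top-ranked documents. The vectors, the centroid, and the helper names below are invented for illustration. The paper's actual document representation and centroid construction are not reproduced here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_k_by_source(docs, centroid, k):
    """Rank (doc_id, vector, source) tuples by similarity to the centroid
    and count how many of the top k come from each model's subset."""
    ranked = sorted(docs, key=lambda d: cosine(d[1], centroid), reverse=True)
    counts = {}
    for _, _, source in ranked[:k]:
        counts[source] = counts.get(source, 0) + 1
    return [doc_id for doc_id, _, _ in ranked[:k]], counts

# Hypothetical two-dimensional vectors; real systems would use
# TF-IDF or embedding vectors with many more dimensions.
centroid = [1.0, 0.0]
pool = [
    ("d1", [0.9, 0.1], "llama_only"),
    ("d2", [0.8, 0.3], "llama_only"),
    ("d3", [0.2, 0.9], "qwen_only"),
    ("d4", [0.7, 0.6], "qwen_only"),
]

top_ids, source_counts = top_k_by_source(pool, centroid, k=3)
```

Because the scoring function is identical for every document, any skew in `source_counts` reflects which documents each filter admitted, not a difference in ranking logic.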

A separate classification experiment showed that disagreement was learnable from lexical features. Logistic regression classifiers trained on TF-IDF features predicted which model labeled a disagreement document as relevant with AUC scores above chance for all three goals. The AUC was 0.739 for SDG 1, 0.753 for SDG 3, and 0.703 for SDG 7. These results do not identify either model as correct. They show that the models used different, learnable lexical criteria when assigning relevance.
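To read those AUC values, it helps to recall what the metric measures: the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with 0.5 corresponding to chance. The sketch below computes that quantity directly from classifier scores. The coding of 1 for LLaMA-relevant and 0 for Qwen-relevant, the function name, and the scores are assumptions for illustration only.

```python
def roc_auc(labels, scores):
    """Pairwise AUC: share of (positive, negative) pairs ranked correctly,
    counting ties as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Invented classifier scores for six disagreement documents.
# Label 1 = only LLaMA said relevant, 0 = only Qwen said relevant.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2]

auc = roc_auc(labels, scores)  # 8 of the 9 pairs are ordered correctly
```

This quadratic pairwise form is fine for a small illustration; for large disagreement sets a rank-based implementation from a statistics library would be the practical choice.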

The paper does not use LLM labels as ground truth. For subjective retrieval tasks, including policy-relevant tasks such as SDG assessment, a single definitive label may not exist. A publication can contribute to a goal directly, indirectly, methodologically, or under a particular interpretation of the target. In that setting, the better evaluation question is how each model changes the corpus available for ranking and synthesis.

A single LLM filter cannot be described as neutral preprocessing in SDG retrieval. It determines document eligibility before ranking begins and can remove alternative interpretations of relevance from the result set. A dashboard, literature review, or retrieval-augmented generation workflow built on a filtered corpus inherits those exclusions.

Digital library systems that use LLM filtering should report and inspect disagreement sets rather than relying on aggregate agreement. Audits should identify which topics each model admits or excludes, which disciplines gain or lose representation, which lexical cues mark the edge of relevance, and which documents disappear before ranking begins.

Retrieval workflows should expose model disagreement before downstream synthesis. Multi-model filtering, human review of disagreement cases, and audits of model justifications can locate where model-specific criteria enter the retrieval process. Extending the analysis to additional models, domains, and RAG-based policy briefs would measure how filtering variability changes the substantive content of generated outputs.

When LLMs disagree in thematic retrieval, the disagreement identifies the documents most sensitive to the definition of relevance. Those documents require analysis before the filtered corpus is used for institutional reporting or evidence synthesis.

The paper is available through IEEE with DOI 10.1109/JCDL67857.2025.00024, and the project code is available on GitHub.

William A. Ingram, Bipasha Banerjee, and Edward A. Fox. 2025. Learning from LLM Disagreement in Retrieval Evaluation. In 2025 ACM/IEEE Joint Conference on Digital Libraries (JCDL). IEEE. https://doi.org/10.1109/JCDL67857.2025.00024 ↩

The VTechAGP Dataset: A Benchmark for Academic-to-General-Audience Paraphrasing

2025-02-10T20:58:00+00:00

I recently collaborated with Ming Cheng and Jiaying Gong, two members of the Machine Learning Laboratory research team led by Dr. Hoda Eldardiry. We created the VTechAGP dataset to support research on text simplification and paraphrase generation.

Motivation

Non-specialists often struggle with academic writing due to its discipline-specific jargon and rigid conventions. Text simplification research seeks to mitigate this challenge by reducing linguistic complexity, primarily through lexical and syntactic modifications. Existing datasets are restricted to sentence-level transformations and lack coverage across multiple domains, limiting their utility for broader applications. While domain-specific datasets, such as those in medicine and law, provide targeted resources, they do not support general-purpose simplification across disciplines. For a comparison of existing text simplification and paraphrase datasets, see Table 5 in the appendix of Cheng et al. (2024).¹

Institutional Context and Rationale

To address the lack of broadly applicable, document-level datasets, we propose VTechAGP,² which provides a parallel corpus that pairs full academic abstracts with their general-audience counterparts. These pairs were collected from VTechWorks, Virginia Tech’s institutional repository, which includes Electronic Theses and Dissertations (ETDs) along with other scholarly materials. The Graduate School’s ETD policies require students to submit both a traditional academic abstract and a general-audience abstract as part of their thesis or dissertation. Because these abstracts are written for distinct audiences but correspond to the same work, they provide a basis for analyzing differences in lexical complexity, syntactic variation, and semantic focus between academic and general-audience writing. By capturing these distinctions at the document level, the dataset serves as a resource for research in text simplification and domain adaptation in NLP.

Dataset Construction

To build the dataset, I collected academic and general-audience abstracts from VTechWorks using the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). Each record includes:

A traditional academic abstract.
A corresponding general-audience abstract.
Metadata such as title, discipline, department, and degree information.
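For readers unfamiliar with OAI-PMH, the following is a minimal sketch of a harvesting step, not the script used to build the dataset. The `ListRecords` verb, the `metadataPrefix` parameter, and the XML namespaces come from the OAI-PMH and Dublin Core specifications. The base URL, the function names, and the sample response are hypothetical, and whether both abstracts appear as `dc:description` elements in a given repository is an assumption.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespaces defined by OAI-PMH 2.0 and Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build a ListRecords request URL for an OAI-PMH endpoint."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"

def parse_records(xml_text):
    """Extract the title and all description fields from each record."""
    root = ET.fromstring(xml_text.strip())
    records = []
    for record in root.iterfind(".//oai:record", NS):
        title = record.findtext(".//dc:title", default="", namespaces=NS)
        descriptions = [d.text or ""
                        for d in record.iterfind(".//dc:description", NS)]
        records.append({"title": title, "descriptions": descriptions})
    return records

# Hypothetical endpoint; substitute the repository's real OAI-PMH base URL.
url = list_records_url("https://repository.example.edu/oai/request")

# Invented response fragment, trimmed to the elements parsed above.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example Thesis</dc:title>
          <dc:description>Academic abstract text.</dc:description>
          <dc:description>General-audience abstract text.</dc:description>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
"""

records = parse_records(SAMPLE_RESPONSE)
```

A full harvest would also need to follow the `resumptionToken` element that OAI-PMH servers return when a result list is split across several responses. That loop is omitted here.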

The dataset consists of paired academic and general-audience abstracts, allowing for document-level analysis of structural and linguistic differences. The pairing supports NLP applications such as document-level paraphrasing and retrieval-based reformulation.

Ming Cheng, Jiaying Gong, Chenhan Yuan, William A. Ingram, Edward A. Fox, and Hoda Eldardiry. 2024. VTechAGP: An Academic-to-General-Audience Text Paraphrase Dataset and Benchmark Models. arXiv:2411.04825. https://doi.org/10.48550/arXiv.2411.04825 ↩
The VTechAGP dataset is publicly available via Zenodo (DOI: 10.5281/zenodo.14833933) and GitHub, distributed under the Open Data Commons Attribution License (ODC-By). ↩

Small, Locally-Hosted LLMs for Sustainable Development Goal Classification

2024-12-05T17:36:48+00:00

We are excited to announce that “Agentic AI for Improving Precision in Identifying Contributions to Sustainable Development Goals” has been accepted as a poster at the 2024 IEEE International Conference on Big Data (IEEE BigData 2024), which will take place December 15–18, 2024, in Washington, DC.

About the Study

Accurately assessing research contributions to the United Nations’ Sustainable Development Goals (SDGs) is a growing priority for academic institutions. Traditional methods, which rely heavily on keyword-based Boolean search queries, often conflate incidental keyword matches with genuine contributions to SDG targets, leading to reduced precision in bibliometric analyses.

Our study proposes a novel approach: leveraging small, locally hosted Large Language Models (LLMs) as evaluation agents to address the limitations of keyword-based retrieval. Using a dataset of 340,000 abstracts retrieved via SDG-specific keyword queries, we demonstrated how these models can distinguish between semantically relevant contributions to SDG targets and incidental mentions.

Key Highlights

Novel Application: We evaluated three small, locally hosted LLMs (Mistral-7B, Phi-3.5-mini, and Llama-3.2) for their ability to classify SDG-related research contributions with greater precision than traditional methods.
Improved Precision: These models leverage their semantic understanding to move beyond surface-level keyword matching, addressing key limitations in traditional SDG classification workflows.
Scalability: By running these LLMs locally, the approach offers a cost-efficient and scalable framework for institutions to align research with the SDGs.

Why It Matters

This work represents a step forward in SDG-related research evaluation, providing a more nuanced and precise approach to classifying scholarly contributions. The findings have broader implications for institutional benchmarking, funding strategies, and semantic search applications.

Future Directions

Our research paves the way for:

Developing multi-agent frameworks that combine multiple models to refine classification further.
Applying these techniques in semantic search systems to enable more effective discovery of SDG-relevant research.

Read the Full Preprint

The full preprint of our work is available on arXiv: https://arxiv.org/abs/2411.17598.

We look forward to presenting this work at IEEE BigData 2024 and engaging with the community on the potential of LLMs in advancing SDG-related research.