Learning from LLM Disagreement in Retrieval Evaluation
Abstract
Large language models (LLMs) are being integrated into information retrieval pipelines within digital library systems for tasks such as re-ranking and filtering. However, different LLMs often disagree on borderline classification cases, raising concerns about how this variability affects downstream retrieval and the integrity of digital library collections. This study examines disagreement between two open-weight LLMs, LLaMA and Qwen, when they evaluate a corpus of scholarly abstracts for their contribution to Sustainable Development Goals (SDGs). We isolate subsets of documents on which the models disagree and examine their lexical properties, rank-order behavior, and classification predictability. Our results demonstrate that this disagreement is not random: it concentrates in ambiguous cases, produces divergent top-k outputs under shared scoring functions, and is separable by logistic regression with AUCs above 0.74. These findings suggest that LLM-based filtering introduces structured variability into document retrieval, even under controlled prompting and shared ranking logic. We propose treating classification disagreement as an object of analysis in retrieval evaluation, particularly for subjective or thematic search tasks.
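The abstract's claim that disagreement is "separable with AUCs above 0.74 using logistic regression" can be illustrated with a minimal sketch. Everything below is a hypothetical reconstruction, not the paper's actual pipeline: the corpus is synthetic, and the lexical features (document length, type-token ratio, hedge-word count) are illustrative assumptions about what might distinguish ambiguous abstracts.

```python
# Hypothetical sketch: predicting whether two LLMs will disagree on a document,
# using simple lexical features and logistic regression (pure stdlib, no sklearn).
# Synthetic data and feature choices are assumptions, not the paper's method.
import math

def featurize(text):
    """Illustrative lexical features: scaled length, type-token ratio, hedge count."""
    tokens = text.lower().split()
    n = len(tokens)
    ttr = len(set(tokens)) / n if n else 0.0
    hedges = sum(t in {"may", "might", "could", "potentially"} for t in tokens)
    return [n / 100.0, ttr, float(hedges)]

def train_logreg(X, y, lr=0.5, epochs=500):
    """Logistic regression fit by stochastic gradient descent on log-loss."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - yi  # gradient of log-loss w.r.t. z
            w = [wj - lr * g * xj for wj, xj in zip(w, xi)]
            b -= lr * g
    return w, b

def predict(w, b, x):
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

def auc(scores, labels):
    """AUC as P(random positive outscores random negative), ties counted 0.5."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Synthetic corpus: label 1 = models disagreed (hedged, ambiguous abstracts),
# label 0 = models agreed (direct, concrete abstracts).
docs = [
    ("this work may potentially relate to several sustainable goals", 1),
    ("the framework might indirectly support equitable education outcomes", 1),
    ("results could potentially inform climate adaptation policy", 1),
    ("the method may have implications for goal three targets", 1),
    ("we measure water quality improvements in rural districts", 0),
    ("the system reduces irrigation energy use by a fixed margin", 0),
    ("this trial evaluates vaccine coverage in two provinces", 0),
    ("the dataset tracks renewable generation across national grids", 0),
]
X = [featurize(t) for t, _ in docs]
y = [label for _, label in docs]
w, b = train_logreg(X, y)
scores = [predict(w, b, x) for x in X]
auc_val = auc(scores, y)
print(f"training AUC: {auc_val:.2f}")
```

On this toy, separable corpus the training AUC is near 1.0; the paper's reported values above 0.74 on held-out real data are what make the "disagreement is not random" claim substantive.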
Citation
Ingram, William A., Bipasha Banerjee, and Edward A. Fox. 2025. “Learning from LLM Disagreement in Retrieval Evaluation.” In Proceedings of the 2025 ACM/IEEE Joint Conference on Digital Libraries (JCDL ’25), Virtual Event, pp. 129–138. doi: 10.1109/JCDL67857.2025.00024
BibTeX
@inproceedings{ingram2025learning,
title = {Learning from LLM Disagreement in Retrieval Evaluation},
author = {Ingram, William A. and Banerjee, Bipasha and Fox, Edward A.},
year = {2025},
booktitle = {Proceedings of the 2025 ACM/IEEE Joint Conference on Digital Libraries},
series = {JCDL '25},
location = {Virtual Event},
pages = {129--138},
doi = {10.1109/JCDL67857.2025.00024}
}