Learning from LLM Disagreement in Retrieval Evaluation
Our JCDL 2025 paper with Bipasha Banerjee and Edward A. Fox examines how model disagreement changes retrieval evaluation when large language models filter scholarly records before ranking. “Learning from LLM Disagreement in Retrieval Evaluation” shows that disagreement between relevance labelers can identify cases near the boundary of an information need. In thematic retrieval tasks, particularly ones involving Sustainable Development Goals (SDGs), those boundary cases determine which records remain available to a dashboard,...