<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.4.1">Jekyll</generator><link href="https://waingram.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://waingram.github.io/" rel="alternate" type="text/html" /><updated>2026-07-03T03:30:53+00:00</updated><id>https://waingram.github.io/feed.xml</id><title type="html">William A. Ingram</title><subtitle>William A. Ingram is an academic leader and researcher who studies how scientific knowledge and scholarly artifacts become machine-usable for discovery, reasoning, and reuse.</subtitle><entry><title type="html">Learning from LLM Disagreement in Retrieval Evaluation</title><link href="https://waingram.github.io/blog/learning-from-llm-disagreement-in-retrieval-evaluation/" rel="alternate" type="text/html" title="Learning from LLM Disagreement in Retrieval Evaluation" /><published>2025-12-19T17:00:00+00:00</published><updated>2025-12-19T17:00:00+00:00</updated><id>https://waingram.github.io/blog/learning-from-llm-disagreement-in-retrieval-evaluation</id><content type="html" xml:base="https://waingram.github.io/blog/learning-from-llm-disagreement-in-retrieval-evaluation/"><![CDATA[<p>Our <a href="https://2025.jcdl.org/">JCDL 2025</a> paper with <a href="https://bipasha-banerjee.github.io/">Bipasha Banerjee</a> and <a href="https://fox.cs.vt.edu/">Edward A. Fox</a> examines how model disagreement changes retrieval evaluation when large language models filter scholarly records before ranking. “Learning from LLM Disagreement in Retrieval Evaluation” shows that disagreement between relevance labelers can identify cases near the boundary of an information need. 
In thematic retrieval tasks, particularly ones involving <a href="https://sdgs.un.org/goals">Sustainable Development Goals (SDGs)</a>, those boundary cases determine which records remain available to a dashboard, bibliography, or downstream synthesis.<sup id="fnref:1"><a href="#fn:1" class="footnote" rel="footnote" role="doc-noteref">1</a></sup></p>

<p>Universities and research organizations use bibliographic databases and digital library systems to describe how research contributes to strategic priorities. SDG mapping illustrates the retrieval problem. A university may want to know which publications support affordable clean energy, good health and well-being, or poverty reduction. Existing workflows often begin with Boolean search queries, including the SDG query sets used in <a href="https://www.scopus.com/">Scopus</a> and related bibliometric tools.</p>

<p>Boolean queries retrieve documents that contain the selected terms, but they do not determine whether a document makes a substantive contribution to the goal. A paper may mention “energy,” “poverty,” or “health” without advancing an SDG target. LLMs therefore enter these workflows as post-retrieval filters, reading abstracts and assigning semantic relevance labels after keyword retrieval has produced a candidate set.</p>

<p>Semantic relevance judgment is not a stable measuring instrument. In our study, two locally hosted open-weight models, <a href="https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct">LLaMA 3.1-8B</a> and <a href="https://huggingface.co/Qwen/Qwen2.5-7B-Instruct">Qwen 2.5-7B</a>, labeled the same abstract-SDG pairs as relevant or non-relevant using the same structured prompt. The models often agreed, but their disagreements occurred in the ambiguous middle, where the relation between a publication and an SDG was plausible but not obvious.</p>

<p>We built a corpus from <a href="https://elsevier.digitalcommonsdata.com/datasets/y2zyy9vwzy/1">Elsevier’s 2023 SDG-aligned Boolean queries</a>, retrieved up to 20,000 Scopus records for each of the 17 SDGs, and cleaned the resulting metadata. The experiments focused on three goals that occupy different regions of the SDG co-occurrence structure. SDG 1 (No Poverty) clusters with social and governance goals, SDG 3 (Good Health and Well-Being) represents the health domain, and SDG 7 (Affordable and Clean Energy) anchors the technical and environmental cluster. After deduplication and cleaning, the working set for those three goals contained 46,755 labeled rows representing 46,573 unique abstracts.</p>

<p>Each model evaluated whether an abstract made a substantive contribution to the indicated SDG. We then isolated four kinds of cases: documents both models labeled relevant, documents both labeled non-relevant, documents only LLaMA labeled relevant, and documents only Qwen labeled relevant. This partition made disagreement the object of analysis rather than a residual error category.</p>
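
<p>As a minimal sketch of this partition, assuming each record carries one boolean label per model, the four cells could be built as follows. The function names, cell names, and toy records are illustrative assumptions and are not taken from the project code.</p>

```python
def agreement_cell(llama_relevant, qwen_relevant):
    """Name the agreement cell for one abstract-SDG pair."""
    if llama_relevant and qwen_relevant:
        return "both_relevant"
    if not llama_relevant and not qwen_relevant:
        return "both_non_relevant"
    return "llama_only" if llama_relevant else "qwen_only"


def partition(records):
    """Group document ids by cell.

    records: iterable of (doc_id, llama_relevant, qwen_relevant) tuples.
    """
    cells = {
        "both_relevant": [],
        "both_non_relevant": [],
        "llama_only": [],
        "qwen_only": [],
    }
    for doc_id, llama_relevant, qwen_relevant in records:
        cells[agreement_cell(llama_relevant, qwen_relevant)].append(doc_id)
    return cells


# Toy records (hypothetical ids and labels, not study data).
toy_records = [
    ("a1", True, True),
    ("a2", False, False),
    ("a3", True, False),
    ("a4", False, True),
    ("a5", True, True),
]
toy_cells = partition(toy_records)
disagreement_ids = toy_cells["llama_only"] + toy_cells["qwen_only"]
```

<p>Keeping the two disagreement cells separate, rather than merging them into one error bucket, is what allows each model's admitted documents to be compared later.</p>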

<p>The analysis measured agreement and Cohen’s kappa, compared lexical patterns in the disagreement subsets with TF-IDF and permutation tests, simulated ranked retrieval over the ambiguous cases, and trained logistic regression classifiers to test whether lexical features predicted which model assigned relevance.</p>

<p>Across the three SDGs, the models assigned the same label in 83.6% of cases, but Cohen’s kappa was only 0.467. Raw agreement overstated reliability because both models labeled many abstracts as relevant. Kappa showed weaker reliability once chance agreement and class imbalance were taken into account. The disagreement region, roughly 15–20% of decisions per SDG, was a structured set of borderline cases rather than a random residue.</p>
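
<p>The gap between raw agreement and kappa can be illustrated with a small calculation. The sketch below uses the standard definition of Cohen's kappa, (observed agreement minus chance agreement) divided by (1 minus chance agreement), on toy counts chosen only to show the effect of class imbalance. The counts are assumptions and are not the study's data.</p>

```python
def agreement_and_kappa(labels_a, labels_b):
    """Return (raw agreement, Cohen's kappa) for two equal-length label lists."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal rate, summed over labels.
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined.
        return observed, float("nan")
    return observed, (observed - expected) / (1.0 - expected)


# Toy counts: 78 both relevant, 6 both non-relevant, 8 + 8 disagreements.
toy_a = [1] * 78 + [0] * 6 + [1] * 8 + [0] * 8
toy_b = [1] * 78 + [0] * 6 + [0] * 8 + [1] * 8
toy_agreement, toy_kappa = agreement_and_kappa(toy_a, toy_b)
print(f"agreement={toy_agreement:.2f} kappa={toy_kappa:.3f}")
```

<p>In this toy case both raters mark 86% of items relevant, so most of the 84% raw agreement is expected by chance and kappa falls to roughly 0.34.</p>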

<figure class="paper-figure">
  <img src="/img/blog/llm-agreement-breakdown.png" alt="Stacked bar chart showing both non-relevant labels, both relevant labels, and model disagreement for SDG 1, SDG 3, and SDG 7, with kappa values of 0.51, 0.40, and 0.43." />
  <figcaption>Agreement between LLaMA and Qwen was concentrated in shared relevant labels, which explains why raw agreement and Cohen's kappa diverged across SDGs.</figcaption>
</figure>

<p>We ran a negative control to test whether the shared relevant labels reflected a general tendency to include documents. Applying the SDG 7 energy prompt to abstracts retrieved by the SDG 1 poverty query produced 91% agreement, with Cohen’s kappa of 0.59. The models jointly assigned non-relevance to 84% of cases, jointly assigned relevance to 7%, and disagreed on 8%. The main experiment therefore cannot be reduced to affirmative-label bias; the models converged on rejection when the prompt and candidate set described conceptually distant SDGs.</p>

<p>Lexical analysis identified model-specific relevance criteria. For SDG 1, LLaMA assigned relevance more often to documents using healthcare access terms such as health, care, insurance, and coverage, while Qwen assigned relevance more often to documents using terms associated with structural inequality, wealth, income, and taxation. For SDG 3, LLaMA assigned relevance more often to clinical and procedural terms, while Qwen assigned relevance more often to molecular and cellular terms. For SDG 7, LLaMA assigned relevance more often to systems and infrastructure terms, while Qwen assigned relevance more often to electrochemistry and battery terms. The FDR-adjusted p-values for the reported terms were below 0.001.</p>

<figure class="paper-table" aria-labelledby="sdg-terms-table-caption">
  <figcaption id="sdg-terms-table-caption">Top differentiating terms between LLaMA-relevant and Qwen-relevant documents in the disagreement subsets. Positive values indicate terms with higher mean TF-IDF in LLaMA-relevant documents; negative values indicate terms with higher mean TF-IDF in Qwen-relevant documents.</figcaption>
  <div class="paper-table-scroll" role="region" aria-label="Top differentiating terms by SDG" tabindex="0">
    <table aria-describedby="sdg-terms-table-caption">
      <thead>
        <tr>
          <th scope="col">SDG</th>
          <th scope="col">LLaMA-relevant terms</th>
          <th scope="col">Qwen-relevant terms</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <th scope="row">SDG 1<br /><span>No Poverty</span></th>
          <td>health (+0.019), care (+0.014), insurance (+0.013), covid (+0.010), coverage (+0.010)</td>
          <td>inequality (-0.020), wealth (-0.012), income (-0.009), tax (-0.008), political (-0.005)</td>
        </tr>
        <tr>
          <th scope="row">SDG 3<br /><span>Health</span></th>
          <td>patients (+0.023), risk (+0.007), tavr (+0.007), stroke (+0.006), coronary (+0.006)</td>
          <td>cells (-0.019), cancer (-0.018), cell (-0.017), tumor (-0.015), human (-0.007)</td>
        </tr>
        <tr>
          <th scope="row">SDG 7<br /><span>Energy</span></th>
          <td>fuel (+0.006), computing (+0.006), neural (+0.004), plasma (+0.004), network (+0.005)</td>
          <td>lithium (-0.018), capacity (-0.018), ion (-0.016), batteries (-0.016), anode (-0.015)</td>
        </tr>
      </tbody>
    </table>
  </div>
</figure>
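
<p>The comparison behind the table can be sketched as a permutation test on the difference in mean TF-IDF weight for a single term, followed by an adjustment across terms. This is a sketch under stated assumptions: the two-sided test, the number of permutations, the use of the Benjamini-Hochberg procedure for the FDR adjustment, and the toy weights are all illustrative choices, and the post does not specify which of them the paper used.</p>

```python
import random
from statistics import mean


def mean_difference(llama_weights, qwen_weights):
    """Positive when the term has higher mean TF-IDF in LLaMA-relevant documents."""
    return mean(llama_weights) - mean(qwen_weights)


def permutation_p_value(llama_weights, qwen_weights, n_permutations=2000, seed=0):
    """Two-sided permutation p-value for the difference in group means."""
    rng = random.Random(seed)
    observed = abs(mean_difference(llama_weights, qwen_weights))
    pooled = list(llama_weights) + list(qwen_weights)
    n_llama = len(llama_weights)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # reassign documents to the two groups at random
        shuffled = abs(mean(pooled[:n_llama]) - mean(pooled[n_llama:]))
        if shuffled >= observed:
            extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (extreme + 1) / (n_permutations + 1)


def benjamini_hochberg(p_values):
    """Return FDR-adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy per-document weights for one term (hypothetical values).
toy_llama = [0.05, 0.04, 0.06, 0.05, 0.07, 0.04, 0.05, 0.06]
toy_qwen = [0.0, 0.01, 0.0, 0.0, 0.02, 0.0, 0.01, 0.0]
toy_diff = mean_difference(toy_llama, toy_qwen)
toy_p = permutation_p_value(toy_llama, toy_qwen)
```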

<p>The retrieval experiments show a direct consequence for ranked output. Under a fixed scoring function applied to the same disagreement pool, the top-ranked documents changed according to the model that filtered the candidate set. In SDG 7, the LLaMA-relevant subset contained 19 of the top 20 centroid-ranked disagreement documents, while the Qwen-relevant subset contained one. The ranking logic was held constant, so the difference came from the earlier relevance filter.</p>
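
<p>A minimal sketch of this kind of check follows. It assumes documents are represented as numeric vectors, that the ranking score is cosine similarity to the centroid of a reference set, and that each pooled document is tagged with the model that labeled it relevant. The post does not say how the centroid was built, so the reference set, the function names, and the toy vectors here are hypothetical.</p>

```python
import math


def centroid(vectors):
    """Component-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_by_source(pool, reference_vectors, k):
    """Count which model's subset supplies the top-k ranked documents.

    pool: list of (doc_id, source_model, vector) for the disagreement set.
    """
    center = centroid(reference_vectors)
    ranked = sorted(pool, key=lambda item: cosine(item[2], center), reverse=True)
    counts = {}
    for _doc_id, source, _vector in ranked[:k]:
        counts[source] = counts.get(source, 0) + 1
    return counts


# Toy two-dimensional vectors (hypothetical, for illustration only).
toy_reference = [[1.0, 0.0], [0.9, 0.1]]
toy_pool = [
    ("d1", "llama", [1.0, 0.1]),
    ("d2", "llama", [0.8, 0.2]),
    ("d3", "qwen", [0.1, 1.0]),
    ("d4", "qwen", [0.7, 0.6]),
    ("d5", "llama", [0.9, 0.0]),
]
toy_counts = top_k_by_source(toy_pool, toy_reference, k=4)
```

<p>Because the scoring function and the pool are identical for both subsets, any imbalance in the counts reflects which documents each filter admitted rather than a change in ranking logic.</p>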

<p>A separate classification experiment showed that disagreement was learnable from lexical features. Logistic regression classifiers trained on TF-IDF features predicted which model labeled a disagreement document as relevant with AUC scores above chance for all three goals. The AUC was 0.739 for SDG 1, 0.753 for SDG 3, and 0.703 for SDG 7. These results do not identify either model as correct. They show that the models used different, learnable lexical criteria when assigning relevance.</p>
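
<p>For readers less familiar with AUC, it can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes it directly from that definition. The label coding (1 for LLaMA-only relevant, 0 for Qwen-only relevant) and the toy scores are assumptions for illustration, not outputs of the paper's classifiers.</p>

```python
def roc_auc(scores, labels):
    """AUC from pairwise comparisons of positive and negative scores."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Toy classifier scores: higher means "more likely LLaMA-only relevant".
toy_scores = [0.9, 0.8, 0.35, 0.6, 0.4, 0.2]
toy_labels = [1, 1, 1, 0, 0, 0]
toy_auc = roc_auc(toy_scores, toy_labels)
```

<p>A value of 0.5 corresponds to chance, so AUCs in the 0.70 to 0.75 range indicate that lexical features carry real but incomplete information about which model admitted a document.</p>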

<p>The paper does not use LLM labels as ground truth. For subjective retrieval tasks, including policy-relevant tasks such as SDG assessment, a single definitive label may not exist. A publication can contribute to a goal directly, indirectly, methodologically, or under a particular interpretation of the target. In that setting, the better evaluation question is how each model changes the corpus available for ranking and synthesis.</p>

<p>A single LLM filter cannot be described as neutral preprocessing in SDG retrieval. It determines document eligibility before ranking begins and can remove alternative interpretations of relevance from the result set. A dashboard, literature review, or retrieval-augmented generation workflow built on a filtered corpus inherits those exclusions.</p>

<p>Digital library systems that use LLM filtering should report and inspect disagreement sets rather than relying on aggregate agreement. Audits should identify which topics each model admits or excludes, which disciplines gain or lose representation, which lexical cues mark the edge of relevance, and which documents disappear before ranking begins.</p>

<p>Retrieval workflows should expose model disagreement before downstream synthesis. Multi-model filtering, human review of disagreement cases, and audits of model justifications can locate where model-specific criteria enter the retrieval process. Extending the analysis to additional models, domains, and RAG-based policy briefs would measure how filtering variability changes the substantive content of generated outputs.</p>

<p>When LLMs disagree in thematic retrieval, the disagreement identifies the documents most sensitive to the definition of relevance. Those documents require analysis before the filtered corpus is used for institutional reporting or evidence synthesis.</p>

<p>The paper is available through IEEE with DOI <a href="https://doi.org/10.1109/JCDL67857.2025.00024">10.1109/JCDL67857.2025.00024</a>, and the project code is available on <a href="https://github.com/waingram/llm-sdg-disagreement">GitHub</a>.</p>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1">
      <p>William A. Ingram, Bipasha Banerjee, and Edward A. Fox. 2025. Learning from LLM Disagreement in Retrieval Evaluation. In <em>2025 ACM/IEEE Joint Conference on Digital Libraries (JCDL)</em>. IEEE. <a href="https://doi.org/10.1109/JCDL67857.2025.00024">https://doi.org/10.1109/JCDL67857.2025.00024</a> <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name></name></author><category term="AI" /><category term="Digital Libraries" /><category term="Information Retrieval" /><category term="SDGs" /><category term="JCDL" /><summary type="html"><![CDATA[Our JCDL 2025 paper with Bipasha Banerjee and Edward A. Fox examines how model disagreement changes retrieval evaluation when large language models filter scholarly records before ranking. “Learning from LLM Disagreement in Retrieval Evaluation” shows that disagreement between relevance labelers can identify cases near the boundary of an information need. In thematic retrieval tasks, particularly ones involving Sustainable Development Goals (SDGs), those boundary cases determine which records remain available to a dashboard, bibliography, or downstream synthesis.]]></summary></entry><entry><title type="html">The VTechAGP Dataset: A Benchmark for Academic-to-General-Audience Paraphrasing</title><link href="https://waingram.github.io/blog/the-vtechagp-dataset-a-benchmark-for-academic-to-general-audience-paraphrasing/" rel="alternate" type="text/html" title="The VTechAGP Dataset: A Benchmark for Academic-to-General-Audience Paraphrasing" /><published>2025-02-10T20:58:00+00:00</published><updated>2025-02-10T20:58:00+00:00</updated><id>https://waingram.github.io/blog/the-vtechagp-dataset-a-benchmark-for-academic-to-general-audience-paraphrasing</id><content type="html" xml:base="https://waingram.github.io/blog/the-vtechagp-dataset-a-benchmark-for-academic-to-general-audience-paraphrasing/"><![CDATA[<p>I recently collaborated with <a href="https://github.com/SIGSEGV-0x7">Ming Cheng</a> and 
<a href="https://sites.google.com/vt.edu/jiaying-gong/home">Jiaying Gong</a>, two members of
the <a href="https://people.cs.vt.edu/hdardiry/lab/">Machine Learning Laboratory</a> research team
led by <a href="https://people.cs.vt.edu/~hdardiry/">Dr. Hoda Eldardiry</a>. We created
the VTechAGP dataset to support research on text simplification and paraphrase
generation.</p>

<h3 id="motivation">Motivation</h3>

<p>Non-specialists often struggle with academic writing due to its
discipline-specific jargon and rigid conventions. Text simplification
research seeks to mitigate this challenge by reducing linguistic
complexity, primarily through lexical and syntactic modifications.
Most existing datasets are restricted to sentence-level transformations and lack
coverage across multiple domains, limiting their utility for broader
applications. While domain-specific datasets, such as those in medicine
and law, provide targeted resources, they do not support general-purpose
simplification across disciplines. For a comparison of existing text simplification and paraphrase datasets, see Table 5 in the appendix of Cheng et al. (2024).<sup id="fnref:1"><a href="#fn:1" class="footnote" rel="footnote" role="doc-noteref">1</a></sup></p>

<h3 id="institutional-context-and-rationale">Institutional Context and Rationale</h3>

<p>To address the lack of broadly applicable, document-level datasets, we propose VTechAGP,<sup id="fnref:2"><a href="#fn:2" class="footnote" rel="footnote" role="doc-noteref">2</a></sup> which provides a parallel corpus that pairs full academic abstracts with their general-audience counterparts. These pairs were collected
from <a href="https://vtechworks.lib.vt.edu/">VTechWorks</a>, Virginia Tech’s
institutional repository, which includes Electronic Theses and
Dissertations (ETDs) along with other scholarly materials. The <a href="https://guides.lib.vt.edu/ETDguide">Graduate
School’s ETD policies</a> require
students to submit both a traditional academic abstract and a
general-audience abstract as part of their thesis or dissertation.
Because these abstracts are written for distinct audiences but
correspond to the same work, they provide a basis for analyzing
differences in lexical complexity, syntactic variation, and semantic
focus between academic and general-audience writing. By capturing these
distinctions at the document level, the dataset serves as a resource for
research in text simplification and domain adaptation in NLP.</p>

<h3 id="dataset-construction">Dataset Construction</h3>

<p>To build the dataset, I collected academic and general-audience abstracts from <a href="https://vtechworks.lib.vt.edu/">VTechWorks</a> using the <a href="https://www.openarchives.org/OAI/openarchivesprotocol.html">Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH)</a>. Each record includes:</p>

<ul>
  <li>A traditional academic abstract.</li>
  <li>A corresponding general-audience abstract.</li>
  <li>Metadata such as title, discipline, department, and degree information.</li>
</ul>
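
<p>As a rough sketch of the harvesting step, the code below builds an OAI-PMH <code>ListRecords</code> request and reads the <code>resumptionToken</code> that the protocol uses for paging. The base URL, the <code>oai_dc</code> metadata prefix, and the sample response are placeholders chosen for illustration. They are assumptions, not the endpoint or settings used to build VTechAGP.</p>

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Hypothetical endpoint; substitute the repository's real OAI-PMH base URL.
BASE_URL = "https://repository.example.edu/oai/request"


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords URL.

    Per the protocol, a resumptionToken is an exclusive argument: when it is
    present, no other argument besides the verb is sent.
    """
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def next_token(response_xml):
    """Return the resumption token from a response, or None on the last page."""
    root = ET.fromstring(response_xml)
    token = root.find(".//oai:resumptionToken", OAI_NS)
    if token is None or not (token.text or "").strip():
        return None
    return token.text.strip()


# Trimmed sample response (hypothetical) showing only the paging element.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

first_url = list_records_url(BASE_URL)
follow_up_url = list_records_url(BASE_URL, resumption_token=next_token(SAMPLE_RESPONSE))
```

<p>A harvester would fetch <code>first_url</code>, extract the records, and keep requesting follow-up pages until <code>next_token</code> returns <code>None</code>.</p>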

<p>The dataset consists of paired academic and general-audience abstracts, allowing for document-level analysis of structural and linguistic differences. This enables potential applications in NLP for document-level paraphrasing and retrieval-based reformulation tasks.</p>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1">
      <p>Ming Cheng, Jiaying Gong, Chenhan Yuan, William A. Ingram, Edward A. Fox, and Hoda Eldardiry. 2024. VTechAGP: An Academic-to-General-Audience Text Paraphrase Dataset and Benchmark Models. arXiv:2411.04825. <a href="https://doi.org/10.48550/arXiv.2411.04825">https://doi.org/10.48550/arXiv.2411.04825</a> <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:2">
      <p>The VTechAGP dataset is publicly available via <a href="https://doi.org/10.5281/zenodo.14833933">Zenodo (DOI: 10.5281/zenodo.14833933)</a> and <a href="https://github.com/waingram/VTechAGP-Dataset">GitHub</a>, distributed under the Open Data Commons Attribution License (ODC-By). <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name></name></author><category term="NLP" /><category term="Datasets" /><category term="Digital Libraries" /><category term="Text Simplification" /><summary type="html"><![CDATA[I recently collaborated with Ming Cheng and Jiaying Gong, two members of the Machine Learning Laboratory research team led by Dr. Hoda Eldardiry. We created the VTechAGP dataset to support research on text simplification and paraphrase generation.]]></summary></entry><entry><title type="html">Small, Locally-Hosted LLMs for Sustainable Development Goal Classification</title><link href="https://waingram.github.io/blog/ieee-poster-sdg/" rel="alternate" type="text/html" title="Small, Locally-Hosted LLMs for Sustainable Development Goal Classification" /><published>2024-12-05T17:36:48+00:00</published><updated>2024-12-05T17:36:48+00:00</updated><id>https://waingram.github.io/blog/ieee-poster-sdg</id><content type="html" xml:base="https://waingram.github.io/blog/ieee-poster-sdg/"><![CDATA[<p>We are excited to announce that “Agentic AI for Improving Precision in Identifying Contributions to Sustainable Development Goals” has been accepted as a poster at the 2024 IEEE International Conference on Big Data (IEEE BigData 2024), which will take place from December 15–18, 2024, in Washington, DC. <a href="http://bigdataieee.org/BigData2024/">Learn more about the conference</a>.</p>

<hr />

<h2 id="about-the-study">About the Study</h2>

<p>Accurately assessing research contributions to the United Nations’ Sustainable Development Goals (SDGs) is a growing priority for academic institutions. Traditional methods, which rely heavily on keyword-based Boolean search queries, often conflate incidental keyword matches with genuine contributions to SDG targets, leading to reduced precision in bibliometric analyses.</p>

<p>Our study proposes using small, locally hosted large language models (LLMs) as evaluation agents to address the limitations of keyword-based retrieval. Using a dataset of 340,000 abstracts retrieved via SDG-specific keyword queries, we demonstrated how these models can distinguish substantive contributions to SDG targets from incidental mentions.</p>

<hr />

<h2 id="key-highlights">Key Highlights</h2>

<ul>
  <li>Novel Application: We evaluated three small, locally hosted LLMs (Mistral-7B, Phi-3.5-mini, and Llama-3.2) for their ability to classify SDG-related research contributions with greater precision than keyword-based methods.</li>
  <li>Improved Precision: These models read abstracts for meaning rather than matching surface-level keywords, addressing a key limitation of traditional SDG classification workflows.</li>
  <li>Scalability: By running these LLMs locally, the approach offers a cost-efficient and scalable framework for institutions to align research with the SDGs.</li>
</ul>

<hr />

<h2 id="why-it-matters">Why It Matters</h2>

<p>This work gives institutions a way to separate substantive SDG contributions from incidental keyword matches when classifying scholarly output. The findings have broader implications for institutional benchmarking, funding strategies, and semantic search applications.</p>

<hr />

<h2 id="future-directions">Future Directions</h2>

<p>Our research paves the way for:</p>
<ol>
  <li>Developing multi-agent frameworks that combine multiple models to refine classification further.</li>
  <li>Applying these techniques in semantic search systems to enable more effective discovery of SDG-relevant research.</li>
</ol>

<hr />

<h2 id="read-the-full-preprint">Read the Full Preprint</h2>

<p>The full preprint of our work is available on arXiv: <a href="https://arxiv.org/abs/2411.17598">https://arxiv.org/abs/2411.17598</a>.</p>

<p>We look forward to presenting this work at IEEE BigData 2024 and engaging with the community on the potential of LLMs in advancing SDG-related research.</p>

<hr />]]></content><author><name></name></author><category term="AI" /><category term="SDGs" /><category term="IEEE BigData" /><category term="Poster" /><summary type="html"><![CDATA[We are excited to announce that “Agentic AI for Improving Precision in Identifying Contributions to Sustainable Development Goals” has been accepted as a poster at the 2024 IEEE International Conference on Big Data (IEEE BigData 2024), which will take place from December 15–18, 2024, in Washington, DC.]]></summary></entry></feed>