{
  "id": "https://waingram.github.io/publications.json",
  "type": "publication_collection",
  "last_updated": "2026-06-18",
  "owner": "https://waingram.github.io/#waingram",
  "is_complete_publication_list": true,
  "source": {
    "human_readable_page": "https://waingram.github.io/publications/",
    "bibtex_directory": "_data/publications",
    "categories": [
      "journal_articles",
      "book_chapters",
      "conference_papers",
      "workshop_papers",
      "extended_abstracts",
      "tutorials",
      "hosted_workshops",
      "reports"
    ]
  },
  "publications": [
    {
  "id": "https://waingram.github.io/publications/ingram2024building.html",
  "type": "ScholarlyArticle",
  "bibtex_key": "ingram2024building",
  "bibtex_type": "article",
  "publication_category": "uncategorized",
  "title": "Building datasets to support information extraction and structure parsing from electronic theses and dissertations",
  "authors": ["Ingram, William A.","Wu, Jian","Kahu, Sampanna Yashwant","Manzoor, Javaid Akbar","Banerjee, Bipasha","Ahuja, Aman","Choudhury, Muntabir Hasan","Salsabil, Lamia","Shields, Winston","Fox, Edward A."],
  "year": "2024",
  "date_published": "2024",
  "container_title": "International Journal on Digital Libraries",
  "journal": "International Journal on Digital Libraries",
  "volume": "25",
  "number": "2",
  "pages": "175–196",
  "doi": "10.1007/s00799-024-00395-4",
  "doi_url": "https://doi.org/10.1007/s00799-024-00395-4",
  "url": "https://doi.org/10.1007/s00799-024-00395-4",
  "abstract": "Despite the millions of electronic theses and dissertations (ETDs) publicly available online, digital library services for ETDs have not evolved past simple search and browse at the metadata level. We need better digital library services that allow users to discover and explore the content buried in these long documents. Recent advances in machine learning have shown promising results for decomposing documents into their constituent parts, but these models and techniques require data for training and evaluation. In this article, we present high-quality datasets to train, evaluate, and compare machine learning methods in tasks that are specifically suited to identify and extract key elements of ETD documents. We explain how we construct the datasets by manual labeling the data or by deriving labeled data through synthetic processes. We demonstrate how our datasets can be used to develop downstream applications and to evaluate, retrain, or fine-tune pre-trained machine learning models. We describe our ongoing work to compile benchmark datasets and exploit machine learning techniques to build intelligent digital libraries for ETDs."
},{
  "id": "https://waingram.github.io/publications/li2020teaching.html",
  "type": "ScholarlyArticle",
  "bibtex_key": "li2020teaching",
  "bibtex_type": "article",
  "publication_category": "uncategorized",
  "title": "Teaching Natural Language Processing through Big Data Text Summarization with Problem-Based Learning",
  "authors": ["Li, Liuqing","Geissinger, Jack","Ingram, William A.","Fox, Edward A."],
  "year": "2020",
  "date_published": "2020",
  "container_title": "Data and Information Management",
  "journal": "Data and Information Management",
  "volume": "4",
  "number": "1",
  "pages": "18–43",
  "doi": "10.2478/dim-2020-0003",
  "doi_url": "https://doi.org/10.2478/dim-2020-0003",
  "url": "https://doi.org/10.2478/dim-2020-0003",
  "abstract": "Natural language processing (NLP) covers a large number of topics and tasks related to data and information management, leading to a complex and challenging teaching process. Meanwhile, problem-based learning is a teaching technique specifically designed to motivate students to learn efficiently, work collaboratively, and communicate effectively. With this aim, we developed a problem-based learning course for both undergraduate and graduate students to teach NLP. We provided student teams with big data sets, basic guidelines, cloud computing resources, and other aids to help different teams in summarizing two types of big collections: Web pages related to events, and electronic theses and dissertations (ETDs). Student teams then deployed different libraries, tools, methods, and algorithms to solve the task of big data text summarization. Summarization is an ideal problem to address learning NLP since it involves all levels of linguistics, as well as many of the tools and techniques used by NLP practitioners. The evaluation results showed that all teams generated coherent and readable summaries. Many summaries were of high quality and accurately described their corresponding events or ETD chapters, and the teams produced them along with NLP pipelines in a single semester. Further, both undergraduate and graduate students gave statistically significant positive feedback, relative to other courses in the Department of Computer Science. Accordingly, we encourage educators in the data and information management field to use our approach or similar methods in their teaching and hope that other researchers will also use our data sets and synergistic solutions to approach the new and challenging tasks we addressed."
},{
  "id": "https://waingram.github.io/publications/ingram2019summarizing.html",
  "type": "ScholarlyArticle",
  "bibtex_key": "ingram2019summarizing",
  "bibtex_type": "article",
  "publication_category": "uncategorized",
  "title": "Summarizing ETDs with deep learning",
  "authors": ["Ingram, William A.","Banerjee, Bipasha","Fox, Edward A."],
  "year": "2019",
  "date_published": "2019",
  "container_title": "Cadernos BAD",
  "journal": "Cadernos BAD",
  "volume": "1",
  "pages": "46–52",
  "doi": "10.48798/cadernosbad.2014",
  "doi_url": "https://doi.org/10.48798/cadernosbad.2014",
  "url": "https://doi.org/10.48798/cadernosbad.2014",
  "abstract": "Inspired by the millions of Electronic Theses and Dissertations (ETDs) openly available online, we describe a novel use of ETDs as data for text summarization. We use a large corpus of ETDs to evaluate techniques for generating abstractive summaries with deep learning. Using an extensive ETD collection of over 30,000 doctoral dissertations and master’s theses, we examine the quality of state-of-the-art deep learning summarization technologies when applied to an ETD corpus. Deep learning requires a large set of training data to produce satisfactory results. Finding suitable training data is especially difficult due to the widespread use of domain-specific jargon in ETDs, coupled with the wide-ranging breadth of subject matter contained in an ETD corpus. To overcome this significant limitation, we demonstrate the potential of transfer learning on automatic summarization of ETD chapters. We apply several combinations of deep learning models and training data to the ETD chapter summarization task and compare the outputs of the top performers."
},{
  "id": "https://waingram.github.io/publications/fallaw2016overly.html",
  "type": "ScholarlyArticle",
  "bibtex_key": "fallaw2016overly",
  "bibtex_type": "article",
  "publication_category": "uncategorized",
  "title": "Overly Honest Data Repository Development",
  "authors": ["Fallaw, Colleen","Dunham, Elise","Wickes, Elizabeth","Strong, Dena","Stein, Ayla","Zhang, Qian","Rimkus, Kyle","Ingram, William A.","Imker, Heidi J."],
  "year": "2016",
  "date_published": "2016",
  "container_title": "The Code4Lib Journal",
  "journal": "The Code4Lib Journal",
  "number": "34",
  "url": "https://journal.code4lib.org/articles/11980",
  "abstract": "After a year of development, the library at the University of Illinois at Urbana-Champaign has launched a repository, called the Illinois Data Bank (https://databank.illinois.edu/), to provide Illinois researchers with a free, self-serve publishing platform that centralizes, preserves, and provides persistent and reliable access to Illinois research data. This article presents a holistic view of development by discussing our overarching technical, policy, and interface strategies. By openly presenting our design decisions, the rationales behind those decisions, and associated challenges this paper aims to contribute to the library community’s work to develop repository services that meet growing data preservation and sharing needs."
},{
  "id": "https://waingram.github.io/publications/habing2009developments.html",
  "type": "ScholarlyArticle",
  "bibtex_key": "habing2009developments",
  "bibtex_type": "article",
  "publication_category": "uncategorized",
  "title": "Developments in Digital Preservation at the University of Illinois: The Hub and Spoke Architecture for Supporting Repository Interoperability and Emerging Preservation Standards",
  "authors": ["Habing, Thomas","Eke, Janet","Cordial, Matthew A.","Ingram, William","Manaster, Robert"],
  "year": "2009",
  "date_published": "2009",
  "container_title": "Library Trends",
  "journal": "Library Trends",
  "volume": "57",
  "number": "3",
  "pages": "556–579",
  "doi": "10.1353/lib.0.0052",
  "doi_url": "https://doi.org/10.1353/lib.0.0052",
  "url": "https://doi.org/10.1353/lib.0.0052",
  "abstract": "Funded by the National Digital Information Infrastructure and Preservation Program (NDIIPP), the ECHO DEPository Project supports the digital preservation efforts of the Library of Congress by contributing research and software to help society GET, SAVE, and KEEP its digital cultural heritage. Project activities include building Web archiving tools, evaluating existing repository software, developing architectures to enhance existing repositories’ interoperability and preservation features, and modeling next-generation repositories for supporting long-term preservation. This article describes the development of the Hub and Spoke (HandS) Tool Suite, built to help curators of digital objects manage content in multiple repository systems while preserving valuable preservation metadata. Implementing METS and PREMIS, HandS provides a standards-based method for packaging content that allows digital objects to be moved between repositories more easily while supporting the collection of technical and provenance information crucial for long-term preservation. Related project work investigating the more fundamental semantic issues underlying the preservation of the meaningof digital objects over time is profiled separately in this issue (Dubin et al., 2009)."
},
    {
  "id": "https://waingram.github.io/publications/Ingram2024.html",
  "type": "Chapter",
  "bibtex_key": "Ingram2024",
  "bibtex_type": "incollection",
  "publication_category": "uncategorized",
  "title": "Archives, Digital Search, and AI Ethics",
  "authors": ["Ingram, William A.","Johnson, Sylvester A."],
  "year": "2024",
  "date_published": "2024",
  "container_title": "The Routledge Companion to Libraries, Archives, and the Digital Humanities",
  "booktitle": "The Routledge Companion to Libraries, Archives, and the Digital Humanities",
  "pages": "479–492",
  "publisher": "Routledge",
  "doi": "10.4324/9781003327738-38",
  "doi_url": "https://doi.org/10.4324/9781003327738-38",
  "url": "https://www.taylorfrancis.com/chapters/10.4324/9781003327738-38/archives-digital-search-ai-ethics-william-ingram-sylvester-johnson",
  "abstract": "The rapid growth of digital records managed by national archives has generated new opportunities for professionals and the broader public to study the actions of democratic governments. To help them keep pace with the overwhelming growth of digital records, archives are turning to artificial intelligence (AI), raising concerns for ethical accountability of algorithmic technology. This chapter examines challenges and insights drawn from one case study—the National Archives and Records Administration, the official record keeper of the U.S. government. The chapter interprets findings from this case study within the global context of government archives and digital record search more broadly, addressing the urgent need for algorithmic tools to enable successful digital records search and the ethical challenges they introduce. The chapter concludes with some recommendations for developing ethical AI services for public archives and a strategy for algorithmic auditing."
},
    {
  "id": "https://waingram.github.io/publications/klair2026Evaluating.html",
  "type": "ScholarlyArticle",
  "bibtex_key": "klair2026Evaluating",
  "bibtex_type": "inproceedings",
  "publication_category": "uncategorized",
  "title": "Evaluating Human-LLM Alignment in ETD Subject Classification",
  "authors": ["Klair, Hajra","German, Fausto","Banerjee, Bipasha","Ingram, William A."],
  "year": "2026",
  "date_published": "2026",
  "container_title": "New Trends in Theory and Practice of Digital Libraries",
  "booktitle": "New Trends in Theory and Practice of Digital Libraries",
  "pages": "57–69",
  "publisher": "Springer Nature Switzerland",
  "location": "Cham",
  "doi": "10.1007/978-3-032-06136-2_6",
  "doi_url": "https://doi.org/10.1007/978-3-032-06136-2_6",
  "url": "https://doi.org/10.1007/978-3-032-06136-2_6",
  "abstract": "Author-assigned subject labels in Electronic Theses and Dissertations (ETDs) are often inconsistent, overly broad, or misaligned with the research focus. This hampers discovery, aggregation, and analysis, especially for interdisciplinary research. LLMs offer a scalable alternative for automated classification, but their labeling rationale is opaque and introduces systematic biases. This study compares subject labels generated by LLMs with human-assigned labels for over 9,000 ETDs across 21 academic categories to assess the disagreement. We evaluate multiple prompt-based and fine-tuned LLM configurations and analyze areas of agreement and disagreement to identify patterns of misclassification. LLMs achieve competitive performance overall but frequently misclassify theoretical or interdisciplinary texts, often due to overweighting lexical cues and disregarding context. We show such errors are not random but reflect structured semantic divergences from human interpretation. These findings suggest a need for hybrid frameworks that combine LLM scalability with human contextual judgment to improve subject labeling in academic repositories."
},{
  "id": "https://waingram.github.io/publications/ingram2025learning.html",
  "type": "ScholarlyArticle",
  "bibtex_key": "ingram2025learning",
  "bibtex_type": "inproceedings",
  "publication_category": "uncategorized",
  "title": "Learning from LLM Disagreement in Retrieval Evaluation",
  "authors": ["Ingram, William A.","Banerjee, Bipasha","Fox, Edward A."],
  "year": "2025",
  "date_published": "2025",
  "container_title": "Proceedings of the 2025 ACM/IEEE Joint Conference on Digital Libraries",
  "booktitle": "Proceedings of the 2025 ACM/IEEE Joint Conference on Digital Libraries",
  "series": "JCDL ’25",
  "pages": "129–138",
  "location": "Virtual Event",
  "doi": "10.1109/JCDL67857.2025.00024",
  "doi_url": "https://doi.org/10.1109/JCDL67857.2025.00024",
  "url": "https://doi.org/10.1109/JCDL67857.2025.00024",
  "abstract": "Large language models (LLMs) are being integrated into information retrieval pipelines within digital library systems for tasks such as re-ranking and filtering. However, a challenge arises from the observed disagreement between different LLMs in borderline classification cases, raising concerns about how this variability impacts downstream retrieval and the integrity of digital library collections. This study examines disagreement between two open-weight LLMs, LLaMA and Qwen, when tasked with evaluating a corpus of scholarly abstracts based on their contribution to Sustainable Development Goals (SDGs). We isolate subsets of documents where model disagreement occurs and examine their lexical properties, rank-order behavior, and classification predictability. Our results demonstrate that this model disagreement is not random: it concentrates in ambiguous cases, produces divergent top-k outputs under shared scoring functions, and is separable with AUCs above 0.74 using logistic regression. These findings suggest that LLM-based filtering introduces structured variability in document retrieval, even under controlled prompting and shared ranking logic. We propose using classification disagreement as an object of analysis in retrieval evaluation, particularly in subjective or thematic search tasks."
},{
  "id": "https://waingram.github.io/publications/salsabil2025contextbased.html",
  "type": "ScholarlyArticle",
  "bibtex_key": "salsabil2025contextbased",
  "bibtex_type": "inproceedings",
  "publication_category": "uncategorized",
  "title": "Context-Based URL Classification for Open Access Datasets and Software in Scholarly Documents",
  "authors": ["Salsabil, Lamia","Obadage, Rochana R.","Banerjee, Bipasha","Abeysinghe, Yasasi","Alam, Sawood","Färber, Michael","Ingram, William","Fox, Edward","Wu, Jian"],
  "year": "2025",
  "date_published": "2025",
  "container_title": "Proceedings of the 2025 ACM/IEEE Joint Conference on Digital Libraries",
  "booktitle": "Proceedings of the 2025 ACM/IEEE Joint Conference on Digital Libraries",
  "series": "JCDL ’25",
  "pages": "197–206",
  "location": "Virtual Event",
  "doi": "10.1109/JCDL67857.2025.00031",
  "doi_url": "https://doi.org/10.1109/JCDL67857.2025.00031",
  "url": "https://doi.org/10.1109/JCDL67857.2025.00031",
  "abstract": "This study presents a novel framework for automatically classifying open-access datasets and software (OADS) URLs in scholarly documents. Accurate classification of OADS-URLs is the first step in investigating the availability and preservability of OADS, a crucial step toward open science and computational reproducibility. Our framework, EnSU, leverages an ensemblebased approach to classify OADS-URLs by their citation contexts. The ensemble integrates three models: a Supervised Contrastive Learning model, a SciBERT-based model, and a BertGCN model. Our framework distinguishes the resource types (dataset vs. software) and providers (author vs. third-party). To train and evaluate EnSU, we compiled a dataset, OADS-1K, comprising 1,129 manually annotated sentences containing URLs along with their expanded contexts. Our model outperforms all baseline classifiers, including a large language model-based approach, with the best F 1 -score of 90%. The dataset and source code are publicly available at: https://github.com/lamps-lab/EnSU/tree/main."
},{
  "id": "https://waingram.github.io/publications/aboelnaga2025identifying.html",
  "type": "ScholarlyArticle",
  "bibtex_key": "aboelnaga2025identifying",
  "bibtex_type": "inproceedings",
  "publication_category": "uncategorized",
  "title": "Identifying Future Work Chapters in Electronic Theses and Dissertations",
  "authors": ["Aboelnaga, Amr","Klair, Hajra","Eldardiry, Hoda","Ingram, William A."],
  "year": "2025",
  "date_published": "2025",
  "container_title": "Proceedings of the 2025 ACM/IEEE Joint Conference on Digital Libraries",
  "booktitle": "Proceedings of the 2025 ACM/IEEE Joint Conference on Digital Libraries",
  "series": "JCDL ’25",
  "pages": "177–186",
  "location": "Virtual Event",
  "doi": "10.1109/JCDL67857.2025.00029",
  "doi_url": "https://doi.org/10.1109/JCDL67857.2025.00029",
  "url": "https://doi.org/10.1109/JCDL67857.2025.00029",
  "abstract": "Electronic Theses and Dissertations (ETDs) contain conclusion and future work chapters that are difficult to locate automatically due to highly variable chapter titles across disciplines and institutions, limiting large-scale synthesis and discovery. We investigate automatic detection of conclusion/future-work–related chapters, operationalized with seven labels (conclusions, summary, discussion, future work, recommendations, limitations, implications) at the start-page level, across 299 ETDs with 334 annotated positives spanning seven academic domains. We compare heading-driven baselines (GROBID and LayoutLMv3 for heading extraction, paired with lexical, semantic, NLI, and LLM classifiers) against a modular LLM system with three components: (1) layout-preserving text extraction, (2) LLM-based page filtering to retain likely chapter starts, and (3) LLM chapter detection. We systematically test combinations of these components (referred to as stages throughout) to isolate individual contributions. Evaluation is page-level with exact start-page matching. Our best result (Llama 4 Scout, Stage 2+3) outperforms the strongest baseline (LayoutLMv3–LLM). Stage 2 substantially improves precision, while Stage 1 has mixed, generally modest effects across models. Mistral Small achieves the highest precision, whereas Llama 3.3 yields the highest recall, underscoring model trade-offs. We release prompts and configurations for reproducibility and highlight compute–accuracy considerations, showing that lightweight LLM-based page filtering combined with LLM chapter detection is a practical, effective strategy for surfacing conclusion/future-work content in long, heterogeneous ETDs."
},{
  "id": "https://waingram.github.io/publications/ingram2025evaluating.html",
  "type": "ScholarlyArticle",
  "bibtex_key": "ingram2025evaluating",
  "bibtex_type": "inproceedings",
  "publication_category": "uncategorized",
  "title": "Evaluating the Impact of Automated Labeling on Retrieval Instability in Neural IR",
  "authors": ["Ingram, William A."],
  "year": "2025",
  "date_published": "2025",
  "container_title": "Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval",
  "booktitle": "Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval",
  "series": "SIGIR ’25",
  "pages": "4209",
  "publisher": "Association for Computing Machinery",
  "location": "Padua, Italy",
  "doi": "10.1145/3726302.3730128",
  "doi_url": "https://doi.org/10.1145/3726302.3730128",
  "url": "https://doi.org/10.1145/3726302.3730128",
  "note": "Doctoral Consortium Paper",
  "abstract": "\n               Effective information retrieval (IR) depends on accurate relevance classification. But when the criteria are subjective or underspecified, small variations in classification can cause consequential shifts in retrieval results. The potential for such variability becomes critical for institutions when they use IR for research assessment. Retrieval instability can lead to relevant literature being overlooked, hindering a comprehensive understanding of the research landscape, and potentially undermining the validity of subsequent analyses and decisions.\n               We investigate this problem within the context of the United Nations Sustainable Development Goals (SDGs), a global framework for addressing environmental, social, and economic challenges. Scholarly research is vital for understanding, implementing, and monitoring SDG progress. Universities report SDG-related research to demonstrate impact, and international rankings incorporate SDG alignment into evaluations, influencing funding, policy, and institutional strategy. However, the nuanced nature of the SDGs makes it difficult to define what constitutes an SDG contribution [1]. Commonly used Boolean queries and controlled vocabularies for SDG retrieval cannot reliably differentiate substantive contributions (based on semantic relevance) from mere term occurrences.\n               In prior work, Large Language Models (LLMs) have been used to filter Boolean search results in systematic reviews by scoring documents for relevance to a specific information need [2]. Other studies demonstrate that LLMs can generate high-quality relevance labels for IR evaluation [4]. This prompted an investigation into using LLMs to judge SDG contribution through relevance filtering, which revealed variability in the judgments made by different LLMs on the same set of documents [3]. This observation suggests that the classification behavior of LLMs are sensitive to the specific parameters inherent to each model.\n               In this study, we prompt multiple LLMs to judge the SDG relevance of abstracts retrieved using Boolean queries. Abstracts judged relevant are used as positive training examples for fine-tuning multi-label SDG classifiers. We use these classifiers to simulate retrieval, applying fixed scoring functions to isolate fluctuations in ranking stability attributable to the different LLM relevance judgments. Our goal is to analyze how the structured signal of upstream inconsistencies in LLM-derived relevance judgments manifests as variations in retrieval outcomes, providing a novel lens for investigating ranking stability under classification uncertainty. This research centers on three key questions:\n               RQ1: How do different LLMs diverge in their filtering decisions, and what effect does this have on ranking stability in retrieval systems trained on filtered data?\n               RQ2: Can divergence in labeling decisions be systematically explained or predicted from document content?\n               RQ3: What distinguishes documents where LLMs disagree on relevance, and can these differences be predicted from lexical or surface-level features?\n               Using SDG classification as a case study of subjective relevance, we evaluate retrieval stability under classification uncertainty and address broader concerns regarding the reproducibility of LLM-based classification pipelines and their downstream effects."
},{
  "id": "https://waingram.github.io/publications/cheng2025vtechagp.html",
  "type": "ScholarlyArticle",
  "bibtex_key": "cheng2025vtechagp",
  "bibtex_type": "inproceedings",
  "publication_category": "uncategorized",
  "title": "VTechAGP: An Academic-to-General-Audience Text Paraphrase Dataset and Benchmark Models",
  "authors": ["Cheng, Ming","Gong, Jiaying","Yuan, Chenhan","Ingram, William A","Fox, Edward","Eldardiry, Hoda"],
  "year": "2025",
  "date_published": "2025",
  "container_title": "Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)",
  "booktitle": "Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)",
  "pages": "6110–6130",
  "publisher": "Association for Computational Linguistics",
  "doi": "10.18653/v1/2025.naacl-long.311",
  "doi_url": "https://doi.org/10.18653/v1/2025.naacl-long.311",
  "url": "https://doi.org/10.18653/v1/2025.naacl-long.311",
  "abstract": "Existing text simplification or paraphrase datasets mainly focus on sentence-level text generation in a general domain. These datasets are typically developed without using domain knowledge. In this paper, we release a novel dataset, VTechAGP, which is the first academic-to-general-audience text paraphrase dataset consisting of document-level these and dissertation academic and general-audience abstract pairs from 8 colleges authored over 25 years. We also propose a novel dynamic soft prompt generative language model, DSPT5. For training, we leverage a contrastive-generative loss function to learn the keyword vectors in the dynamic prompt. For inference, we adopt a crowd-sampling decoding strategy at both semantic and structural levels to further select the best output candidate. We evaluate DSPT5 and various state-of-the-art large language models (LLMs) from multiple perspectives. Results demonstrate that the SOTA LLMs do not provide satisfactory outcomes, while the lightweight DSPT5 can achieve competitive results. To the best of our knowledge, we are the first to build a benchmark dataset and solutions for academic-to-general-audience text paraphrase dataset. Models will be public after acceptance."
},{
  "id": "https://waingram.github.io/publications/choudhury2024etdpc.html",
  "type": "ScholarlyArticle",
  "bibtex_key": "choudhury2024etdpc",
  "bibtex_type": "inproceedings",
  "publication_category": "uncategorized",
  "title": "ETDPC: A Multimodality Framework for Classifying Pages in Electronic Theses and Dissertations",
  "authors": ["Choudhury, Muntabir Hasan","Salsabil, Lamia","Ingram, William A.","Fox, Edward A.","Wu, Jian"],
  "year": "2024",
  "date_published": "2024",
  "container_title": "Thirty-Eighth AAAI Conference on Artificial Intelligence",
  "booktitle": "Thirty-Eighth AAAI Conference on Artificial Intelligence",
  "series": "AAAI 2024",
  "pages": "22878–22884",
  "publisher": "AAAI Press",
  "location": "Vancouver, Canada",
  "doi": "10.1609/AAAI.V38I21.30324",
  "doi_url": "https://doi.org/10.1609/AAAI.V38I21.30324",
  "url": "https://doi.org/10.1609/AAAI.V38I21.30324",
  "abstract": "Electronic theses and dissertations (ETDs) have been proposed, advocated, and generated for more than 25 years. Although ETDs are hosted by commercial or institutional digital library repositories, they are still an understudied type of scholarly big data, partially because they are usually longer than conference and journal papers. Segmenting ETDs will allow researchers to study sectional content. Readers can navigate to particular pages of interest, to discover and explore the content buried in these long documents. Most existing frameworks on document page classification are designed for classifying general documents, and perform poorly on ETDs. In this paper, we propose ETDPC. Its backbone is a two-stream multimodal model with a cross-attention network to classify ETD pages into 13 categories. To overcome the challenge of imbalanced labeled samples, we augmented data for minority categories and employed a hierarchical classifier. ETDPC outperforms the state-of-the-art models in all categories, achieving an F1 of 0.84 – 0.96 for 9 out of 13 categories. We also demonstrated its data efficiency. The code and data can be found on GitHub (https://github.com/lamps-lab/ETDMiner/tree/master/etd_segmentation)."
},{
  "id": "https://waingram.github.io/publications/chekuri2023integrated.html",
  "type": "ScholarlyArticle",
  "bibtex_key": "chekuri2023integrated",
  "bibtex_type": "inproceedings",
  "publication_category": "uncategorized",
  "title": "Integrated Digital Library System for Long Documents and their Elements",
  "authors": ["Chekuri, Satvik","Chandrasekar, Prashant","Banerjee, Bipasha","Park, Sung Hee","Masrourisaadat, Nila","Ahuja, Aman","Ingram, William A.","Fox, Edward A."],
  "year": "2023",
  "date_published": "2023",
  "container_title": "Proceedings of the 2023 ACM/IEEE Joint Conference on Digital Libraries",
  "booktitle": "Proceedings of the 2023 ACM/IEEE Joint Conference on Digital Libraries",
  "series": "JCDL ’23",
  "pages": "13–24",
  "publisher": "IEEE Press",
  "location": "Santa Fe, New Mexico, USA",
  "doi": "10.1109/JCDL57899.2023.00012",
  "doi_url": "https://doi.org/10.1109/JCDL57899.2023.00012",
  "url": "https://doi.org/10.1109/JCDL57899.2023.00012",
  "note": "Nominated for Best Student Paper Award",
  "abstract": "We describe a next-generation integrated Digital Library (DL) system that addresses the numerous goals associated with long documents such as Electronic Theses and Dissertations (ETDs). Our extensible workflow-centric design supports a variety of users/personas (e.g., researchers, curators, and experimenters) who can benefit from improved access to ETDs and the content buried therein. Our approach leverages natural language processing, deep learning, information retrieval, and software engineering methods. The services cover ingesting, storing, curating, analyzing, detecting, extracting, classifying, summarizing, topic modeling, browsing, searching, retrieving, recommending, visualizing/reporting, and interacting with ETDs and derivative text/image-based elements/objects. Workflows connect the services and their APIs, along with UI-based access. We believe our approach can guide others to combine tailored user support, research, and education by way of extensible DLs."
},{
  "id": "https://waingram.github.io/publications/choudhury2023metaenhance.html",
  "type": "ScholarlyArticle",
  "bibtex_key": "choudhury2023metaenhance",
  "bibtex_type": "inproceedings",
  "publication_category": "uncategorized",
  "title": "MetaEnhance: Metadata Quality Improvement for Electronic Theses and Dissertations of University Libraries",
  "authors": ["Choudhury, Muntabir Hasan","Salsabil, Lamia","Jayanetti, Himarsha R.","Wu, Jian","Ingram, William A.","Fox, Edward A."],
  "year": "2023",
  "date_published": "2023",
  "container_title": "Proceedings of the 2023 ACM/IEEE Joint Conference on Digital Libraries",
  "booktitle": "Proceedings of the 2023 ACM/IEEE Joint Conference on Digital Libraries",
  "series": "JCDL ’23",
  "pages": "61–65",
  "publisher": "IEEE Press",
  "location": "Santa Fe, New Mexico, USA",
  "doi": "10.1109/JCDL57899.2023.00019",
  "doi_url": "https://doi.org/10.1109/JCDL57899.2023.00019",
  "url": "https://doi.org/10.1109/JCDL57899.2023.00019",
  "note": "Best Short Paper Award",
  "abstract": "Metadata quality is crucial for discovering digital objects through digital library (DL) interfaces. However, due to various reasons, the metadata of digital objects often exhibits incomplete, inconsistent, and incorrect values. We investigate methods to automatically detect, correct, and canonicalize scholarly metadata, using seven key fields of electronic theses and dissertations (ETDs) as a case study. We propose MetaEnhance, a framework that utilizes state-of-the-art artificial intelligence (AI) methods to improve the quality of these fields. To evaluate MetaEnhance, we compiled a metadata quality evaluation benchmark containing 500 ETDs, by combining subsets sampled using multiple criteria. We evaluated MetaEnhance against this benchmark and found that the proposed methods achieved nearly perfect F1-scores in detecting errors and F1-scores ranging from 0.85 to 1.00 for correcting five of seven key metadata fields. The codes and data are publicly available on GitHub: https://github.com/lamps-lab/ETDMiner/tree/master/metadata_correction."
},{
  "id": "https://waingram.github.io/publications/kahu2021scanbank.html",
  "type": "ScholarlyArticle",
  "bibtex_key": "kahu2021scanbank",
  "bibtex_type": "inproceedings",
  "publication_category": "uncategorized",
  "title": "ScanBank: A Benchmark Dataset for Figure Extraction from Scanned Electronic Theses and Dissertations",
  "authors": ["Kahu, Sampanna Yashwant","Ingram, William A.","Wu, Jian","Fox, Edward A."],
  "year": "2021",
  "date_published": "2021",
  "container_title": "Proceedings of the 2021 ACM/IEEE Joint Conference on Digital Libraries",
  "booktitle": "Proceedings of the 2021 ACM/IEEE Joint Conference on Digital Libraries",
  "series": "JCDL ’21",
  "pages": "565–566",
  "publisher": "IEEE Press",
  "location": "Virtual Event",
  "doi": "10.1109/JCDL52503.2021.00030",
  "doi_url": "https://doi.org/10.1109/JCDL52503.2021.00030",
  "url": "https://doi.org/10.1109/JCDL52503.2021.00030",
  "abstract": "We focus on electronic theses and dissertations (ETDs), aiming to improve access and expand their utility, since more than 6 million are publicly available, and they constitute an important corpus to aid research and education across disciplines. The corpus is growing as new born-digital documents are included, and since millions of older theses and dissertations have been converted to digital form to be disseminated electronically in institutional repositories. In ETDs, as with other scholarly works, figures and tables can communicate a large amount of information in a concise way. Although methods have been proposed for extracting figures and tables from born-digital PDFs, they do not work well with scanned ETDs. Considering this problem, our assessment of state-of-the-art figure extraction systems is that the reason they do not function well on scanned PDFs is that they have only been trained on born-digital documents. To address this limitation, we present ScanBank, a new dataset containing 10 thousand scanned page images, manually labeled by humans as to the presence of the 3.3 thousand figures or tables found therein. We use this dataset to train a deep neural network model based on YOLOv5 to accurately extract figures and tables from scanned ETDs. We pose and answer important research questions aimed at finding better methods for figure extraction from scanned documents. One of those concerns the value for training, of data augmentation techniques applied to born-digital documents which are used to train models better suited for figure extraction from scanned documents. To the best of our knowledge, ScanBank is the first manually annotated dataset for figure and table extraction for scanned ETDs. A YOLOv5-based model, trained on ScanBank, outperforms existing comparable open-source and freely available baseline methods by a considerable margin."
},{
  "id": "https://waingram.github.io/publications/choudhury2021autometa.html",
  "type": "ScholarlyArticle",
  "bibtex_key": "choudhury2021autometa",
  "bibtex_type": "inproceedings",
  "publication_category": "uncategorized",
  "title": "Automatic Metadata Extraction Incorporating Visual Features from Scanned Electronic Theses and Dissertations",
  "authors": ["Choudhury, Muntabir Hasan","Jayanetti, Himarsha R.","Wu, Jian","Ingram, William A.","Fox, Edward A."],
  "year": "2021",
  "date_published": "2021",
  "container_title": "Proceedings of the 2021 ACM/IEEE Joint Conference on Digital Libraries",
  "booktitle": "Proceedings of the 2021 ACM/IEEE Joint Conference on Digital Libraries",
  "series": "JCDL ’21",
  "pages": "230–233",
  "publisher": "IEEE Press",
  "location": "Virtual Event",
  "doi": "10.1109/JCDL52503.2021.00066",
  "doi_url": "https://doi.org/10.1109/JCDL52503.2021.00066",
  "url": "https://doi.org/10.1109/JCDL52503.2021.00066",
  "abstract": "Electronic Theses and Dissertations (ETDs) contain domain knowledge that can be used for many digital library tasks, such as analyzing citation networks and predicting research trends. Automatic metadata extraction is important to build scalable digital library search engines. Most existing methods are designed for born-digital documents such as GROBID, CERMINE, and ParsCit, so they often fail to extract metadata from scanned documents such as for ETDs. Traditional sequence tagging methods mainly rely on text-based features. In this paper, we propose a conditional random field (CRF) model that combines text-based and visual features. To verify the robustness of our model, we extended an existing corpus and created a new ground truth corpus consisting of 500 ETD cover pages with human validated metadata. Our experiments show that CRF with visual features outperformed both a heuristic baseline and a CRF model with only text-based features. The proposed model achieved 81.3%-96% F1 measure on seven metadata fields. The data and source code are publicly available on Google Drive https://tinvurl.com/y8kxzwrp and a GitHub repository https://github.com/lamps-lab/ETDMiner/tree/master/etd_crf, respectively."
},{
  "id": "https://waingram.github.io/publications/choudhury2020heuristic.html",
  "type": "ScholarlyArticle",
  "bibtex_key": "choudhury2020heuristic",
  "bibtex_type": "inproceedings",
  "publication_category": "uncategorized",
  "title": "A Heuristic Baseline Method for Metadata Extraction from Scanned Electronic Theses and Dissertations",
  "authors": ["Choudhury, Muntabir Hasan","Wu, Jian","Ingram, William A.","Fox, Edward A."],
  "year": "2020",
  "date_published": "2020",
  "container_title": "Proceedings of the ACM/IEEE Joint Conference on Digital Libraries in 2020",
  "booktitle": "Proceedings of the ACM/IEEE Joint Conference on Digital Libraries in 2020",
  "series": "JCDL ’20",
  "pages": "515–516",
  "publisher": "Association for Computing Machinery",
  "location": "Virtual Event, China",
  "doi": "10.1145/3383583.3398590",
  "doi_url": "https://doi.org/10.1145/3383583.3398590",
  "url": "https://doi.org/10.1145/3383583.3398590",
  "abstract": "Extracting metadata from scholarly papers is an important text mining problem. Widely used open-source tools such as GROBID are designed for born-digital scholarly papers but often fail for scanned documents, such as Electronic Theses and Dissertations (ETDs). Here we present a preliminary baseline work with a heuristic model to extract metadata from the cover pages of scanned ETDs. The process started with converting scanned pages into images and then text files by applying OCR tools. Then a series of carefully designed regular expressions for each field is applied, capturing patterns for seven metadata fields: titles, authors, years, degrees, academic programs, institutions, and advisors. The method is evaluated on a ground truth dataset comprised of rectified metadata provided by the Virginia Tech and MIT libraries. Our heuristic method achieves an accuracy of up to 97% on the fields of the ETD text files. Our method poses a strong baseline for machine learning based methods. To our best knowledge, this is the first work attempting to extract metadata from non-born-digital ETDs."
},
    {
  "id": "https://waingram.github.io/publications/obadage2025wadl.html",
  "type": "ScholarlyArticle",
  "bibtex_key": "obadage2025wadl",
  "bibtex_type": "inproceedings",
  "publication_category": "uncategorized",
  "title": "Toward Robust URL Extraction for Open Science: A Study of arXiv File Formats and Temporal Trends",
  "authors": ["Obadage, Rochana R.","Salsabil, Lamia","Alam, Sawood","Ingram, William A.","Banarjee, Bipasha","Fox, Edward A.","Wu, Jian"],
  "year": "2025",
  "date_published": "2025",
  "container_title": "36th ACM Conference on Hypertext and Social Media",
  "booktitle": "36th ACM Conference on Hypertext and Social Media",
  "maintitle": "Web Archiving and Digital Libraries (WADL) Workshop 2025",
  "series": "HT 2025",
  "location": "Chicago, Illinois, USA",
  "url": "https://wadlworkshop.github.io/2025/content/3-acceptedpapers.html",
  "note": "Hybrid event",
  "abstract": "In this work, we study how URL extraction results depend on input format. We compiled a pilot dataset by extracting URLs from 10 arXiv papers and used the same heuristic method to extract URLs from four formats derived from the PDF files or the source LaTeX files. We found that accurate and complete URL extraction from any single format or a combination of multiple formats is challenging, with the best F1-score of 0.71. Using the pilot dataset, we evaluate extraction performance across formats and show that structured formats like HTML and XML produce more accurate results than PDFs or Text. Combining multiple formats improves coverage, especially when targeting research-critical resources. We further apply URL extraction on two tasks, namely classifying URLs into open-access datasets and software and the others, and analyzing the trend of URLs usage in arXiv papers from 1992 to 2024. These results suggest that using a combination of multiple formats achieves better performance on URL extraction than a single format, and the number of URLs in arXiv papers has been steadily increasing since 1992 to 2014 and has been drastically increasing from 2014 to 2024. The dataset and the Jupyter notebooks used for the preliminary analysis are publicly available at https://github.com/lamps-lab/arxiv-urls."
},{
  "id": "https://waingram.github.io/publications/ingram2025llm4eval.html",
  "type": "ScholarlyArticle",
  "bibtex_key": "ingram2025llm4eval",
  "bibtex_type": "inproceedings",
  "publication_category": "uncategorized",
  "title": "When LLMs Disagree: Diagnosing Relevance Filtering Bias and Retrieval Divergence in SDG Search",
  "authors": ["Ingram, William A.","Banerjee, Bipasha","Fox, Edward A."],
  "year": "2025",
  "date_published": "2025",
  "container_title": "48th International ACM SIGIR Conference on Research and Development in Information Retrieval",
  "booktitle": "48th International ACM SIGIR Conference on Research and Development in Information Retrieval",
  "maintitle": "LLM4Eval \\@SIGIR 2025: The Third Workshop on Large Language Models for Evaluation in Information Retrieval",
  "location": "Padua, Italy",
  "url": "https://llm4eval.github.io/SIGIR2025/papers/",
  "abstract": "Large language models (LLMs) are increasingly used to assign document relevance labels in information retrieval pipelines, especially in domains lacking human-labeled data. However, different models often disagree on borderline cases, raising concerns about how such disagreement affects downstream retrieval. This study examines labeling disagreement between two open-weight LLMs, LLaMA and Qwen, on a corpus of scholarly abstracts related to Sustainable Development Goals (SDGs) 1, 3, and 7. We isolate disagreement subsets and examine their lexical properties, rank-order behavior, and classification predictability. Our results show that model disagreement is systematic, not random: disagreement cases exhibit consistent lexical patterns, produce divergent top-ranked outputs under shared scoring functions, and are distinguishable with AUCs above 0.74 using simple classifiers. These findings suggest that LLM-based filtering introduces structured variability in document retrieval, even under controlled prompting and shared ranking logic. We propose using classification disagreement as an object of analysis in retrieval evaluation, particularly in policy-relevant or thematic search tasks."
},{
  "id": "https://waingram.github.io/publications/banerjee2024automating.html",
  "type": "ScholarlyArticle",
  "bibtex_key": "banerjee2024automating",
  "bibtex_type": "inproceedings",
  "publication_category": "uncategorized",
  "title": "Automating Chapter-Level Classification for Electronic Theses and Dissertations",
  "authors": ["Banerjee, Bipasha","Ingram, William A.","Fox, Edward A."],
  "year": "2024",
  "date_published": "2024",
  "container_title": "2024 IEEE International Conference on Big Data",
  "booktitle": "2024 IEEE International Conference on Big Data",
  "maintitle": "The 7th Computational Archival Science (CAS) Workshop",
  "series": "BigData ’24",
  "pages": "2400–2409",
  "publisher": "IEEE",
  "location": "Washington, DC, USA",
  "doi": "10.1109/BigData62323.2024.10825418",
  "doi_url": "https://doi.org/10.1109/BigData62323.2024.10825418",
  "url": "https://doi.org/10.1109/BigData62323.2024.10825418",
  "abstract": "Traditional archival practices for describing electronic theses and dissertations (ETDs) rely on broad, high-level metadata schemes that fail to capture the depth, complexity, and interdisciplinary nature of these long scholarly works. The lack of detailed, chapter-level content descriptions impedes researchers’ ability to locate specific sections or themes, thereby reducing discoverability and overall accessibility. By providing chapter-level metadata information, we improve the effectiveness of ETDs as research resources. This makes it easier for scholars to navigate them efficiently and extract valuable insights. The absence of such metadata further obstructs interdisciplinary research by obscuring connections across fields, hindering new academic discoveries and collaboration. In this paper, we propose a machine learning and AI-driven solution to automatically categorize ETD chapters. This solution is intended to improve discoverability and promote understanding of chapters. Our approach enriches traditional archival practices by providing context-rich descriptions that facilitate targeted navigation and improved access. We aim to support interdisciplinary research and make ETDs more accessible. By providing chapter-level classification labels and using them to index in our developed prototype system, we make content in ETD chapters more discoverable and usable for a diverse range of scholarly needs. Implementing this AI-enhanced approach allows archives to serve researchers better, enabling efficient access to relevant information and supporting deeper engagement with ETDs. This will increase the impact of ETDs as research tools, foster interdisciplinary exploration, and reinforce the role of archives in scholarly communication within the data-intensive academic landscape."
},{
  "id": "https://waingram.github.io/publications/ahuja2023new.html",
  "type": "ScholarlyArticle",
  "bibtex_key": "ahuja2023new",
  "bibtex_type": "inproceedings",
  "publication_category": "uncategorized",
  "title": "A New Annotation Method and Dataset for Layout Analysis of Long Documents",
  "authors": ["Ahuja, Aman","Dinh, Kevin","Dinh, Brian","Ingram, William A.","Fox, Edward"],
  "year": "2023",
  "date_published": "2023",
  "container_title": "Companion Proceedings of the ACM Web Conference 2023",
  "booktitle": "Companion Proceedings of the ACM Web Conference 2023",
  "maintitle": "3rd International Workshop on Scientific Knowledge Representation, Discovery, and Assessment (Sci-K 2023)",
  "series": "WWW ’23 Companion",
  "pages": "834–842",
  "publisher": "Association for Computing Machinery",
  "location": "Austin, TX, USA",
  "doi": "10.1145/3543873.3587609",
  "doi_url": "https://doi.org/10.1145/3543873.3587609",
  "url": "https://doi.org/10.1145/3543873.3587609",
  "abstract": "Parsing long documents, such as books, theses, and dissertations, is an important component of information extraction from scholarly documents. Layout analysis methods based on object detection have been developed in recent years to help with PDF document parsing. However, several challenges hinder the adoption of such methods for scholarly documents such as theses and dissertations. These include (a) the manual effort and resources required to annotate training datasets, (b) the scanned nature of many documents and the inherent noise present resulting from the capture process, and (c) the imbalanced distribution of various types of elements in the documents. In this paper, we address some of the challenges related to object detection based layout analysis for scholarly long documents. First, we propose an AI-aided annotation method to help develop training datasets for object detection based layout analysis. This leverages the knowledge of existing trained models to help human annotators, thus reducing the time required for annotation. It also addresses the class imbalance problem, guiding annotators to focus on labeling instances of rare classes. We also introduce ETD-ODv2, a novel dataset for object detection on electronic theses and dissertations (ETDs). In addition to the page images included in ETD-OD [1], our dataset consists of more than 16K manually annotated page images originating from 100 scanned ETDs, along with annotations for 20K page images primarily consisting of rare classes that were labeled using the proposed framework. The new dataset thus covers a diversity of document types, viz., scanned and born-digital, and is better balanced in terms of training samples from different object categories."
},{
  "id": "https://waingram.github.io/publications/salsabil2022reproducibility.html",
  "type": "ScholarlyArticle",
  "bibtex_key": "salsabil2022reproducibility",
  "bibtex_type": "inproceedings",
  "publication_category": "uncategorized",
  "title": "A Study of Computational Reproducibility Using URLs Linking to Open Access Datasets and Software",
  "authors": ["Salsabil, Lamia","Wu, Jian","Choudhury, Muntabir Hasan","Ingram, William A.","Fox, Edward A.","Rajtmajer, Sarah M.","Giles, C. Lee"],
  "year": "2022",
  "date_published": "2022",
  "container_title": "Companion Proceedings of the Web Conference 2022",
  "booktitle": "Companion Proceedings of the Web Conference 2022",
  "maintitle": "Sci-K 2022 - International Workshop on Scientific Knowledge: Representation, Discovery, and Assessment",
  "series": "WWW ’22 Companion",
  "pages": "784–788",
  "location": "Virtual Event, Lyon, France",
  "doi": "10.1145/3487553.3524658",
  "doi_url": "https://doi.org/10.1145/3487553.3524658",
  "url": "https://doi.org/10.1145/3487553.3524658",
  "abstract": "Datasets and software packages are considered important resources that can be used for replicating computational experiments. With the advocacy of Open Science and the growing interest of investigating reproducibility of scientific claims, including URLs linking to publicly available datasets and software packages has become an institutionalized part of research publications. In this preliminary study, we investigated the disciplinary dependency and chronological trends of including open access datasets and software (OADS) in electronic theses and dissertations (ETDs), based on a hybrid classifier called OADSClassifier, consisting of a heuristic and a supervised learning model. The classifier achieves the best F1 of 0.92. We found that the inclusion of OADS-URLs exhibited a strong disciplinary dependence and the fraction of ETDs containing OADS-URLs has been gradually increasing over the past 20 years. We developed and share a ground truth corpus consisting of 500 manually labeled sentences containing URLs from scientific papers. The dataset and source code are available at https://github.com/lamps-lab/oadsclassifier."
},{
  "id": "https://waingram.github.io/publications/banerjee2022applications.html",
  "type": "ScholarlyArticle",
  "bibtex_key": "banerjee2022applications",
  "bibtex_type": "inproceedings",
  "publication_category": "uncategorized",
  "title": "Applications of Data Analysis on Scholarly Long Documents",
  "authors": ["Banerjee, Bipasha","Ingram, William A.","Wu, Jian","Fox, Edward A."],
  "year": "2022",
  "date_published": "2022",
  "container_title": "2022 IEEE International Conference on Big Data",
  "booktitle": "2022 IEEE International Conference on Big Data",
  "maintitle": "The 7th Computational Archival Science (CAS) Workshop",
  "series": "Big Data ’22",
  "pages": "2473–2481",
  "location": "Osaka, Japan",
  "doi": "10.1109/BigData55660.2022.10020935",
  "doi_url": "https://doi.org/10.1109/BigData55660.2022.10020935",
  "url": "https://doi.org/10.1109/BigData55660.2022.10020935",
  "abstract": "Theses and dissertations record the work of graduate students and are typically a requirement at the culmination of the graduate degree. Thus, they contain important information that reflects a graduate student’s exploration of their research topic. Although print submission was commonplace early on, most universities now require students to submit an electronic version. The electronic document referred to as an ETD henceforth has become the primary way of submitting, storing, and distributing graduate work. Millions of such documents have been created in the past two decades. They are maintained and stored by university libraries, digital repositories, and other academic publishing companies. These online repositories have increased access to such documents. Nonetheless, these documents fail to meet the needs of researchers, who find it challenging to find and access knowledge from such long documents. The worldwide ETD collection has increased in volume to become what is known as ‘scholarly big data’. Apart from the text body, these documents contain a myriad of other pieces of knowledge like tables, figures, definitions, literature reviews, and references. There is a growing demand amongst researchers across various domains to make this collection of scholarly documents more computationally driven. We use ideas from natural language processing, information retrieval, and machine learning to excavate knowledge from this rich information source. In this paper, we examine some of the challenges we face, identify some key areas of exploration, and discuss our methods to mitigate the challenges."
},
    {
  "id": "https://waingram.github.io/publications/ingram2024agentic.html",
  "type": "ScholarlyArticle",
  "bibtex_key": "ingram2024agentic",
  "bibtex_type": "inproceedings",
  "publication_category": "uncategorized",
  "title": "Agentic AI for Improving Precision in Identifying Contributions to Sustainable Development Goals",
  "authors": ["Ingram, William A.","Banerjee, Bipasha","Fox, Edward A."],
  "year": "2024",
  "date_published": "2024",
  "container_title": "2024 IEEE International Conference on Big Data",
  "booktitle": "2024 IEEE International Conference on Big Data",
  "series": "BigData ’24",
  "pages": "8677–8679",
  "publisher": "IEEE",
  "location": "Washington, DC, USA",
  "doi": "10.1109/BigData62323.2024.10825072",
  "doi_url": "https://doi.org/10.1109/BigData62323.2024.10825072",
  "url": "https://doi.org/10.1109/BigData62323.2024.10825072",
  "note": "Poster Presentation",
  "abstract": "As research institutions increasingly commit to supporting the United Nations’ Sustainable Development Goals (SDGs), there is a pressing need to accurately assess their research output against these goals. Current approaches, primarily reliant on keyword-based Boolean search queries, conflate incidental keyword matches with genuine contributions, reducing retrieval precision and complicating benchmarking efforts. This study investigates the application of autoregressive Large Language Models (LLMs) as evaluation agents to identify relevant scholarly contributions to SDG targets in scholarly publications. Using a dataset of academic abstracts retrieved via SDG-specific keyword queries, we demonstrate that small, locally-hosted LLMs can differentiate semantically relevant contributions to SDG targets from documents retrieved due to incidental keyword matches, addressing the limitations of traditional methods. By leveraging the contextual understanding of LLMs, this approach provides a scalable framework for improving SDG-related research metrics and informing institutional reporting."
},{
  "id": "https://waingram.github.io/publications/salsabil2024toward.html",
  "type": "ScholarlyArticle",
  "bibtex_key": "salsabil2024toward",
  "bibtex_type": "inproceedings",
  "publication_category": "uncategorized",
  "title": "Toward Automatically Improving Metadata Quality of Electronic Theses and Dissertations at Scale",
  "authors": ["Salsabil, Lamia","Wu, Jian","Ingram, William A.","Fox, Edward A."],
  "year": "2024",
  "date_published": "2024",
  "container_title": "2024 IEEE International Conference on Big Data",
  "booktitle": "2024 IEEE International Conference on Big Data",
  "series": "BigData ’24",
  "pages": "8825–8827",
  "publisher": "IEEE",
  "location": "Washington, DC, USA",
  "doi": "10.1109/BigData62323.2024.10825738",
  "doi_url": "https://doi.org/10.1109/BigData62323.2024.10825738",
  "url": "https://doi.org/10.1109/BigData62323.2024.10825738",
  "note": "Poster Presentation",
  "abstract": "Metadata is crucial for the accessibility, interoperability, and long-term usability of digital objects such as Electronic Theses and Dissertations (ETDs). In large-scale academic repositories, poor metadata quality can significantly impede the discovery and use of resources. This study addresses persistent issues of incomplete and inconsistent ETD metadata collected from U.S. university libraries. However, directly applying machine learning-based error detection and correction models may introduce unwanted errors due to the imperfection of these models. We propose an ETD metadata improvement system (ETDMIS) that mitigates the problem by integrating metadata validation and a version control mechanism. Our system was applied to a dataset of 100,000 U.S. ETDs, resulting in substantial improvements in metadata quality. Scalability was demonstrated by processing the entire dataset efficiently. The original and the enhanced metadata for the 100,000 ETDs are publicly accessible at https://github.com/lamps-lab/ETDMiner/tree/master/Meta100K."
},{
  "id": "https://waingram.github.io/publications/banerjee2024making.html",
  "type": "ScholarlyArticle",
  "bibtex_key": "banerjee2024making",
  "bibtex_type": "inproceedings",
  "publication_category": "uncategorized",
  "title": "Making History Readable",
  "authors": ["Banerjee, Bipasha","Goyne, Jennifer","Ingram, William A."],
  "year": "2024",
  "date_published": "2024",
  "container_title": "2024 IEEE International Conference on Big Data",
  "booktitle": "2024 IEEE International Conference on Big Data",
  "series": "BigData ’24",
  "pages": "8620–8622",
  "publisher": "IEEE",
  "location": "Washington, DC, USA",
  "doi": "10.1109/BigData62323.2024.10826028",
  "doi_url": "https://doi.org/10.1109/BigData62323.2024.10826028",
  "url": "https://doi.org/10.1109/BigData62323.2024.10826028",
  "note": "Poster Presentation",
  "abstract": "The Virginia Tech University Libraries (VTUL) Digital Library Platform (DLP) hosts digital collections that offer our users access to a wide variety of documents of historical and cultural importance. These collections are not only of academic importance but also provide our users with a glance at local historical events. Our DLP contains collections comprising digital objects featuring complex layouts, faded imagery, and hard-to-read handwritten text, which makes providing online access to these materials challenging. To address these issues, we integrate AI into our DLP workflow and convert the text in the digital objects into a machine-readable format. To enhance the user experience with our historical collections, we use custom AI agents for handwriting recognition, text extraction, and large language models (LLMs) for summarization. This poster highlights three collections focusing on handwritten letters, newspapers, and digitized topographic maps. We discuss the challenges with each collection and detail our approaches to address them. Our proposed methods aim to enhance the user experience by making the contents in these collections easier to search and navigate."
},{
  "id": "https://waingram.github.io/publications/ingram2023maximizing.html",
  "type": "ScholarlyArticle",
  "bibtex_key": "ingram2023maximizing",
  "bibtex_type": "inproceedings",
  "publication_category": "uncategorized",
  "title": "Maximizing Equitable Reach and Accessibility of ETDs",
  "authors": ["Ingram, William","Wu, Jian","Fox, Edward"],
  "year": "2023",
  "date_published": "2023",
  "container_title": "Proceedings of the 23rd ACM/IEEE-CS Joint Conference on Digital Libraries",
  "booktitle": "Proceedings of the 23rd ACM/IEEE-CS Joint Conference on Digital Libraries",
  "series": "JCDL ’23",
  "pages": "256–257",
  "publisher": "IEEE Press",
  "location": "Santa Fe, New Mexico, USA",
  "doi": "10.1109/JCDL57899.2023.00049",
  "doi_url": "https://doi.org/10.1109/JCDL57899.2023.00049",
  "url": "https://doi.org/10.1109/JCDL57899.2023.00049",
  "note": "Poster Presentation",
  "abstract": "This poster addresses accessibility issues of electronic theses and dissertations (ETDs) in digital libraries (DLs). ETDs are available primarily as PDF files, which present barriers to equitable access, especially for users with visual impairments, cognitive or learning disabilities, or for anyone needing more efficient and effective ways of finding relevant information within these long documents. We propose using AI techniques, including natural language processing (NLP), computer vision, and text analysis, to convert PDFs into machine-readable HTML documents with semantic tags and structure, extracting figures and tables, and generating summaries and keywords. Our goal is to increase the accessibility of ETDs and to make this important scholarship available to a wider audience."
},{
  "id": "https://waingram.github.io/publications/ingram2023ai.html",
  "type": "ScholarlyArticle",
  "bibtex_key": "ingram2023ai",
  "bibtex_type": "inproceedings",
  "publication_category": "uncategorized",
  "title": "AI and Public Archives: Collaborative Leadership for Responsible Adoption",
  "authors": ["Ingram, William A.","Dikow, Rebecca B.","Potter, Abigail","Ferriter, Meghan","Reilly, Jill"],
  "year": "2023",
  "date_published": "2023",
  "container_title": "Proceedings of the 2023 ACM/IEEE Joint Conference on Digital Libraries",
  "booktitle": "Proceedings of the 2023 ACM/IEEE Joint Conference on Digital Libraries",
  "series": "JCDL ’23",
  "pages": "323–324",
  "publisher": "IEEE Press",
  "location": "Santa Fe, New Mexico, USA",
  "doi": "10.1109/JCDL57899.2023.00079",
  "doi_url": "https://doi.org/10.1109/JCDL57899.2023.00079",
  "url": "https://doi.org/10.1109/JCDL57899.2023.00079",
  "note": "Panel Discussion",
  "abstract": "This panel aims to address the measures that cultural heritage institutions must undertake to ensure that their approaches to AI adoption are inclusive and equitable. Panelists from Virginia Tech, the National Archives and Records Administration, the Smithsonian Institution, and the Library of Congress will provide an overview of the current state of AI in large public-serving libraries, archives, and museums and will lead a discussion on how to incorporate AI in a responsible manner. The discussion builds on the panelists’ ongoing work to critically examine the potential impacts and implications of the use of AI technology by cultural heritage institutions and strategies for addressing and mitigating its negative effects. The panel will engage the audience in a group discussion that aims to consider possible biases and harms that may arise from the use of automated technologies, build inclusive practices in AI development and implementation, and form new partnerships to promote the use of ethical AI in libraries, archives, and museums."
},{
  "id": "https://waingram.github.io/publications/ingram2022etds.html",
  "type": "CreativeWork",
  "bibtex_key": "ingram2022etds",
  "bibtex_type": "misc",
  "publication_category": "uncategorized",
  "title": "Electronic Theses and Dissertations: A Research Corpus of Scholarly Big Data",
  "authors": ["Ingram, William A.","Wu, Jian","Fox, Edward A."],
  "year": "2022",
  "date_published": "2022",
  "container_title": "GL24 — Twenty-Fourth International Conference on Grey Literature, GreyNet International",
  "howpublished": "GL24 — Twenty-Fourth International Conference on Grey Literature, GreyNet International",
  "url": "https://doi.org/10.5446/59869",
  "abstract": "Thanks to the efforts of university libraries, graduate programs, and the open repository movement, millions of Electronic Theses and Dissertations (ETDs) are publicly disseminated online. This enormous volume of scholarship exhibits many interesting characteristics, which make it a valuable corpus for developing new technologies based on computational analysis of academic writing. Digital archives of scholarly publications have been used to support research, but ETDs are unique in that they are much longer than most conference papers and journal articles. ETDs contain novel ideas and findings that contribute significantly to the subject areas of their authors. They often contain useful figures, tables, and equations, as well as extensive literature reviews, bibliographies, and links to other publications. As grey literature, access to ETDs is not controlled by commercial publishers, copyright belongs to the authors, and most are disseminated under permissive copyright licenses.\n                  We have constructed a large document corpus consisting of full-text PDFs and metadata for more than 500,000 ETDs retrieved from university institutional repositories across the United States. The ETD corpus supports research projects conducted by librarians, computer science faculty, undergraduates, master’s students, and doctoral students studying natural language processing, information retrieval, bibliometrics, language modeling, and other areas of investigation related to scholarly big data. So far, analysis of the ETD corpus has aided the creation of new models for extracting figures and tables from academic papers, segmenting long documents into chapters and sections, topic modeling algorithms, document classification, summarization algorithms, and improved digital library user interfaces. "
},{
  "id": "https://waingram.github.io/publications/ahuja2022analyzing.html",
  "type": "CreativeWork",
  "bibtex_key": "ahuja2022analyzing",
  "bibtex_type": "misc",
  "publication_category": "uncategorized",
  "title": "Analyzing and Navigating ETDs Using Topic Models",
  "authors": ["Ahuja, Aman","Ingram, William A.","Mao, Chenyu","He, Chongyu","Wei, Jianchi","Fox, Edward A."],
  "year": "2022",
  "date_published": "2022",
  "container_title": "25th International Symposium on Electronic Theses and Dissertations",
  "howpublished": "25th International Symposium on Electronic Theses and Dissertations",
  "location": "Novi Sad, Serbia",
  "url": "https://etd2022.uns.ac.rs/",
  "abstract": "Electronic theses and dissertations (ETDs) contain valuable knowledge that can be useful in a wide range of research areas. Accordingly, we are building electronic infrastructure leveraging advanced work on digital libraries, for discovering and accessing the knowledge buried in ETDs. In this paper we focus on our work to incorporate topic modeling into digital libraries for ETDs. We present ETD-Topics, a framework that extracts topics from a large text corpus in an unsupervised way. The representations learnt from topic models can be useful for downstream tasks such as searching and/or browsing documents by topic, document recommendation, topic recommendation, and describing temporal topic trends (e.g., from the perspective of disciplines or universities)."
},{
  "id": "https://waingram.github.io/publications/uddin2021building.html",
  "type": "ScholarlyArticle",
  "bibtex_key": "uddin2021building",
  "bibtex_type": "inproceedings",
  "publication_category": "uncategorized",
  "title": "Building A Large Collection of Multi-domain Electronic Theses and Dissertations",
  "authors": ["Uddin, Sami","Banerjee, Bipasha","Wu, Jian","Ingram, William A.","Fox, Edward A."],
  "year": "2021",
  "date_published": "2021",
  "container_title": "2021 IEEE International Conference on Big Data",
  "booktitle": "2021 IEEE International Conference on Big Data",
  "series": "BigData ’21",
  "pages": "6043–6045",
  "publisher": "IEEE",
  "location": "Orlando, FL, USA",
  "doi": "10.1109/BIGDATA52589.2021.9672058",
  "doi_url": "https://doi.org/10.1109/BIGDATA52589.2021.9672058",
  "url": "https://doi.org/10.1109/BIGDATA52589.2021.9672058",
  "note": "Poster Presentation",
  "abstract": "In this work, we report our progress on building a collection containing over 450k Electronic Theses and Dissertations (ETDs), including full-text and metadata. Our goal is to close the gap of accessibility between long text and short text documents, and to create a new research opportunity for the scholarly community. For that, we developed an ETD Ingestion Framework (EIF) that automatically harvests metadata and PDFs of ETDs from university libraries. We faced multiple challenges and learned many lessons during the process, that led to proposed solutions to overcome/mitigate the limitations of the current data. We also described the data that we have collected. We hope our methods will be useful for building similar collections from university libraries and that the data can be used for research and education."
},{
  "id": "https://waingram.github.io/publications/banerjee2021applications.html",
  "type": "CreativeWork",
  "bibtex_key": "banerjee2021applications",
  "bibtex_type": "misc",
  "publication_category": "uncategorized",
  "title": "Applications of Mining ETDs",
  "authors": ["Banerjee, Bipasha","Ingram, William A.","Wu, Jian","Fox, Edward A."],
  "year": "2021",
  "date_published": "2021",
  "container_title": "24th International Symposium on Electronic Theses and Dissertations",
  "howpublished": "24th International Symposium on Electronic Theses and Dissertations",
  "location": "Abu Dhabi, UAE",
  "url": "https://doi.org/10.26226/morressier.614c9b8c87a68d83cb5d59b2",
  "abstract": "Theses and dissertations contain a wealth of knowledge reflecting graduate students’ exploration in a scholarly domain. Although print submission was common practice early on, ETDs have become the predominant format for submitting, archiving, and disseminating graduate work. Over the past 25 years, millions of ETDs have been created, collected, and shared with the world through online digital repositories run by university libraries and scholarly publishing companies. Paper theses and dissertations have been replaced with PDFs; but for the most part, digital collections aren’t much different than the analog libraries they replaced. Online digital libraries of ETDs have greatly increased the exposure of graduate research, nonetheless, they fail to meet the needs of researchers, who find it hard to discover and access the knowledge buried in these long documents. The worldwide collection of ETDs has grown to become “scholarly big data” (Giles, 2013), consisting of myriad facts and descriptions of new knowledge, tables and figures, terms and definitions, references and literature reviews. There is a growing demand among researchers for collections of scholarly content to support computationally-driven research. This paper describes our efforts to create a computationally amenable corpus of ETDs. We use ideas and techniques from bibliometrics, machine learning, information retrieval, and natural language processing to mine knowledge from this rich information source. We examine some of the challenges we face, discuss our methods, and explore the results.\n                  Mining ETDs can be challenging as they are scattered across countless repositories and digital libraries. Despite efforts to establish standards of interoperability among scholarly repositories, accessing full-text on a large scale is surprisingly difficult. We set out with a goal of building a research corpus of at least 200,000 ETDs and their associated metadata from open repositories across the U.S. Harvesting full-text PDFs from institutional repositories involves creating extemporaneous web crawling scripts, most of which only work for the individual repository they were created for. Once downloaded, full-text representations must be extracted from PDF documents. This process can vary depending on whether the ETDs were “born digital” or if they were created by scanning paper documents.\n                  Modern advances in text mining and analytics have equipped researchers with new tools and novel ways to extract knowledge and understanding from text. Most techniques have been developed and tested on shorter documents, such as web pages and news articles. But ETDs are book-length documents. Like books, ETDs are organized into chapters and sections. A key aspect of text mining is establishing structure in unstructured data. For ETDs, this is a non-trivial process because, unlike some other digital formats like XML, PDF is an unstructured data format - so the structure of an ETD (e.g., chapters and sections) is usually not machine-readable. It would be useful to extract single chapters from an ETD so that they can be analyzed individually. Automatic chapter segmentation and extraction facilitate many downstream benefits. For instance, most ETDs contain deep and well-researched literature reviews.\n                  Extracted literature review chapters from ETDs could be indexed and made available as useful documents in their own right. We discuss how effective chapter segmentation can be applied to a large corpus of ETDs algorithmically. Our approach to chapter segmentation uses machine learning to predict which lines of text represent chapter headings based on lexical and syntactic features extracted from the text.\n                  Document classification and categorization is a long-established intellectual practice of libraries that is indispensable to information organization. However, the task of manually classifying millions of ETDs is untenable, so the onus has generally fallen on authors to assign subject categories or keywords to their own work. We demonstrate how subject categories can be generated for ETDs automatically using machine learning. Moreover, we show that classification can be done at the chapter level.\n                  The popularity of interdisciplinary research is surging (Millar, 2013). As universities encourage interdisciplinary approaches to research, the trend is born out in graduate research output, including ETDs. Using techniques from information extraction and natural language processing, we demonstrate how research topics can be mined from the text of ETDs, we explore changes in popularity of graduate research topics over time and examine the evolution of interdisciplinarity in graduate research. In addition, we show how chapter-level classification can be used to more accurately describe an interdisciplinary ETD, thus increasing its potential for discovery and impact.\n                  In addition to algorithmic classification, we explore ways of automatically summarizing ETDs and their chapters. Automatic summarization aims to identify the most important information in a document and express this information to the reader in a concise, factually correct format (Wu et al., 2021). Most ETDs contain an abstract that broadly describes the work. However, for many of the reasons mentioned above, it is useful to provide chapter-level summarization. Despite the wealth of knowledge and information contained in ETDs, the documents are simply too long to be considered by today’s busy researchers, who are already deluged with the vast amount of scholarly literature available to them. Providing a summary for each chapter helps researchers quickly identify individual chapters of interest and provide a point of entry for reading.\n                  Scholarly text mining is gaining popularity with researchers as its methods have been shown to identify unseen patterns and uncover new knowledge. Our research explores how a large corpus of ETDs can be made computationally amenable and demonstrates various applications of text mining and information extraction. We believe this work will lead to expanded service offerings by libraries, encourage other researchers to use ETDs for computational analysis, and ultimately raise the impact of graduate research.\n                  References\n                  Giles, C. L. (2013). Scholarly big data: Information extraction and data mining. Proceedings of the 22nd ACM international conference on Information & Knowledge Management, 1–2. https://doi.org/10.1145/2505515.2527109\n                  Millar, M. M. (2013). Interdisciplinary research and the early career: The effect of interdisciplinary dissertation research on career placement and publication productivity of doctoral graduates in the sciences. Research Policy, 42 (5), 1152–1164. https://doi.org/10.1016/j.respol.2013.02.004\n                  Wu, Z., Koncel-Kedziorski, R., Ostendorf, M., & Hajishirzi, H. (2021). Extracting summary knowledge graphs from long documents. arXiv:2009.09162 [cs]. Retrieved August 15, 2021, from https://arxiv.org/abs/2009.09162"
},{
  "id": "https://waingram.github.io/publications/ingram2021serverless.html",
  "type": "CreativeWork",
  "bibtex_key": "ingram2021serverless",
  "bibtex_type": "misc",
  "publication_category": "uncategorized",
  "title": "Why and How We Went Serverless, and How You Can Too",
  "authors": ["Chen, Yinlin","Ingram, William A."],
  "year": "2021",
  "date_published": "2021",
  "container_title": "CNI: Coalition for Networked Information Spring 2021 Membership Meeting",
  "howpublished": "CNI: Coalition for Networked Information Spring 2021 Membership Meeting",
  "url": "https://www.cni.org/topics/digital-curation/why-and-how-we-went-serverless-and-how-you-can-too",
  "abstract": "In this presentation, we will share our experience of adopting serverless techniques and building the next generation of the digital library platform in the AWS cloud. We use this platform to manage complex digital objects and preserve large-scale datasets, which was very challenging for us to build on-premise on a similar scale in storage, networking, scalability, availability, etc. We further present how serverless removes technical barriers and how we now can take a more precise cost management control, resource utilization, and automation we have never been able to achieve before."
},{
  "id": "https://waingram.github.io/publications/ingram2020mining.html",
  "type": "CreativeWork",
  "bibtex_key": "ingram2020mining",
  "bibtex_type": "misc",
  "publication_category": "uncategorized",
  "title": "Mining ETDs for Trends in Graduate Research",
  "authors": ["Ingram, William A."],
  "year": "2020",
  "date_published": "2020",
  "container_title": "CNI: Coalition for Networked Information Fall 2020 Membership Meeting",
  "howpublished": "CNI: Coalition for Networked Information Fall 2020 Membership Meeting",
  "url": "https://www.cni.org/topics/electronic-theses-dissertations-etds/mining-etds-for-trends-in-graduate-research",
  "abstract": "Our ongoing research project applies computational analysis and text mining techniques to a large corpus of electronic theses and dissertations (ETDs) in order to gain insight into the evolution of graduate research topics. We analyze a dataset made up of over 1.3 million full-text ETDs and their associated metadata, spanning the years 2000 to 2018, accessed via the ProQuest TDM Studio. We employ methods such as co-occurrence graph analysis to visualize trends in the data and draw conclusions by analyzing its evolution. We share the insights gained through text and data mining the ETD corpus, how different topics and disciplines overlap and thus map the interdisciplinarity among them, the evolution of interdisciplinarity in graduate research, and areas of scholarly growth within and across disciplines. This project was supported in part by ProQuest, which provided access to TDM Studio and the ProQuest Dissertations & Theses Global corpus. This project was also made possible in part by the Institute of Museum and Library Services (lg-37-19-0078-19)."
},{
  "id": "https://waingram.github.io/publications/tuttle2020multitenancy.html",
  "type": "ScholarlyArticle",
  "bibtex_key": "tuttle2020multitenancy",
  "bibtex_type": "inproceedings",
  "publication_category": "uncategorized",
  "title": "Multi-tenancy Cloud Access and Preservation: Virginia Tech Digital Libraries Platform",
  "authors": ["Tuttle, James","Chen, Yinlin","Jiang, Tingting","Hunter, Lee","Waldren, Andrea","Ghosh, Soumik","Ingram, William A."],
  "year": "2020",
  "date_published": "2020",
  "container_title": "Proceedings of the ACM/IEEE Joint Conference on Digital Libraries in 2020",
  "booktitle": "Proceedings of the ACM/IEEE Joint Conference on Digital Libraries in 2020",
  "series": "JCDL ’20",
  "pages": "557–558",
  "publisher": "Association for Computing Machinery",
  "location": "Virtual Event, China",
  "doi": "10.1145/3383583.3398624",
  "doi_url": "https://doi.org/10.1145/3383583.3398624",
  "url": "https://doi.org/10.1145/3383583.3398624",
  "abstract": "Virginia Tech Libraries has developed a cloud-native, microservervices-based digital libraries platform to consolidate diverse access and preservation infrastructure into a set of flexible, independent microservices in Amazon Web Services. We have been an implementer and contributor to various community digital library and repository projects including DSpace, Fedora, and Samvera3. However, the complexity and cost of maintaining disparate application stacks have reduced our capacity to build new infrastructure. Virginia Tech has a long history of participation in and contribution to community-driven Open Source projects and has, in that time, developed more than a dozen independent applications architected on these stacks. The cost of independently addressing vulnerabilities, which often requires work to mitigate incompatibilities; reworking each application to comply with developing branding guidelines; and feature development and improvement has burgeoned, threatening to overwhelm our capacity. Like many of our peers5, our maintenance obligations have made continued growth unsustainable and have pushed older applications to near abandonware. We have designed and developed the Digital Libraries Platform to address these concerns thus reducing our maintenance obligations and costs associated with feature development across digital libraries. This approach represents a departure from the monolithic architectures of our legacy systems and, as such, shares more infrastructure among individual digital library implementations. The shared infrastructure facilitates rapid inclusion of new and improved features into each digital library instance. New features can be developed independent of any digital library instance and integrated into that instance by inclusion of that feature in the React/Amplify template. Changes to the template super class, such as those necessitated by evolving branding guidelines, may be immediately inherited by the template instances that subscribe to it. The platform implements Terraform6 deployment templates, Lambda serverless functions, and other cloud assets to form a microservices architecture on which multiple template-based sites are built. Individual sites are configured in AWS DynamoDB, Amazon’s NoSQL database service, and via modification of shared template. Additional services provide digital preservation support including auditing, file fixity validation, replication to external cloud storage providers, file format characterization, and deposit to third-party preservation services. This presentation also discusses the cost of operating these services in AWS and strategies for mitigating those costs. These strategies include containerization to allow deployment of high-cost, asynchronous services to local infrastructure to take full advantage of existing infrastructure and advantageous utility pricing while allowing for local redeployment. In the past, developers worked in local, independent environments. New features and fixes were submitted to a central development environment testing and validation, which significantly slowed development. Migrating development, review, integration, and deployment processes to AWS decreased the time and resource bottlenecks for those processes. Our AWS cost accounting demonstrates an 87% savings over our traditional, on-premises Fedora/Samvera approach For a team of four software developers, the total cost using a traditional server-based (a t2-medium EC2 instance) development approach is about 133 per month versus our serverless-based development approach using AWS Amplify at an average of 17 per month. As the Digital Libraries Platform project expands, we anticipate publishing a set of API documents allowing us and others to reimplement specific microservices independent of the architecture."
},{
  "id": "https://waingram.github.io/publications/ingram2019computational.html",
  "type": "CreativeWork",
  "bibtex_key": "ingram2019computational",
  "bibtex_type": "misc",
  "publication_category": "uncategorized",
  "title": "Bringing Computational Access to Book-length Documents Via an ETD Pilot",
  "authors": ["Ingram, William A."],
  "year": "2019",
  "date_published": "2019",
  "container_title": "CNI: Coalition for Networked Information Fall 2019 Membership Meeting",
  "howpublished": "CNI: Coalition for Networked Information Fall 2019 Membership Meeting",
  "url": "https://www.cni.org/topics/electronic-theses-dissertations-etds/bringing-computational-access-to-book-length-documents-via-an-etd-pilot",
  "abstract": "Virginia Polytechnic Institute and State University (Virginia Tech) Libraries, in collaboration with Virginia Tech Department of Computer Science and Old Dominion University Department of Computer Science, is the recipient of an IMLS National Leadership Grant for Libraries award to fund research into bringing computational access to book-length documents, through a research and piloting effort employing electronic theses and dissertations (ETDs). The three-year project is motivated by the following library and community needs:\n                  (1) Despite huge volumes of book-length documents in digital libraries, there is a lack of models offering effective and efficient computational access to these long documents.\n                  (2) Nationwide open-access services for ETDs generally function at the metadata level. Much important knowledge and scientific data lie hidden in ETDs, and we need better tools to mine the content and facilitate the identification, discovery, and reuse of these important components.\n                  (3) A wide range of audiences can potentially benefit from this research, including but not limited to librarians, students, authors, educators, researchers, and other interested readers.\n                  Our research focuses on extracting and analyzing segments of long documents (chapters, reference lists, tables, figures), as well as methods for automated classification and summarization of individual chapters of longer texts to increase findability. The project brings cutting-edge machine/deep learning technologies to advance discovery, use, and potential for reuse of the knowledge hidden in the text of books and book-length documents. By focusing on libraries’ ETD collections, the research will enhance ETD programs, devising effective and efficient methods for opening the knowledge currently hidden in the rich body of graduate research and scholarship."
},
    {
  "id": "https://waingram.github.io/publications/ingram2020reproducibility.html",
  "type": "CreativeWork",
  "bibtex_key": "ingram2020reproducibility",
  "bibtex_type": "misc",
  "publication_category": "uncategorized",
  "title": "Preparing Code and Data for Computational Reproducibility",
  "authors": ["Ingram, William A.","Fox, Edward A."],
  "year": "2020",
  "date_published": "2020",
  "container_title": "Proceedings of the ACM/IEEE Joint Conference on Digital Libraries in 2020",
  "booktitle": "Proceedings of the ACM/IEEE Joint Conference on Digital Libraries in 2020",
  "series": "JCDL ’20",
  "pages": "565–566",
  "publisher": "Association for Computing Machinery",
  "location": "Virtual Event, China",
  "doi": "10.1145/3383583.3398714",
  "doi_url": "https://doi.org/10.1145/3383583.3398714",
  "url": "https://doi.org/10.1145/3383583.3398714",
  "note": "Half-day tutorial",
  "abstract": "Computational analyses are playing an increasingly central role in research and are a feature of many advanced digital libraries. Journals, sponsors, and researchers, including in the digital library field, are calling for published research to include associated data and code. However, many involved in research have not received training in best practices and tools for building systems (e.g., using containers) and implementing methods that facilitate sharing code and data. This tutorial aims to address this gap in training while also providing those who support researchers with curated best practices guidance and tools.\n               This tutorial is unique compared to other reproducibility events due to its practical, step-by-step design. It is comprised of hands-on exercises to prepare research code and data for computationally reproducible publication. Although the tutorial starts with some brief introductory information about computational reproducibility, the bulk of the tutorial is guided work with data and code. The basic best practices for publishing code and data are covered with curated resources. Examples will include from the digital library and information retrieval domains. Participants move through preparing research for reuse, organization, documentation, automation, and submitting their code and data to share. Tools to support reproducibility will be introduced but all lessons will be platform agnostic."
},{
  "id": "https://waingram.github.io/publications/fox2020digitallibraries.html",
  "type": "CreativeWork",
  "bibtex_key": "fox2020digitallibraries",
  "bibtex_type": "misc",
  "publication_category": "uncategorized",
  "title": "Introduction to Digital Libraries",
  "authors": ["Fox, Edward A.","Ingram, William A."],
  "year": "2020",
  "date_published": "2020",
  "container_title": "Proceedings of the ACM/IEEE Joint Conference on Digital Libraries in 2020",
  "booktitle": "Proceedings of the ACM/IEEE Joint Conference on Digital Libraries in 2020",
  "series": "JCDL ’20",
  "pages": "567–568",
  "publisher": "Association for Computing Machinery",
  "location": "Virtual Event, China",
  "doi": "10.1145/3383583.3398501",
  "doi_url": "https://doi.org/10.1145/3383583.3398501",
  "url": "https://doi.org/10.1145/3383583.3398501",
  "note": "Half-day tutorial",
  "abstract": "This tutorial is a thorough and deep introduction to the Digital Libraries (DL) field, providing a firm foundation: covering key concepts and terminology, as well as services, systems, technologies, methods, standards, projects, issues, and practices. It introduces and builds upon a firm theoretical foundation (starting with the ’5S’ set of intuitive aspects: Streams, Structures, Spaces, Scenarios, Societies), giving careful definitions and explanations of all the key parts of a ’minimal digital library’, and expanding from that basis to cover key DL issues. Illustrations come from a set of case studies, including from multiple current projects, including with webpages, tweets, and social networks. Attendees will be exposed to four Morgan and Claypool books that elaborate on 5S, published 2012–2014. Complementing the coverage of 5S will be an overview of key aspects of the DELOS Reference Model and DL.org activities. Further, new material will be added on building digital libraries using container and cloud services, on developing a digital library for electronic theses and dissertations, and methods to integrate UX and DL design approaches."
},{
  "id": "https://waingram.github.io/publications/ingram2019hands-on.html",
  "type": "CreativeWork",
  "bibtex_key": "ingram2019hands-on",
  "bibtex_type": "misc",
  "publication_category": "uncategorized",
  "title": "Preparing Code and Data for Computational Reproducibility: A Hands-On Workshop",
  "authors": ["Ingram, William A.","Fox, Edward A."],
  "year": "2019",
  "date_published": "2019",
  "container_title": "22nd International Symposium on Electronic Theses and Dissertations",
  "howpublished": "22nd International Symposium on Electronic Theses and Dissertations",
  "url": "http://etd2019.upt.pt/keynote-speakers-guests/",
  "note": "Half-day tutorial",
  "abstract": "Preparing code and data for computational reproducibility: a hands-on workshop.\n                  Moderator: Suzie Allard"
},
    {
  "id": "https://waingram.github.io/publications/ingram2022ai_ethics_framework_part2.html",
  "type": "CreativeWork",
  "bibtex_key": "ingram2022ai_ethics_framework_part2",
  "bibtex_type": "misc",
  "publication_category": "uncategorized",
  "title": "Leading the Future of AI and Public Archives: Toward a Shared AI Ethics Framework",
  "authors": ["William A. Ingram, Virginia Tech University Libraries","Sylvester A. Johnson, Virginia Tech Center for Humanities","Abigail Potter, Library of Congress Labs","Meghan Ferriter, Library of Congress Labs","Rebecca Dikow, Smithsonian OCIO Data Science Lab","Mike Trizna, Smithsonian OCIO Data Science Lab","Jill Reilly, National Archives Office of Innovation"],
  "location": "Virtual Workshop",
  "url": "https://smithsonian.github.io/AIandPublicArchives2022/",
  "note": "Part two of a workshop series aimed at developing a shared AI Ethics Framework for galleries, libraries, archives, and museums. Keynote speakers included Afua Bruce and Isaac Johnson. Activities focused on creating an institutional AI ethics statement and operationalizing AI in LAMs.",
  "abstract": "Part two of a workshop series aimed at developing a shared AI Ethics Framework for galleries, libraries, archives, and museums. Keynote speakers included Afua Bruce and Isaac Johnson. Activities focused on creating an institutional AI ethics statement and operationalizing AI in LAMs."
},{
  "id": "https://waingram.github.io/publications/ingram2022ai_ethics_framework_part1.html",
  "type": "CreativeWork",
  "bibtex_key": "ingram2022ai_ethics_framework_part1",
  "bibtex_type": "misc",
  "publication_category": "uncategorized",
  "title": "Leading the Future of AI and Public Archives",
  "authors": ["William A. Ingram, Virginia Tech University Libraries","Sylvester A. Johnson, Virginia Tech Center for Humanities","Abigail Potter, Library of Congress","Meghan Ferriter, Smithsonian","Jill Reilly, National Archives Office of Innovation"],
  "location": "Virtual Workshop",
  "url": "https://smithsonian.github.io/AIandPublicArchives2022/",
  "note": "Part one of a workshop series aimed at leaders and collaborators from institutions with public digital collections and archives programs. Activities included a leadership roundtable, lessons learned, problem definition, and action priority matrix. Keynote speaker: Elham Tabassi.",
  "abstract": "Part one of a workshop series aimed at leaders and collaborators from institutions with public digital collections and archives programs. Activities included a leadership roundtable, lessons learned, problem definition, and action priority matrix. Keynote speaker: Elham Tabassi."
},{
  "id": "https://waingram.github.io/publications/ingram2021government_archives_workshop.html",
  "type": "CreativeWork",
  "bibtex_key": "ingram2021government_archives_workshop",
  "bibtex_type": "misc",
  "publication_category": "uncategorized",
  "title": "Ensuring Scholarly Access to Government Archives and Records",
  "authors": ["William A. Ingram, Virginia Tech University Libraries","Sylvester A. Johnson, Virginia Tech Center for Humanities"],
  "location": "Virtual Workshop",
  "url": "https://lib.vt.edu/research-teaching/computational-archives-workshop.html",
  "note": "Virginia Tech and NARA convened archivists, librarians, humanists, technologists, and scientists for a set of five weekly workshops to plan for ensuring future access to government records through AI and machine learning. Sponsored by the Andrew W. Mellon Foundation.",
  "abstract": "Virginia Tech and NARA convened archivists, librarians, humanists, technologists, and scientists for a set of five weekly workshops to plan for ensuring future access to government records through AI and machine learning. Sponsored by the Andrew W. Mellon Foundation."
},
    {
  "id": "https://waingram.github.io/publications/obadage2025url.html",
  "type": "ScholarlyArticle",
  "bibtex_key": "obadage2025url",
  "bibtex_type": "article",
  "publication_category": "uncategorized",
  "title": "Toward Robust URL Extraction for Open Science: A Study of arXiv File Formats and Temporal Trends",
  "authors": ["Obadage, Rochana R.","Salsabil, Lamia","Alam, Sawood","Banarjee, Bipasha","Ingram, William A.","Fox, Edward A.","Wu, Jian"],
  "year": "2025",
  "date_published": "2025",
  "container_title": "CoRR",
  "journal": "CoRR",
  "volume": "abs/2509.04759",
  "doi": "10.48550/ARXIV.2509.04759",
  "doi_url": "https://doi.org/10.48550/ARXIV.2509.04759",
  "url": "https://doi.org/10.48550/arXiv.2509.04759",
  "abstract": "In this work, we study how URL extraction results depend on input format. We compiled a pilot dataset by extracting URLs from 10 arXiv papers and used the same heuristic method to extract URLs from four formats derived from the PDF files or the source LaTeX files. We found that accurate and complete URL extraction from any single format or a combination of multiple formats is challenging, with the best F1-score of 0.71. Using the pilot dataset, we evaluate extraction performance across formats and show that structured formats like HTML and XML produce more accurate results than PDFs or Text. Combining multiple formats improves coverage, especially when targeting research-critical resources. We further apply URL extraction on two tasks, namely classifying URLs into open-access datasets and software and the others, and analyzing the trend of URLs usage in arXiv papers from 1992 to 2024. These results suggest that using a combination of multiple formats achieves better performance on URL extraction than a single format, and the number of URLs in arXiv papers has been steadily increasing since 1992 to 2014 and has been drastically increasing from 2014 to 2024. The dataset and the Jupyter notebooks used for the preliminary analysis are publicly available at this https URL: https://github.com/lamps-lab/arxiv-urls"
},{
  "id": "https://waingram.github.io/publications/fox2024structured.html",
  "type": "ScholarlyArticle",
  "bibtex_key": "fox2024structured",
  "bibtex_type": "patent",
  "publication_category": "uncategorized",
  "title": "Structured Document Access for Electronic Documents",
  "authors": ["Fox, Edward A.","Ahuja, Aman","Ingram, William A.","Banerjee, Bipasha","Chekuri, Satvik"],
  "year": "2024",
  "date_published": "2024",
  "url": "https://patents.google.com/patent/US20240289356A1",
  "note": "Patent No. 18/585,685, Filed August 29, 2024",
  "abstract": "Electronic documents of all kinds can be of immense value to the scholarly community. For example, many Electronic Theses and Dissertations (ETDs) are now publicly available online over public and private local and wide area networks, often through one of many digital libraries. However, since a majority of these digital libraries are institutional repositories with an objective being content archiving, they often lack end-user services needed to make this valuable data useful for the scholarly community. To effectively utilize such data to address the information needs of users, digital libraries should support various end-user services such as document search and browsing, document recommendation, as well as services to make navigation of electronic documents easier."
},{
  "id": "https://waingram.github.io/publications/ingram2021govrecords.html",
  "type": "ScholarlyArticle",
  "bibtex_key": "ingram2021govrecords",
  "bibtex_type": "final grant report",
  "publication_category": "uncategorized",
  "title": "Ensuring Scholarly Access to Government Archives and Records",
  "authors": ["Ingram, William A.","Johnson, Sylvester A."],
  "year": "2021",
  "date_published": "2021",
  "container_title": "Virginia Tech",
  "institution": "Virginia Tech",
  "url": "http://hdl.handle.net/10919/108067",
  "note": "Sponsored by The Andrew W. Mellon Foundation.",
  "abstract": "This report summarizes the activities and outcomes of a collaborative planning project supported by The Andrew W. Mellon Foundation and organized by University Libraries at Virginia Tech, in collaboration with Virginia Tech Center for Humanities and the National Archives and Records Administration (NARA). A diverse group of archivists, librarians, humanists, technologists, information scientists, and computer scientists were convened for a five-part online workshop series to discuss and plan how artificial intelligence and machine learning could be used to ensure public access to the massive and ever-growing collection of government records in the NARA digital catalog.\n                 During the workshop, participants identified requirements, developed conceptual models, and discussed a work plan for a subsequent pilot project that would apply state-of-the-art tools and technologies to increase the effectiveness of archival programs and broaden public access to the important content in the NARA catalog. The workshop focused on humanistic and equitability issues of artificial intelligence and developing ethical, human-centered technology that promotes the public good. As such, the topic of intentional mitigation of AI bias was a thread that ran through the entirety of the workshop."
},{
  "id": "https://waingram.github.io/publications/aromando2020classification.html",
  "type": "ScholarlyArticle",
  "bibtex_key": "aromando2020classification",
  "bibtex_type": "term paper",
  "publication_category": "uncategorized",
  "title": "Classification and Extraction of Information from ETD Documents",
  "authors": ["Aromando, John","Banerjee, Bipasha","Ingram, William A.","Jude, Palakh","Kahu, Sampanna"],
  "year": "2020",
  "date_published": "2020",
  "container_title": "Virginia Tech",
  "institution": "Virginia Tech",
  "url": "http://hdl.handle.net/10919/96645",
  "abstract": "In recent years, advances in natural language processing, machine learning, and neural networks have led to powerful tools for digital libraries, allowing library collections to be discovered, used, and reused in exciting new ways. However, these new tools and techniques are not well-adapted to long documents such as electronic theses and dissertations (ETDs). The report describes three areas of study into improving access to ETDs. Our first goal is to use machine learning to automatically assign subject categories to these documents. Our second goal is to employ a neural network approach to parsing bibliographic data from reference strings. Our third goal is to use deep learning to identify and extract figures and their captions from ETDs. We describe the machine learning and natural language processing tools we use for performing multi-label classification of ETD documents. We show how references from ETDs can be parsed into their component parts (e.g., title, author, date) using deep neural networks. Finally, we show that figures can be accurately extracted from a collection of born-digital and scanned ETDs using deep learning."
},{
  "id": "https://waingram.github.io/publications/ahuja2018bigdata.html",
  "type": "ScholarlyArticle",
  "bibtex_key": "ahuja2018bigdata",
  "bibtex_type": "term paper",
  "publication_category": "uncategorized",
  "title": "Big Data Text Summarization: Using Deep Learning to Summarize Theses and Dissertations",
  "authors": ["Ahuja, Naman","Bansal, Ritesh","Ingram, William A.","Jude, Palakh","Kahu, Sampanna","Wang, Xinyue"],
  "year": "2018",
  "date_published": "2018",
  "container_title": "Virginia Tech",
  "institution": "Virginia Tech",
  "url": "http://hdl.handle.net/10919/86406",
  "abstract": "Team 16 in the fall 2018 course \"CS 4984/5984 Big Data Text Summarization,\" in partnership with the University Libraries and the Digital Library Research Laboratory, prepared a corpus of electronic theses and dissertations (ETDs) for students to study natural language processing with the power of state-of-the-art deep learning technology. The ETD corpus is made up of 13,071 doctoral dissertations and 17,890 master theses downloaded from the University Libraries’ VTechWorks system. This particular study is designed to explore big data summarization for ETDs, which is a relatively under-explored area. The result of the project will help to address the difficulty of information extraction from ETD documents, the potential of transfer learning on automatic summarization of ETD chapters, and the quality of state-of-the-art deep learning summarization technologies when applied to the ETD corpus.\n                 The goal of this project is to generate chapter level abstractive summaries for an ETD collection through deep learning. Major challenges of the project include accurately extracting well-formatted chapter text from PDF files, and the lack of labeled data for supervised deep learning models. For PDF processing, we compare two state of the art scholarly PDF data extraction tools, Grobid and Science-Parse, which generate structured documents from which we can further extract metadata and chapter level text. For the second challenge, we perform transfer learning by training supervised learning models on a labeled dataset of Wikipedia articles related to the ETD collection. Our experimental models include Sequence-to-Sequence and Pointer Generator summarization models. Besides supervised models, we also experiment with an unsupervised reinforcement model, Fast Abstractive Summarization-RL.\n                 The general pipeline for our experiments consists of the following steps: PDF data processing and chapter extraction, collecting a training data set of Wikipedia articles, manually creating human generated gold standard summaries for testing and validation, building deep learning models for chapter summarization, evaluating and tuning the models based on results, and then iteratively refining the whole process."
}
  ]
}
