Research Program

Natural Language Processing and Information Retrieval for Scholarly Text Analysis

My research focuses on the intersection of Natural Language Processing (NLP) and Information Retrieval (IR) in the context of Electronic Theses and Dissertations (ETDs). NLP provides methods for structuring, classifying, and summarizing textual data, while IR enables scalable indexing, retrieval, and ranking of relevant content. Together, these fields support computational approaches for identifying research contributions, improving metadata quality, and facilitating more effective access to ETD collections.

Computational Methods for Document Structure and Classification

From 2019 to 2023, an IMLS-funded project developed methods for automated document structure extraction and classification in ETDs. This work applied NLP models for:

  • Segmenting long-form academic texts into structured components, including chapters, sections, figures, and tables.
  • Classifying text at multiple levels to improve metadata representation and retrieval.
  • Generating chapter-level summaries to support efficient navigation of dissertation content.
  • Developing a prototype digital library system that integrates these classification and summarization methods into ETD collections.

These approaches addressed the challenges of working with unstructured academic texts, establishing methods for large-scale processing of ETDs while maintaining their integrity as structured research objects.

Retrieval-Augmented NLP for SDG Classification and Synthesis

A subsequent IMLS-funded project (2024–2027) builds on these methods by integrating retrieval-augmented models with a curated ETD corpus. This project investigates:

  • Retrieval-augmented generation (RAG) for incorporating external knowledge from ETDs into language model outputs.
  • Semantic representations of research contributions through transformer-based classifiers trained to recognize alignment with the United Nations Sustainable Development Goals (SDGs).
  • Transfer learning techniques for adapting classification models trained on journal literature to the distinct characteristics of ETDs.
  • Hybrid retrieval models that integrate sparse (BM25) and dense (neural) retrieval for more precise identification of SDG-related research.
  • Text generation methods for synthesizing research narratives from retrieved ETD content, providing structured summaries of institutional research contributions.

This work applies IR techniques to enhance NLP-driven classification and text synthesis, addressing the limitations of keyword-based SDG identification by introducing semantic retrieval and contextual classification models. The integration of retrieval-based ranking and language modeling supports research applications that extend beyond static metadata, allowing for more nuanced representations of academic work.
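As a rough illustration of the hybrid retrieval idea mentioned above, the following minimal sketch ranks a toy collection with BM25 and then fuses that ranking with a dense ranking. Several things here are assumptions for illustration only and are not taken from the project: the four sample abstracts, the hard-coded `dense_scores` (standing in for similarities that a neural encoder would produce), the BM25 parameters `k1` and `b`, and the use of reciprocal rank fusion as the combination step. The project's actual fusion method may differ.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document against the query with Okapi BM25."""
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def ranking(scores):
    """Return document indices ordered from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse rankings by summing 1 / (k + rank) for each document."""
    fused = Counter()
    for r in rankings:
        for rank, doc_id in enumerate(r, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    return sorted(fused, key=lambda i: fused[i], reverse=True)


# Hypothetical ETD abstract snippets, tokenized on whitespace.
docs = [
    "solar energy storage for rural electrification".split(),
    "clean water access and sanitation policy".split(),
    "affordable renewable energy in developing regions".split(),
    "photovoltaic microgrids for off-grid villages".split(),
]
query = "renewable energy".split()

sparse = bm25_scores(query, docs)
# Assumed values: placeholders for query-document similarities from a
# neural encoder, which is not implemented in this sketch.
dense_scores = [0.55, 0.05, 0.80, 0.70]

fused = reciprocal_rank_fusion([ranking(sparse), ranking(dense_scores)])
print(fused)
```

In this toy setup the fourth document shares no terms with the query, so BM25 alone gives it a score of zero. The dense ranking is what moves it ahead of the unrelated water-policy document in the fused list, which is the kind of gap in keyword-based identification that semantic retrieval is meant to close.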

Artificial Intelligence for Scholarly Knowledge Systems

The methodological contributions of this research inform the development of computational models for scholarly information systems, with applications in document retrieval, classification, and synthesis. By combining retrieval-based ranking, transformer-based text classification, and generative summarization methods, this work develops computational approaches for structuring and analyzing large-scale ETD collections. As Director of the Center for Digital Research and Scholarship (CDRS) at Virginia Tech, I lead efforts to integrate AI-driven methods into digital scholarship, supporting new models of scholarly communication and research analysis.

Grant Funding

  • Harnessing ETDs: Pioneering AI-Driven Innovations in Library Service (LG-256638-OLS-24). PI for a 3-year applied research grant that aims to expand the reach and impact of academic library services through integration with Large Language Models (LLMs). Institute of Museum and Library Services, National Leadership Grants—Libraries. 2024. $441,724.
  • Enhancing Accessibility of Electronic Theses and Dissertations (LG-256693-OLS-24). Co-PI on a 1.5-year planning grant from the National Leadership Grants for Libraries program to support the exploratory phase of enhancing the accessibility of electronic theses and dissertations (ETDs) for blind and low-vision (BLV) library users. Institute of Museum and Library Services, National Leadership Grants—Libraries. 2024. $117,707.
  • Preserving Open Access Datasets and Software for Sustainable Computational Reproducibility (LG-256694-OLS-24). Senior Personnel on a 3-year applied research project for preserving endangered Open Access Datasets and Software (OADS), i.e., publicly and freely available digital datasets and software packages used for reproducing research results reported in scholarly works. Institute of Museum and Library Services, National Leadership Grants—Libraries. 2024. $564,991.
  • Ensuring Scholarly Access to Government Records and Archives (1910-07229) to support a convening of experts to address machine-learning techniques to enhance public access to government records. Andrew W. Mellon Foundation. 2020. $44,000.
  • Opening Books and the National Corpus of Graduate Research (LG-37-19-0078-19) to bring computational access to book-length documents, through a research and piloting effort employing Electronic Theses and Dissertations (ETDs). Institute of Museum and Library Services, National Leadership Grants—Libraries. 2019. $505,214.