Natural Language Processing and Information Retrieval for Scholarly Text Analysis
My research focuses on the intersection of Natural Language
Processing (NLP) and Information Retrieval (IR) in the
context of Electronic Theses and Dissertations (ETDs). NLP
provides methods for structuring, classifying, and summarizing textual data, while IR enables
scalable indexing, retrieval, and ranking of relevant content. Together, these fields support
computational approaches for identifying research contributions, improving metadata quality, and
facilitating more effective access to ETD collections.
Computational Methods for Document Structure and Classification
From 2019 to 2023, an IMLS-funded project developed methods for automated document structure
extraction and classification in ETDs. This work applied NLP models for:
Segmenting long-form academic texts into structured components, including chapters,
sections, figures, and tables.
Classifying text at multiple levels to improve metadata representation and retrieval.
Generating chapter-level summaries to support efficient navigation of dissertation content.
Developing a prototype digital library system that integrates these classification and
summarization methods into ETD collections.
These approaches addressed the challenges of working with unstructured academic texts,
establishing methods for large-scale processing of ETDs while
maintaining their integrity as structured research objects.
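As a rough illustration of the segmentation step only, the sketch below splits a plain-text dissertation into chapters by matching heading lines. It is a rule-based stand-in and not the project's models: the `HEADING` pattern, the `split_chapters` function, and the "front matter" label are hypothetical names chosen for this example, and it assumes headings appear on their own line in the form "Chapter N ...".

```python
import re

# Assumption: chapter headings sit on their own line and look like "Chapter 3 Methods".
HEADING = re.compile(r"^(chapter\s+\d+\b.*)$", re.IGNORECASE | re.MULTILINE)


def split_chapters(text):
    """Return a list of (heading, body) pairs in document order.

    Text that precedes the first heading is labeled "front matter".
    """
    matches = list(HEADING.finditer(text))
    if not matches:
        return [("front matter", text.strip())]

    parts = []
    preamble = text[:matches[0].start()].strip()
    if preamble:
        parts.append(("front matter", preamble))

    for i, match in enumerate(matches):
        # A chapter body runs from the end of its heading to the next heading.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        parts.append((match.group(1).strip(), text[match.end():end].strip()))
    return parts
```

A pattern like this also matches body lines that happen to begin with "Chapter 3 ...", which is one reason learned models are preferable for real ETDs with inconsistent formatting.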
Retrieval-Augmented NLP for SDG Classification and Synthesis
A subsequent IMLS-funded project (2024–2027) builds on these methods by integrating retrieval-augmented models with a curated ETD
corpus. This project investigates:
Retrieval-augmented generation (RAG) for incorporating external knowledge from ETDs into
language model outputs.
Semantic representations of research contributions through transformer-based classifiers
trained to recognize alignment with the United Nations Sustainable
Development Goals (SDGs).
Transfer learning techniques for adapting classification
models trained on journal literature to the distinct characteristics of ETDs.
Hybrid retrieval models that integrate sparse (BM25) and dense (neural) retrieval for more
precise identification of SDG-related research.
Text generation methods for synthesizing research narratives from retrieved ETD content,
providing structured summaries of institutional research contributions.
This work applies IR techniques to enhance NLP-driven classification and text synthesis, addressing the
limitations of keyword-based SDG identification by introducing semantic
retrieval and contextual classification models. The integration of retrieval-based
ranking and language modeling supports research applications that extend beyond static metadata,
allowing for more nuanced representations of academic work.
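To make the idea of hybrid retrieval concrete, the following minimal sketch combines a sparse BM25 ranking with a dense similarity ranking. Several parts are assumptions made for illustration and are not drawn from the project: reciprocal rank fusion (RRF) is one common way to merge rankings and is used here as an example, the BM25 variant uses a non-negative (Lucene-style) IDF, and `trigram_vector` is a toy character-trigram stand-in for a neural encoder so that the example runs without external libraries. All function names are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Sparse retrieval: Okapi BM25 score of each document for the query."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    doc_freq = Counter(term for t in tokenized for term in set(t))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            df = doc_freq[term]
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


def trigram_vector(text):
    """Toy stand-in for a neural encoder: character-trigram counts."""
    s = text.lower()
    return Counter(s[i:i + 3] for i in range(len(s) - 2))


def cosine(u, v):
    dot = sum(u[key] * v[key] for key in u if key in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def dense_scores(query, docs):
    """Dense retrieval: cosine similarity between query and document vectors."""
    q = trigram_vector(query)
    return [cosine(q, trigram_vector(d)) for d in docs]


def reciprocal_rank_fusion(score_lists, k=60):
    """Merge rankings by summing 1 / (k + rank) over each ranker."""
    fused = [0.0] * len(score_lists[0])
    for scores in score_lists:
        order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        for rank, idx in enumerate(order, start=1):
            fused[idx] += 1.0 / (k + rank)
    return fused


def hybrid_rank(query, docs):
    """Return document indices ordered from most to least relevant."""
    fused = reciprocal_rank_fusion(
        [bm25_scores(query, docs), dense_scores(query, docs)]
    )
    return sorted(range(len(docs)), key=lambda i: fused[i], reverse=True)
```

Calling `hybrid_rank("water sanitation", abstracts)` on a list of abstract strings returns the indices of the abstracts in ranked order. Fusing ranks rather than raw scores avoids having to calibrate BM25 scores against cosine similarities, which live on different scales.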
Artificial Intelligence for Scholarly Knowledge Systems
The methodological contributions of this research inform the development of computational
models for scholarly information systems, with applications in document retrieval,
classification, and synthesis. By combining retrieval-based ranking, transformer-based text
classification, and generative summarization methods, this work develops computational
approaches for structuring and analyzing large-scale ETD
collections. As Director of the Center for Digital Research and Scholarship (CDRS) at Virginia
Tech, I lead efforts to integrate AI-driven methods into digital scholarship, supporting new models of scholarly
communication and research analysis.
Grant Funding
Harnessing ETDs: Pioneering AI-Driven Innovations in Library Service (LG-256638-OLS-24).
PI for a 3-year applied research grant that aims to expand the reach and impact of academic
library services through integration with Large Language Models (LLMs). Institute of Museum and
Library Services, National Leadership Grants for Libraries. 2024. $441,724.
Enhancing Accessibility of Electronic Theses and Dissertations (LG-256693-OLS-24).
Co-PI on a 1.5-year planning grant from the National Leadership Grants for Libraries program to
support the exploratory phase of enhancing the accessibility of electronic theses and
dissertations (ETDs) for blind and low-vision (BLV) library users. Institute of Museum and
Library Services, National Leadership Grants for Libraries. 2024. $117,707.
Preserving Open Access Datasets and Software for Sustainable Computational Reproducibility
(LG-256694-OLS-24). Senior Personnel on a 3-year Applied Research project for preserving
endangered Open Access Datasets and Software (OADS), i.e., publicly and freely available digital
datasets and software packages used for reproducing research results reported in scholarly
works. Institute of Museum and Library Services, National Leadership Grants for Libraries. 2024.
$564,991.
Opening Books and the National Corpus of Graduate Research (LG-37-19-0078-19).
A research and piloting effort employing Electronic Theses and Dissertations (ETDs) to bring
computational access to book-length documents. Institute of Museum and
Library Services, National Leadership Grants for Libraries. 2019. $505,214.