Research Program

Scholarly AI, digital libraries, and research infrastructure

I study how scientific knowledge can be represented, interpreted, and operationalized by computational systems. My research investigates how scholarly communication can be transformed from documents created for human readers into representations that support computational discovery, reasoning, and scientific reuse.

For more than twenty-five years, I have worked on different manifestations of this problem, including digital libraries, repository interoperability, metadata standards, natural language processing (NLP), information retrieval (IR), computational text analysis, and retrieval-augmented models. Across these areas, I examine how recorded research knowledge can be represented for machine processing without losing provenance, context, or interpretive meaning.

Three Layers of Machine-Usable Knowledge

I organize this work as three linked layers of machine-usable scientific knowledge: access, understanding, and use.

Knowledge Access

The first layer asks how machines find and retrieve research knowledge. My work in repository interoperability, metadata standards, electronic thesis and dissertation (ETD) networks, indexing, document structure extraction, and summarization has made distributed and long-form collections available for computational analysis while preserving provenance and document context.

Knowledge Understanding

The second layer asks how machines recognize what research contributes to a goal. My dissertation addressed this layer through the applied setting of alignment between ETDs and the United Nations' Sustainable Development Goals (SDGs). It treated contribution as an evaluative relation between a research record and a goal rather than as a problem of topic classification.

The dissertation developed a framework for specifying evaluative functions with large language models and distilling those judgments into classification and retrieval systems for large research collections. The result is a computational model of scientific contribution at collection scale.

Knowledge Use

The third layer asks how machines can use research knowledge to support scientific work. My current work extends document structure, retrieval, and contribution modeling toward representations of research processes, methodological context, and workflow-like knowledge. This direction treats dissertations, reports, and related scholarly records as sources for reconstructing how research was carried out, including the methods behind reported claims.

This layer connects the earlier work on access and evaluation to AI-assisted research systems that need structured accounts of methods, evidence, decisions, and sequences of work. The immediate problem is to extract and represent those accounts without collapsing them into keywords or generic summaries.

ETDs as Long-Form Research Records

Electronic Theses and Dissertations (ETDs) are long-form research records within this program. They preserve extended argument, document structure, methods, literature review, data description, and research context at a level of detail that is often condensed in journal publication. My work has focused on making these records available for computational analysis without reducing them to flat metadata or isolated text snippets.

My work on ETDs includes document structure extraction, metadata enrichment, classification, retrieval, and summarization. These methods support retrieval and discovery across large collections while preserving the document context needed to interpret long-form research.

Selected Infrastructure Leadership

My earlier work in repository interoperability, metadata design, national-scale discovery, and ETD networks addressed the same research problem by making scientific and research knowledge accessible to computational systems. It connects to current research on computational access, document structure extraction, retrieval, classification, summarization, and metadata enrichment.

Repository interoperability and OAI-PMH made distributed research outputs harvestable across systems.
The Networked Digital Library of Theses and Dissertations and related ETD work supported national and international coordination around computational access to long-form research records.
Metadata standards, research discovery systems, and digital library services established mechanisms for provenance, aggregation, and cross-system discovery.

Research Projects

From 2019 to 2023, a project funded by the Institute of Museum and Library Services (IMLS) developed methods for automated document structure extraction and classification in ETDs. The project produced methods for transforming unstructured research texts into computationally tractable research objects while preserving document structure and research context.

A subsequent IMLS-funded project (2024-2027) extends those methods by integrating retrieval-augmented models with curated ETD corpora. The project investigates semantic retrieval, contextual classification, text generation constrained by retrieved evidence, and hybrid sparse/dense ranking for institutional research analysis. These projects move from static metadata toward representations that support AI-enabled discovery systems.
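As an illustration of what hybrid sparse/dense ranking involves, the sketch below merges a keyword-based ranking and an embedding-based ranking with reciprocal rank fusion, one common fusion technique. It is a minimal sketch and not the project's implementation: the function name, the document identifiers, and the constant k = 60 are assumptions chosen for the example.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into a single ranking.

    Each document scores 1 / (k + rank) for every list in which it appears,
    so documents ranked well by both retrievers rise to the top.
    The value k=60 is a conventional default, assumed here for illustration.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score (descending); break ties by ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results for one query from a sparse (keyword) retriever
# and a dense (embedding) retriever over an ETD collection.
sparse_ranking = ["etd-12", "etd-07", "etd-31"]
dense_ranking = ["etd-07", "etd-44", "etd-12"]

fused = reciprocal_rank_fusion([sparse_ranking, dense_ranking])
print(fused)
```

Rank-based fusion needs no score calibration between the two retrievers, which is one reason it is often used as a baseline before learned combinations are tried.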

My dissertation research extended this work from access to interpretation by developing computational methods for evaluating research contributions relative to explicit goals. Using SDG alignment as a testbed, the dissertation developed a framework for goal-conditioned evaluation in scholarly collections, treating large language models as explicit evaluative functions whose judgments can be distilled into efficient retrieval and classification systems.

Leadership in Scientific Knowledge Infrastructure

This research applies AI within a longer program of making scientific knowledge usable by computational systems. As Director of the Center for Digital Research and Scholarship (CDRS) at Virginia Tech, I lead efforts that connect digital library practice, research communication, information retrieval, NLP, and AI-enabled research systems, extending long-standing work in discovery and interoperability toward machine-usable science.

Grant Funding

Harnessing ETDs: Pioneering AI-Driven Innovations in Library Service (LG-256638-OLS-24). PI for a 3-year applied research grant that aims to expand the reach and impact of academic library services through integration with Large Language Models (LLMs). Institute of Museum and Library Services, National Leadership Grants for Libraries. 2024. $441,724.
Enhancing Accessibility of Electronic Theses and Dissertations (LG-256693-OLS-24). Co-PI on a 1.5-year planning grant supporting the exploratory phase of enhancing the accessibility of electronic theses and dissertations (ETDs) for blind and low-vision (BLV) library users. Institute of Museum and Library Services, National Leadership Grants for Libraries. 2024. $117,707.
Preserving Open Access Datasets and Software for Sustainable Computational Reproducibility (LG-256694-OLS-24). Senior Personnel on a 3-year applied research project for preserving endangered Open Access Datasets and Software (OADS), i.e., publicly and freely available digital datasets and software packages used for reproducing research results reported in scholarly works. Institute of Museum and Library Services, National Leadership Grants for Libraries. 2024. $564,991.
Ensuring Scholarly Access to Government Records and Archives (1910-07229). Supported a convening of experts on machine-learning techniques for enhancing public access to government records. Andrew W. Mellon Foundation. 2020. $44,000.
Opening Books and the National Corpus of Graduate Research (LG-37-19-0078-19). A research and piloting effort to bring computational access to book-length documents, using Electronic Theses and Dissertations (ETDs). Institute of Museum and Library Services, National Leadership Grants for Libraries. 2019. $505,214.