My research explores the application of computational methods to large-scale digital collections held by libraries and archives. Specifically, I work within the fields of digital libraries and information retrieval to mine knowledge from electronic theses and dissertations.
Despite the large number of Electronic Theses and Dissertations (ETDs) publicly available online, research done by graduate students is underutilized. We lack the computational models, tools, and services for discovering and accessing the knowledge buried in these long documents. ETDs contain novel ideas and findings that make significant contributions to the students' subject areas. They often include extensive bibliographies and literature reviews, as well as useful graphs, figures, and tables. We need better tools to mine this content and to facilitate the identification, discovery, and reuse of these components.
To address this problem, this project develops text analytics, natural language processing, and information extraction methods to identify and extract the key components of ETDs. We investigate techniques and build predictive models to automatically classify and summarize the extracted components. In doing so, we aim to answer the following research questions: (1) How can we effectively identify and extract key parts of ETDs, such as chapters, literature reviews, bibliographies, graphs, tables, and figures? (2) How can we develop effective classification and summarization services for ETDs at the chapter level? (3) How can we use these services to enrich the user experience for digital libraries of ETDs?
Text analytics offers a novel way to connect text with language understanding. By investigating analytical methods for extracting, classifying, and summarizing the knowledge contained in ETDs, our research demonstrates how intensive computational analysis of digital collections can provide more effective access to book-length documents, increase the impact of graduate research, and help libraries meet the evolving needs of the communities they serve.
Despite the huge number of books held in digital libraries, there is a lack of computational models, tools, and services for discovering and accessing the knowledge they contain. Existing services are limited to basic metadata and full-text search. We need better tools to mine the knowledge and scientific data buried inside books and other book-length documents, such as theses and dissertations. Theses and dissertations written by graduate students are deposited in digital libraries, but they are not read or cited as much as shorter forms of research output, such as journal articles and conference proceedings. More generally, the needs of students, researchers, and others in the academic community are evolving quickly as digital technology advances, and libraries struggle to keep pace with the communities they serve.