A Heuristic Baseline Method for Metadata Extraction from Scanned Electronic Theses and Dissertations

Abstract

Extracting metadata from scholarly papers is an important text mining problem. Widely used open-source tools such as GROBID are designed for born-digital scholarly papers but often fail on scanned documents, such as Electronic Theses and Dissertations (ETDs). Here we present preliminary baseline work that uses a heuristic model to extract metadata from the cover pages of scanned ETDs. The process starts by converting scanned pages into images and then into text files using OCR tools. A series of carefully designed regular expressions is then applied, capturing patterns for seven metadata fields: titles, authors, years, degrees, academic programs, institutions, and advisors. The method is evaluated on a ground truth dataset composed of rectified metadata provided by the Virginia Tech and MIT libraries. Our heuristic method achieves an accuracy of up to 97% on the fields of the ETD text files. Our method provides a strong baseline for machine learning-based methods. To the best of our knowledge, this is the first work attempting to extract metadata from non-born-digital ETDs.
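The regular-expression step described in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the patterns, the `extract_fields` helper, and the `SAMPLE` cover page are hypothetical and are not the expressions or data used in the paper. The sketch assumes OCR has already produced plain text, and it covers only six of the seven fields (academic program is omitted).

```python
import re

# Hypothetical patterns for illustration only; not the paper's expressions.
AUTHOR_RE = re.compile(r"^\s*by\s*\n\s*(?P<author>[^\n]+)",
                       re.IGNORECASE | re.MULTILINE)
YEAR_RE = re.compile(r"\b(?:19|20)\d{2}\b")
DEGREE_RE = re.compile(
    r"\b(?:Doctor of Philosophy|Master of (?:Science|Arts|Engineering))\b",
    re.IGNORECASE)
ADVISOR_RE = re.compile(r"\b(?:Advisor|Supervisor)\s*:\s*(?P<advisor>[^\n]+)",
                        re.IGNORECASE)
INSTITUTION_RE = re.compile(r"^[^\n]*\b(?:University|Institute)\b[^\n]*$",
                            re.MULTILINE)


def extract_fields(cover_text):
    """Apply one pattern per field to OCR'd cover-page text.

    Returns a dict in which fields that did not match are None.
    """
    fields = dict.fromkeys(
        ["title", "author", "year", "degree", "institution", "advisor"])

    author = AUTHOR_RE.search(cover_text)
    if author:
        fields["author"] = author.group("author").strip()
        # Assumed heuristic: the title is whatever precedes the "by" line.
        title = " ".join(cover_text[:author.start()].split())
        fields["title"] = title or None

    years = YEAR_RE.findall(cover_text)
    if years:
        # Assumed heuristic: the last year on the page is the submission year.
        fields["year"] = years[-1]

    degree = DEGREE_RE.search(cover_text)
    if degree:
        fields["degree"] = degree.group(0)

    institution = INSTITUTION_RE.search(cover_text)
    if institution:
        fields["institution"] = institution.group(0).strip()

    advisor = ADVISOR_RE.search(cover_text)
    if advisor:
        fields["advisor"] = advisor.group("advisor").strip()

    return fields


# Made-up cover page standing in for OCR output.
SAMPLE = """A STUDY OF HEAT TRANSFER IN THIN FILMS
by
Jane Q. Example

Thesis submitted to the Faculty of the
Example State University
in partial fulfillment of the requirements for the degree of
Master of Science
in
Mechanical Engineering

Advisor: John R. Sample

May 1987
"""

if __name__ == "__main__":
    for name, value in extract_fields(SAMPLE).items():
        print(f"{name}: {value}")
```

Real OCR output is noisier than this sample, with misread characters, broken lines, and varied cover-page layouts, so a practical set of expressions would need to tolerate far more variation than these patterns do.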

Citation

Muntabir Hasan Choudhury, Jian Wu, William A. Ingram, and Edward A. Fox. 2020. “A Heuristic Baseline Method for Metadata Extraction from Scanned Electronic Theses and Dissertations.” In Proceedings of the ACM/IEEE Joint Conference on Digital Libraries in 2020 (JCDL ’20), Virtual Event, China, pp. 515–516. 10.1145/3383583.3398590

BibTeX

@inproceedings{choudhury2020heuristic,
  author = {Choudhury, Muntabir Hasan and Wu, Jian and Ingram, William A. and Fox, Edward A.},
  title = {A Heuristic Baseline Method for Metadata Extraction from Scanned Electronic Theses and Dissertations},
  booktitle = {Proceedings of the ACM/IEEE Joint Conference on Digital Libraries in 2020},
  year = {2020},
  series = {JCDL '20},
  address = {New York, NY, USA},
  pages = {515--516},
  doi = {10.1145/3383583.3398590},
  isbn = {9781450375856},
  publisher = {Association for Computing Machinery},
  location = {Virtual Event, China}
}