A Heuristic Baseline Method for Metadata Extraction from Scanned Electronic Theses and Dissertations

Choudhury, Muntabir Hasan; Wu, Jian; Ingram, William A.; Fox, Edward A.

doi:10.1145/3383583.3398590

Abstract

Extracting metadata from scholarly papers is an important text mining problem. Widely used open-source tools such as GROBID are designed for born-digital scholarly papers but often fail on scanned documents, such as Electronic Theses and Dissertations (ETDs). Here we present preliminary baseline work that uses a heuristic model to extract metadata from the cover pages of scanned ETDs. The process starts by converting scanned pages into images and then into text files using OCR tools. A series of carefully designed regular expressions is then applied, capturing patterns for seven metadata fields: titles, authors, years, degrees, academic programs, institutions, and advisors. The method is evaluated on a ground truth dataset composed of rectified metadata provided by the Virginia Tech and MIT libraries. Our heuristic method achieves an accuracy of up to 97% on the fields of the ETD text files and provides a strong baseline for machine-learning-based methods. To the best of our knowledge, this is the first work attempting to extract metadata from non-born-digital ETDs.
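As a rough illustration of the regular-expression step described in the abstract, the sketch below applies one pattern per field to the OCR text of a cover page. It is not the authors' implementation: the sample text, the pattern set, and the names `SAMPLE_COVER`, `PATTERNS`, and `extract_fields` are all hypothetical, and only four of the seven fields are covered.

```python
import re

# Hypothetical OCR output of an ETD cover page (invented for illustration).
SAMPLE_COVER = """\
A Study of Widget Dynamics
by
Jane Q. Example
Thesis submitted to the Faculty of the
Example State University
in partial fulfillment of the requirements for the degree of
Master of Science
in
Computer Science
Advisor: John R. Sample
May 1987
"""

# Illustrative patterns only (assumptions); the expressions used in the
# paper are not reproduced on this page.
PATTERNS = {
    # A four-digit year in the 1900s or 2000s.
    "year": re.compile(r"\b(19\d{2}|20\d{2})\b"),
    # A small, non-exhaustive list of degree names.
    "degree": re.compile(
        r"\b(Doctor of Philosophy|Master of (?:Science|Arts))\b",
        re.IGNORECASE,
    ),
    # A line that starts with an advisor-like label followed by a name.
    "advisor": re.compile(
        r"^(?:Advisor|Chair(?:man)?|Supervisor)\s*:\s*(.+)$",
        re.IGNORECASE | re.MULTILINE,
    ),
    # The line that follows a standalone "by" line.
    "author": re.compile(r"^by\s*\n\s*(.+)$", re.IGNORECASE | re.MULTILINE),
}


def extract_fields(text):
    """Return the first match for each field, or None when nothing matches."""
    fields = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        fields[name] = match.group(1).strip() if match else None
    return fields


if __name__ == "__main__":
    for field, value in extract_fields(SAMPLE_COVER).items():
        print(f"{field}: {value}")
```

Real OCR output is noisier than this sample (misread characters, broken lines, inconsistent labels), so patterns of this kind would likely need per-institution tuning and some tolerance for OCR errors.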

Citation

Muntabir Hasan Choudhury, Jian Wu, William A. Ingram, and Edward A. Fox. 2020. “A Heuristic Baseline Method for Metadata Extraction from Scanned Electronic Theses and Dissertations.” In Proceedings of the ACM/IEEE Joint Conference on Digital Libraries in 2020 (JCDL ’20), Virtual Event, China, pp. 515–516. 10.1145/3383583.3398590

BibTeX

@inproceedings{choudhury2020heuristic,
  author = {Choudhury, Muntabir Hasan and Wu, Jian and Ingram, William A. and Fox, Edward A.},
  title = {A Heuristic Baseline Method for Metadata Extraction from Scanned Electronic Theses and Dissertations},
  booktitle = {Proceedings of the ACM/IEEE Joint Conference on Digital Libraries in 2020},
  year = {2020},
  series = {JCDL '20},
  address = {New York, NY, USA},
  pages = {515--516},
  doi = {10.1145/3383583.3398590},
  isbn = {9781450375856},
  publisher = {Association for Computing Machinery},
  location = {Virtual Event, China}
}