
Toward Automatically Improving Metadata Quality of Electronic Theses and Dissertations at Scale


Abstract

Metadata is crucial for the accessibility, interoperability, and long-term usability of digital objects such as Electronic Theses and Dissertations (ETDs). In large-scale academic repositories, poor metadata quality can significantly impede the discovery and use of resources. This study addresses persistent issues of incomplete and inconsistent ETD metadata collected from U.S. university libraries. However, directly applying machine learning-based error detection and correction models may introduce unwanted errors due to the imperfection of these models. We propose an ETD metadata improvement system (ETDMIS) that mitigates the problem by integrating metadata validation and a version control mechanism. Our system was applied to a dataset of 100,000 U.S. ETDs, resulting in substantial improvements in metadata quality. Scalability was demonstrated by processing the entire dataset efficiently. The original and the enhanced metadata for the 100,000 ETDs are publicly accessible at https://github.com/lamps-lab/ETDMiner/tree/master/Meta100K.
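The idea of pairing validation with version control can be illustrated with a minimal sketch. This is not the ETDMIS implementation: the class `VersionedRecord`, the `validate_year` rule, and the field names are all hypothetical, chosen only to show how a model-proposed correction might be checked before it is accepted and how the earlier version stays recoverable.

```python
from dataclasses import dataclass, field
import re


def validate_year(value: str) -> bool:
    """Hypothetical rule: a four-digit year between 1800 and 2100."""
    return bool(re.fullmatch(r"\d{4}", value)) and 1800 <= int(value) <= 2100


# Hypothetical per-field validators; fields without one accept any non-empty value.
VALIDATORS = {"year": validate_year}


@dataclass
class VersionedRecord:
    """Keeps every accepted version of a metadata record so changes can be undone."""
    versions: list = field(default_factory=list)

    @property
    def current(self) -> dict:
        return self.versions[-1]

    def propose(self, field_name: str, new_value: str) -> bool:
        """Accept a model-proposed value only if it passes validation."""
        check = VALIDATORS.get(field_name, lambda v: bool(v.strip()))
        if not check(new_value):
            return False  # rejected: the current version is left untouched
        updated = dict(self.current)
        updated[field_name] = new_value
        self.versions.append(updated)  # accepted: stored as a new version
        return True

    def rollback(self) -> None:
        """Discard the latest version, but never the original."""
        if len(self.versions) > 1:
            self.versions.pop()


record = VersionedRecord(versions=[{"title": "A Study of X", "year": ""}])
accepted = record.propose("year", "2015")  # plausible value, stored as a new version
rejected = record.propose("year", "20l5")  # malformed value, not stored
```

In this sketch a rejected proposal never alters the record, and an accepted one can still be reverted with `rollback()`, which is the property the abstract attributes to combining validation with version control.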

Citation

Lamia Salsabil, Jian Wu, William A. Ingram, and Edward A. Fox. 2024. “Toward Automatically Improving Metadata Quality of Electronic Theses and Dissertations at Scale.” In 2024 IEEE International Conference on Big Data (BigData ’24), Washington, DC, USA, pp. 8825–8827. Poster Presentation. 10.1109/BigData62323.2024.10825738

BibTeX

@inproceedings{salsabil2024toward,
  author = {Salsabil, Lamia and Wu, Jian and Ingram, William A. and Fox, Edward A.},
  title = {Toward Automatically Improving Metadata Quality of Electronic Theses and Dissertations at Scale},
  booktitle = {2024 {IEEE} International Conference on Big Data},
  series = {BigData '24},
  year = {2024},
  pages = {8825--8827},
  keywords = {Scalability; Machine learning; Metadata; Big Data; Virtual machines; Libraries; Usability; Interoperability; Metadata Quality; ETD; Digital Libraries; Scholarly Big Data},
  doi = {10.1109/BigData62323.2024.10825738},
  publisher = {{IEEE}},
  location = {Washington, DC, USA},
  month = dec,
  date = {15-18},
  issn = {2573-2978},
  podissn = {2639-1589},
  note = {Poster Presentation},
  month_numeric = {12}
}