Toward Automatically Improving Metadata Quality of Electronic Theses and Dissertations at Scale
Abstract
Metadata is crucial for the accessibility, interoperability, and long-term usability of digital objects such as Electronic Theses and Dissertations (ETDs). In large-scale academic repositories, poor metadata quality can significantly impede the discovery and use of resources. This study addresses persistent issues of incomplete and inconsistent ETD metadata collected from U.S. university libraries. However, directly applying machine learning-based error detection and correction models may introduce unwanted errors due to the imperfection of these models. We propose an ETD metadata improvement system (ETDMIS) that mitigates the problem by integrating metadata validation and a version control mechanism. Our system was applied to a dataset of 100,000 U.S. ETDs, resulting in substantial improvements in metadata quality. Scalability was demonstrated by processing the entire dataset efficiently. The original and the enhanced metadata for the 100,000 ETDs are publicly accessible at https://github.com/lamps-lab/ETDMiner/tree/master/Meta100K.
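The idea of pairing validation with version control can be illustrated with a minimal sketch. It is not the ETDMIS implementation: the names `VersionedField`, `valid_year`, and `apply_correction`, and the year rule itself, are hypothetical and chosen only for illustration. A model-proposed value is committed only if it passes a validation rule, and the original value stays in the history so that a bad correction can be traced or rolled back.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class VersionedField:
    """Keeps every accepted value of one metadata field, oldest first."""
    history: List[str] = field(default_factory=list)

    @property
    def current(self) -> Optional[str]:
        # The most recently committed value, or None if nothing is stored.
        return self.history[-1] if self.history else None

    def commit(self, value: str) -> None:
        # Append rather than overwrite, so earlier versions are never lost.
        self.history.append(value)


def valid_year(value: str) -> bool:
    # Hypothetical validation rule: a four-digit year from 1900 to 2099.
    return re.fullmatch(r"(19|20)\d{2}", value) is not None


def apply_correction(
    fld: VersionedField, proposed: str, validator: Callable[[str], bool]
) -> bool:
    """Commit a model-proposed value only if it differs and passes validation."""
    if proposed == fld.current or not validator(proposed):
        return False
    fld.commit(proposed)
    return True


# Original value as harvested, with a letter "l" in place of the digit "1".
year = VersionedField(["20l8"])
accepted = apply_correction(year, "2018", valid_year)  # passes the rule
rejected = apply_correction(year, "3018", valid_year)  # fails the rule
```

In this sketch the rejected proposal leaves the field unchanged, while the accepted one becomes the current value and the harvested original remains available as the first history entry.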
Citation
Lamia Salsabil, Jian Wu, William A. Ingram, and Edward A. Fox. 2024. “Toward Automatically Improving Metadata Quality of Electronic Theses and Dissertations at Scale.” In 2024 IEEE International Conference on Big Data (BigData ’24), Washington, DC, USA, pp. 8825–8827. Poster Presentation. DOI: 10.1109/BigData62323.2024.10825738
BibTeX
@inproceedings{salsabil2024toward,
author = {Salsabil, Lamia and Wu, Jian and Ingram, William A. and Fox, Edward A.},
title = {Toward Automatically Improving Metadata Quality of Electronic Theses and Dissertations at Scale},
booktitle = {2024 {IEEE} International Conference on Big Data},
series = {BigData '24},
year = {2024},
pages = {8825--8827},
keywords = {Scalability; Machine learning; Metadata; Big Data; Virtual machines; Libraries; Usability; Interoperability; Metadata Quality; ETD; Digital Libraries; Scholarly Big Data},
doi = {10.1109/BigData62323.2024.10825738},
publisher = {{IEEE}},
location = {Washington, DC, USA},
month = dec,
date = {15-18},
issn = {2573-2978},
podissn = {2639-1589},
note = {Poster Presentation},
month_numeric = {12}
}