Context-Based URL Classification for Open Access Datasets and Software in Scholarly Documents
Abstract
This study presents a novel framework for automatically classifying open-access datasets and software (OADS) URLs in scholarly documents. Accurate classification of OADS-URLs is the first step in investigating the availability and preservability of OADS, a crucial step toward open science and computational reproducibility. Our framework, EnSU, leverages an ensemble-based approach to classify OADS-URLs by their citation contexts. The ensemble integrates three models: a Supervised Contrastive Learning model, a SciBERT-based model, and a BertGCN model. Our framework distinguishes the resource types (dataset vs. software) and providers (author vs. third-party). To train and evaluate EnSU, we compiled a dataset, OADS-1K, comprising 1,129 manually annotated sentences containing URLs along with their expanded contexts. Our model outperforms all baseline classifiers, including a large language model-based approach, with the best F1-score of 90%. The dataset and source code are publicly available at: https://github.com/lamps-lab/EnSU/tree/main.
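As a rough illustration of how an ensemble over citation contexts might combine its members, the sketch below applies simple majority voting to three stand-in classifiers. The abstract does not state how EnSU merges its three models, so the voting rule, the joint label names in `LABELS`, the `majority_vote` function, and the keyword-based stubs are all assumptions made for illustration; they are not taken from the EnSU source code.

```python
from collections import Counter
from typing import Callable, Sequence

# Hypothetical joint labels: resource type x provider, as named in the abstract.
LABELS = (
    "dataset/author",
    "dataset/third-party",
    "software/author",
    "software/third-party",
)

# A classifier maps a citation context (sentence around the URL) to one label.
Classifier = Callable[[str], str]


def majority_vote(context: str, classifiers: Sequence[Classifier]) -> str:
    """Return the label most classifiers agree on.

    Ties are broken in favor of the earliest classifier in the sequence
    (an arbitrary choice made for this sketch).
    """
    if not classifiers:
        raise ValueError("at least one classifier is required")
    votes = [clf(context) for clf in classifiers]
    counts = Counter(votes)
    top = max(counts.values())
    for vote in votes:
        if counts[vote] == top:
            return vote
    raise AssertionError("unreachable: votes is non-empty")


# Toy stand-ins for the three trained models; real members would be
# neural classifiers, not keyword rules.
def stub_keyword(context: str) -> str:
    return "software/author" if "code" in context.lower() else "dataset/third-party"


def stub_constant(context: str) -> str:
    return "dataset/third-party"


def stub_pronoun(context: str) -> str:
    return "software/author" if "our" in context.lower() else "software/third-party"


if __name__ == "__main__":
    sentence = "Our code is available at https://example.org/repo."
    print(majority_vote(sentence, [stub_keyword, stub_constant, stub_pronoun]))
```

Majority voting is only one possible combination rule; averaging class probabilities or training a meta-classifier on the member outputs are common alternatives.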
Citation
2025. “Context-Based URL Classification for Open Access Datasets and Software in Scholarly Documents.” In Proceedings of the 2025 ACM/IEEE Joint Conference on Digital Libraries (JCDL ’25), Virtual Event, pp. 197–206. 10.1109/JCDL67857.2025.00031
BibTeX
@inproceedings{salsabil2025contextbased,
title = {Context-Based URL Classification for Open Access Datasets and Software in Scholarly Documents},
author = {Salsabil, Lamia and Obadage, Rochana R. and Banerjee, Bipasha and Abeysinghe, Yasasi and Alam, Sawood and F\"{a}rber, Michael and Ingram, William and Fox, Edward and Wu, Jian},
year = {2025},
booktitle = {Proceedings of the 2025 ACM/IEEE Joint Conference on Digital Libraries},
series = {JCDL '25},
location = {Virtual Event},
pages = {197--206},
doi = {10.1109/JCDL67857.2025.00031}
}