Context-Based URL Classification for Open Access Datasets and Software in Scholarly Documents

Salsabil, Lamia; Obadage, Rochana R.; Banerjee, Bipasha; Abeysinghe, Yasasi; Alam, Sawood; Färber, Michael; Ingram, William; Fox, Edward; Wu, Jian

doi:10.1109/JCDL67857.2025.00031

Abstract

This study presents a novel framework for automatically classifying open-access datasets and software (OADS) URLs in scholarly documents. Accurate classification of OADS-URLs is the first step in investigating the availability and preservability of OADS, which is crucial for open science and computational reproducibility. Our framework, EnSU, uses an ensemble-based approach to classify OADS-URLs by their citation contexts. The ensemble integrates three models: a Supervised Contrastive Learning model, a SciBERT-based model, and a BertGCN model. Our framework distinguishes the resource types (dataset vs. software) and providers (author vs. third-party). To train and evaluate EnSU, we compiled a dataset, OADS-1K, comprising 1,129 manually annotated sentences containing URLs along with their expanded contexts. Our model outperforms all baseline classifiers, including a large language model-based approach, with the best F1-score of 90%. The dataset and source code are publicly available at: https://github.com/lamps-lab/EnSU/tree/main.
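As a rough illustration of how three classifiers might be combined into one decision, the sketch below uses simple majority voting. The abstract does not state how EnSU merges its models' outputs, so the voting rule, the function name `majority_vote`, and the label strings are all assumptions made for illustration; consult the linked repository for the actual method.

```python
from collections import Counter
from typing import Sequence


def majority_vote(predictions: Sequence[str]) -> str:
    """Return the label predicted by the most models.

    Assumption: ties are broken in favor of the earliest-listed model.
    This is an illustrative rule, not necessarily the one EnSU uses.
    """
    if not predictions:
        raise ValueError("predictions must not be empty")
    counts = Counter(predictions)
    top = max(counts.values())
    # Walk the predictions in model order so ties resolve deterministically.
    for label in predictions:
        if counts[label] == top:
            return label
    raise AssertionError("unreachable: at least one label has the top count")


# Hypothetical labels combining resource type and provider.
votes = ["dataset-author", "dataset-author", "software-third-party"]
print(majority_vote(votes))  # prints "dataset-author"
```

With three voters, a strict majority exists whenever at least two models agree; the tie-break only matters when all three disagree.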

Citation

Lamia Salsabil, Rochana R. Obadage, Bipasha Banerjee, Yasasi Abeysinghe, Sawood Alam, Michael Färber, William Ingram, Edward Fox, and Jian Wu. 2025. “Context-Based URL Classification for Open Access Datasets and Software in Scholarly Documents.” In Proceedings of the 2025 ACM/IEEE Joint Conference on Digital Libraries (JCDL ’25), Virtual Event, pp. 197–206. 10.1109/JCDL67857.2025.00031

BibTeX

@inproceedings{salsabil2025contextbased,
  title = {Context-Based URL Classification for Open Access Datasets and Software in Scholarly Documents},
  author = {Salsabil, Lamia and Obadage, Rochana R. and Banerjee, Bipasha and Abeysinghe, Yasasi and Alam, Sawood and F\"{a}rber, Michael and Ingram, William and Fox, Edward and Wu, Jian},
  year = {2025},
  booktitle = {Proceedings of the 2025 ACM/IEEE Joint Conference on Digital Libraries},
  series = {JCDL '25},
  location = {Virtual Event},
  pages = {197--206},
  doi = {10.1109/JCDL67857.2025.00031}
}