NeurIPS Datasets and Benchmarks 2024
Data curation is a field with origins in librarianship and archives, whose scholarship and thinking on data issues go back centuries, if not millennia. The field of machine learning is increasingly recognizing the importance of data curation to the advancement of both applications and fundamental understanding of machine learning models, as evidenced not least by the creation of the Datasets and Benchmarks track itself. Our work provides an analysis of recent dataset development practices at NeurIPS through the lens of data curation. We present an evaluation framework for dataset documentation, consisting of a rubric and toolkit developed through a thorough literature review of data curation principles. We use the framework to systematically assess the strengths and weaknesses in the dataset development practices of 60 datasets published in the NeurIPS Datasets and Benchmarks track from 2021 to 2023. Our results indicate a greater need for documentation about environmental footprint, ethical considerations, and data management. We suggest targeted strategies and resources to improve documentation in these areas and provide recommendations for the NeurIPS peer-review process that prioritize rigorous data curation in ML.
This work was accepted as a spotlight poster in the Datasets & Benchmarks track of the Thirty-Eighth Annual Conference on Neural Information Processing Systems (NeurIPS). The paper can be found here and the recorded talk is available on YouTube.