
Data Curation in Machine Learning

Project Overview

Previous research in ML has established that bias in models often leads to discriminatory or unethical judgements about specific populations. These biases are widely attributed to choices made about the datasets used to train ML models. Datasets are often reused for tasks outside the context they were originally created for, and appropriate data use is further hindered by the hidden, tacit, and undervalued nature of data work: the many activities involved in collecting, selecting, and combining datasets, as well as documenting, evaluating, sharing, reusing, and repurposing them. To address these issues, the study of data practices has become prominent. One branch of this work discusses adopting principles from archival studies and digital curation into dataset development processes for machine learning research. Data curation is a mature field with origins in librarianship and archives, whose scholarship and thinking on data issues go back many centuries. It has found new relevance today as the field of machine learning recognizes the importance of data curation to the advancement of both applications and the fundamental understanding of machine learning models.

In our work, we first established how ML dataset development processes can apply principles and methods from data curation in practice. To connect the two fields, we designed an evaluation tool, composed of a toolkit and a rubric, to identify gaps and overlaps between them. The rubric evaluates the documentation of a dataset's contents, as well as the design decisions made in the process of developing the dataset, against criteria adapted from digital and data curation, library, and archival studies. Our work translates data curation knowledge into clear terms for ML and illustrates its importance to dataset development and evaluation, so that researchers can more rigorously curate their own datasets and assess others' datasets for use, reuse, and reproducibility.
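As a purely illustrative sketch (not taken from our papers), the snippet below shows one way a rubric-based assessment of dataset documentation could be represented in code: a set of curation-inspired criteria, each scored against the evidence present in a dataset's documentation. The criterion names, scoring levels, and the `DatasetDocumentation` structure are all hypothetical and stand in for the actual rubric.

```python
from dataclasses import dataclass, field

# Hypothetical scoring levels for each rubric criterion
# (not the scale used in the actual rubric).
SCORES = {"absent": 0, "partial": 1, "complete": 2}

@dataclass
class RubricCriterion:
    """One curation-inspired criterion, e.g. provenance or licensing."""
    name: str
    description: str

@dataclass
class DatasetDocumentation:
    """Minimal stand-in for a dataset's accompanying documentation."""
    name: str
    # Maps criterion name -> "absent" | "partial" | "complete"
    evidence: dict = field(default_factory=dict)

# Hypothetical criteria loosely inspired by data curation practice.
CRITERIA = [
    RubricCriterion("provenance", "Is the origin of the data documented?"),
    RubricCriterion("licensing", "Are terms of use and reuse stated?"),
    RubricCriterion("collection_process", "Is the collection method described?"),
]

def assess(doc: DatasetDocumentation) -> dict:
    """Score a dataset's documentation against each rubric criterion."""
    return {c.name: SCORES.get(doc.evidence.get(c.name, "absent"), 0)
            for c in CRITERIA}

# Example usage with a fictional dataset entry.
doc = DatasetDocumentation(
    name="example-benchmark",
    evidence={"provenance": "complete", "licensing": "partial"},
)
print(assess(doc))  # {'provenance': 2, 'licensing': 1, 'collection_process': 0}
```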

In the next phase, we applied the rubric to sample datasets from NeurIPS, the Conference on Neural Information Processing Systems, to determine the feasibility and relevance of assessing ML data work from a data curation perspective. We then examined how well data curation is currently performed in ML through a larger evaluation of NeurIPS datasets.

You can read the full papers for more details about the first and second phases of the project, or check out our recorded talks.