Framework to Evaluate Data Curation in Machine Learning | Just Sustainability Design
Just Sustainability Design Technologies for Just and Sustainable Communities

Data Curation in Machine Learning

Evaluation Framework

Our framework is grounded in data curation principles, emphasizing documentation, transparency, and ethical considerations. We began by identifying key aspects of data curation relevant to ML, then iteratively refined the evaluation criteria through internal reviews, guided by digital curation lifecycle models, FAIR data principles, and environmental sustainability and justice considerations.

The framework is made up of a rubric and a toolkit. The rubric consists of 18 elements across five categories. The scope category has 2 elements, ‘context, purpose, motivation’ and ‘requirements’, which emphasize the need for a dataset creation plan and for addressing intrinsic biases. The ethicality and reflexivity category has 4 elements, ‘ethicality’, ‘domain knowledge and data practices’, ‘context awareness’, and ‘environmental footprint’, covering a range of documentation requirements to increase reflection and accountability in the dataset creation process. The data pipeline category has 3 elements, ‘data collection’, ‘data processing’, and ‘data annotation’, prompting reflection on how and why choices were made and their implications. The data quality category has 5 elements, ‘suitability’, ‘representativeness’, ‘authenticity’, ‘reliability’, and ‘structured documentation’, to ensure consideration of a broad set of qualities that affect how well a dataset can be appropriately and responsibly reused. The data management category has 4 elements covering the FAIR principles (findability, accessibility, interoperability, and reusability), included to evaluate the transparency of data management considerations. Each rubric element is assessed against minimum standard criteria (with a score of ‘pass’ or ‘fail’) that detail the expected level of documentation. Elements that pass the minimum standard are also assessed against a standard of excellence (with a score of ‘full’, ‘partial’, or ‘none’).
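To make the rubric's structure concrete, the sketch below encodes the five categories, their 18 elements, and the two-stage scoring described above. It is an illustration only, not part of the published rubric or toolkit: the names `RUBRIC`, `ElementScore`, and `tally_passes` are hypothetical, and the rule that an excellence score must accompany every ‘pass’ is an assumption drawn from the description rather than a documented requirement.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

# Categories and element names as listed in the text above.
RUBRIC: Dict[str, list] = {
    "scope": ["context, purpose, motivation", "requirements"],
    "ethicality and reflexivity": [
        "ethicality",
        "domain knowledge and data practices",
        "context awareness",
        "environmental footprint",
    ],
    "data pipeline": ["data collection", "data processing", "data annotation"],
    "data quality": [
        "suitability",
        "representativeness",
        "authenticity",
        "reliability",
        "structured documentation",
    ],
    "data management": [
        "findability",
        "accessibility",
        "interoperability",
        "reusability",
    ],
}

MINIMUM_SCORES = {"pass", "fail"}
EXCELLENCE_SCORES = {"full", "partial", "none"}


@dataclass(frozen=True)
class ElementScore:
    """Hypothetical record of one element's evaluation."""

    minimum: str                      # 'pass' or 'fail'
    excellence: Optional[str] = None  # only set when minimum == 'pass'

    def __post_init__(self) -> None:
        if self.minimum not in MINIMUM_SCORES:
            raise ValueError(f"unknown minimum score: {self.minimum!r}")
        if self.minimum == "fail" and self.excellence is not None:
            # Excellence is only assessed for elements that pass.
            raise ValueError("a failed element has no excellence score")
        if self.minimum == "pass" and self.excellence not in EXCELLENCE_SCORES:
            # Assumption: every passing element receives an excellence score.
            raise ValueError(f"unknown excellence score: {self.excellence!r}")


def tally_passes(evaluation: Dict[str, ElementScore]) -> Dict[str, Tuple[int, int]]:
    """Return (passed, total) per category; unscored elements count as not passed."""
    result = {}
    for category, elements in RUBRIC.items():
        passed = sum(
            1
            for name in elements
            if name in evaluation and evaluation[name].minimum == "pass"
        )
        result[category] = (passed, len(elements))
    return result
```

A partial evaluation could then be recorded as `{"requirements": ElementScore("pass", "partial"), "ethicality": ElementScore("fail")}` and passed to `tally_passes`. The tally is a convenience for this sketch; the description above does not define any aggregate score.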

The toolkit is a supplementary resource that introduces concepts from data curation and serves as a manual for the rubric. It contains instructions and guidance on how to evaluate datasets and how to interpret specific elements, along with guiding principles, recommendations, FAQs, sample evaluations, a glossary, and further readings.

Below we provide the documents for the rubric, rubric worksheet (blank version to enter comments), toolkit, and a link to the glossary via Google Docs.

Documents

Rubric
Rubric Worksheet
Toolkit
Glossary