Just Sustainability Design: Technologies for Just and Sustainable Communities

Research Collaboration Opportunity: Data Curation in Machine Learning

Are you interested in examining and improving your data curation practices?

Summary: We are looking to collaborate with an ML group (in industry or academia) that wants to review and refine its data practices, with the aim of improving the documentation and reflexivity of its ML data workflow and achieving better ML model outcomes. Depending on your interests, this collaboration could take different shapes: jointly evaluating datasets, organizing workshops, conducting action research, or other activities. We provide more information and outline some options below.

Context: ML research has established that bias in models often causes discriminatory or unethical judgements towards specific populations. These biases are widely attributed to choices made about the datasets used to train ML models. Datasets are often reused for tasks outside the context they were originally created for, and appropriate data use is further hindered by the hidden, tacit, and undervalued nature of data work. Adopting principles from archival studies and digital curation in dataset development for machine learning research has been identified as one way to address these concerns. Data curation is a mature field with origins in librarianship and archives, whose scholarship and thinking on data issues go back many centuries. It has found new relevance today as the machine learning field recognizes the importance of data curation to advancing both applications and the fundamental understanding of machine learning models. In practice, however, these concepts have been difficult to adopt because they do not translate easily to ML dataset development.

Background: This project builds on our previous work on applying data curation principles to ML dataset development. We translated data curation principles and concepts for the ML dataset development context and developed an evaluation framework (composed of a toolkit and a rubric) for dataset documentation in ML, establishing how these principles can be applied in practice. We presented first findings and a sample set of evaluations of ML datasets in a FAccT 2024 paper (see talk and paper). We then examined how well data curation is currently performed in ML through a larger evaluation of NeurIPS datasets (accepted to the NeurIPS Datasets and Benchmarks (D&B) track in 2024, see talk and paper). Our findings lead to suggestions on where data curation in ML can be improved.

Research Opportunities: We are currently looking to collaborate with ML groups (in industry or academia) that want to improve their data pipelines and practices and are open to exploring how our framework can be adopted into their workflows. Our goal is to help improve data practices by applying data curation concepts. This collaborative study can take several formats and designs; we list a few below but are open to others based on the group’s needs and interests.

Description: We are looking to pilot our framework for data curation in ML and integrate it into ML pipelines. This study aims to understand an ML group’s curation needs and to collaboratively reflect on and enhance its data curation practices using our data curation framework. We anticipate that the collaborative process will result in a progressive learning cycle, which may lead the framework to evolve to better suit the needs of the ML group.

Requirements:

Expected Activities and Outcomes: The findings from this study would result in 1) an improved joint understanding of the benefits of data curation, 2) tailored insights into the team’s curation processes, and 3) deepened cross-disciplinary scholarship. The project would thus provide insights into practical approaches to performing and adopting data curation practices that result in improved datasets and, consequently, improved models.

Potential activities with the expected outcomes are listed below. This is not a comprehensive list, and the activities will depend on the collaborative needs of the project.

  1. Workshop to introduce data curation concepts
  2. Collaborative examination of current curation practices and proposals for improving data workflows
  3. Sample evaluations of dataset documentation for new datasets created by the ML team
  4. Sample evaluations of dataset documentation for existing datasets used by the ML team (or being considered for use)
  5. Examination of dataset(s) to identify how they could be used across various ML efforts

If you are interested, please contact Prof. Christoph Becker at christoph.becker@utoronto.ca.
