The data-production dispositif: How to analyze power in data production for machine learning
As researchers in computer and information science with backgrounds in sociology and communication, we have been studying how data for machine learning (ML) is produced, notably when the production chain relies on outsourced workers. Because ML demands large amounts of data and data production is labour-intensive, many organizations turn to crowdsourcing platforms that engage independent contractors working from home, and to business process outsourcing (BPO) companies that hire workers on-site. These workers perform what we call “data work,” which involves generating and annotating data and verifying algorithmic outputs.
Throughout our multi-year research project analyzing data work in both BPOs and platforms—including interviews with workers, ML practitioners, and company managers, and the study of company documentation, including instructions—we have explored how social interactions shape outsourced data work and how labour conditions shape data. We call the ensemble of discourses, actions, and objects (interfaces and documentation) the “data-production dispositif,” borrowing from Foucauldian studies of power in social settings. Our analysis focuses on Latin America, particularly on Venezuelan and Argentine data workers.
What is the “data-production dispositif”?
In a new paper forthcoming in the journal Proceedings of the ACM on Human-Computer Interaction, we study this phenomenon, first by analyzing how data is classified by platform and BPO clients. We found that the taxonomies in instruction documents reflect the worldviews of ML developers, the economic interests of their companies, and the legal frameworks of their countries of origin. For example, most labels for hate speech in social media, used to train content moderation algorithms, reflect protected categories under US law and forms of racial classification seen primarily in that country, because most social media companies are based there.
In the discourses present in instructions and taxonomies, we also found a preference for binary classifications, which are easier to compute than more complex taxonomies. For instance, in instructions to label images of faces for computer vision, requesters ask workers to identify “female” and “male” faces. Through this request, companies are not only asking workers to guess someone’s gender, but also imposing a binary way of seeing gender onto the data.
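To make this concrete, here is a minimal sketch of how such a taxonomy might be encoded in an annotation tool’s configuration. The labels, field names, and function are hypothetical rather than drawn from any real instruction document; the point is that whatever set of categories a requester hard-codes becomes the only way of seeing available to the worker, and the only values that can ever reach the training data.

```python
from enum import Enum

# Hypothetical label schema for a face-annotation task (illustrative only).
# The requester's binary view of gender is baked into the schema itself.
class PerceivedGender(Enum):
    FEMALE = "female"
    MALE = "male"
    # No "unknown" or "prefer not to classify" option exists:
    # the worker must guess, and the dataset inherits the guess.

def annotate(image_id: str, label: PerceivedGender) -> dict:
    """Return an annotation record in the form a platform might store."""
    return {"image_id": image_id, "perceived_gender": label.value}

if __name__ == "__main__":
    # Whatever the person pictured actually identifies as,
    # only two values can ever appear in the resulting dataset.
    print(annotate("img_0001.jpg", PerceivedGender.FEMALE))
```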
Forms of classification have existed throughout human history, and they have reflected evolving ways of categorizing people. For example, the United States’ census has changed the question used to collect data on race numerous times, from including just three categories (“slaves,” “free Whites,” and “all other free persons”) in 1790, to two options for ethnicity and fifteen for race in 2020.
Taxonomies are therefore both a subjective and a political matter, especially when they involve the classification of people. These classifications become most problematic when the scope of data collection and production is global, involving workers from across the globe and algorithms deployed across different jurisdictions. This myriad of players, and the fear that workers might influence the data, has prompted many companies to restrict workers’ agency, whether through platform algorithms or through the managerial procedures of BPOs.
Beyond classifications, we found that the data-production dispositif also manifests through actions. Instruction documents provided by BPOs constantly remind workers that they can be reprimanded or fired for not following instructions. The dominant mindset is that “the client is always right.” Given that most workers face precarious working conditions, and many report high levels of dependency on this work due to local economic conditions (e.g., the recent crises in Venezuela and Argentina), current forms of outsourced data production effectively guarantee worker obedience.
Furthermore, the design of data annotation interfaces leaves no room for workers to provide feedback to clients or voice their concerns. Even when workers find errors in the tasks—a frequent occurrence—the fear of being expelled from a project or fired altogether, combined with the limited communication channels available, prompts many workers to remain silent.
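As an illustration of this design point, the sketch below contrasts two hypothetical task submissions an annotation interface might send back to a requester. Neither reflects any specific platform’s API; the names and fields are invented for illustration. The presence or absence of a single field determines whether a worker’s doubts can travel upstream at all.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical data models for an annotation interface (not any real platform's API).

@dataclass
class SubmissionWithoutFeedback:
    """What many interfaces collect: the label and nothing else."""
    task_id: str
    label: str

@dataclass
class SubmissionWithFeedback:
    """A feedback-aware alternative: the worker can flag problems in the
    task itself instead of staying silent or deviating from instructions."""
    task_id: str
    label: str
    instructions_unclear: bool = False
    worker_comment: Optional[str] = None

if __name__ == "__main__":
    # With the first model, a worker who spots an error in the task
    # can only stay silent or risk removal by deviating from instructions.
    silent = SubmissionWithoutFeedback(task_id="t_42", label="male")

    # With the second, the discrepancy can travel back to the requester.
    flagged = SubmissionWithFeedback(
        task_id="t_42",
        label="male",
        instructions_unclear=True,
        worker_comment="Image is blurred; the requested label cannot be determined.",
    )
    print(silent, flagged, sep="\n")
```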
We found that these three elements—the discourses coded in instructions to workers, labour practices and social contexts, and the technological design of the interfaces workers use—combined with the relations of power and submission they generate, constitute the data-production dispositif. This paradigm has developed in response to the AI industry’s voracious demand for more and cheaper data, alongside a lack of effective international regulation. The dispositif produces workers who are alienated from the rest of the ML development process, who come from low-income areas, who depend on this work, and who are considered easily replaceable by the companies that employ them.
“Our focus should not remain only on the short or long-term consequences of technology but should also pay attention to how technology comes to be, and how its production affects those workers who remain overexploited and pushed to the margins.”
Alternative approaches to data work: three proposals
Our paper calls for dismantling this system of exploitation. We know that this is no easy task. Nonetheless, we propose three alternatives and state why they are necessary:
Fighting alienation by making the ML pipeline visible to workers.
Instead of encouraging workers to “think like a machine” and remain obedient, companies should consider data workers as an integral part of the ML development process and provide workers with information about the final ML product that will be trained on the datasets produced by them.
Why would companies want to involve data workers more in the data production process? Knowing the purpose behind data-related tasks could help workers perform better and produce better data. Communication and feedback channels would allow workers to communicate discrepancies to requesters without fearing retaliation.
Fighting precarization by considering data workers as assets.
The protection of human rights is a fundamental aspect of ethical AI systems. Since labour rights are human rights, dignified working conditions in AI development become necessary.
Why would requesters want to guarantee decent working conditions in outsourcing? In our comparison of contractual BPO workers and externalized platform workers, we found higher levels of engagement and a stronger sense of belonging among the former group. Improving labour conditions in data work is not only ethically correct but could also have positive effects on data quality.
Fighting workers’ surveillance and control by encouraging feedback loops.
We found that companies that accept feedback from workers can identify and correct errors more quickly. Worker feedback provides valuable information that can help improve data throughout the generation and annotation process.
Why would requesters want their logic to be questioned? Given that classification is inherently subjective, allowing workers—and other stakeholders, such as those potentially harmed by an algorithm—to interrogate taxonomies can lead to classifications that reflect more than a dominant worldview.
Through our research, we have developed a novel mode of analysis for investigating power in ML data and systems. We hope that the computing community and the public at large will pay closer attention to how large datasets come to be, to the full diversity of stakeholders involved in the process, and to how they are treated. Our focus should not remain only on the short or long-term consequences of technology but should also pay attention to how technology comes to be, and how its production affects those workers who remain overexploited and pushed to the margins.
About the authors
Milagros Miceli
DAIR Institute; Weizenbaum Institute
Milagros Miceli is a sociologist currently finishing her PhD in computer science at Technische Universität Berlin. She is a research fellow at the Distributed Artificial Intelligence Research (DAIR) Institute, where she is developing a research philosophy to avoid exploitative practices in AI, grounded especially in the experiences of data workers. Her research interrogates the work practices of machine learning data production and focuses on power differentials as well as their effects on datasets and system outputs. In Fall 2022, Miceli will join the Weizenbaum Institute in Berlin to lead the new research group Data, Algorithmic Systems, and Ethics.
Julian Posada
Schwartz Reisman Institute, University of Toronto
Julian Posada is a researcher and educator currently finishing his PhD in information studies at the University of Toronto. In Summer 2022, he will join Yale University’s Department of American Studies as a postdoctoral associate and, the following year, as an assistant professor of Critical Information Studies. Posada’s interdisciplinary research on labour and technology has been published in the journals Information, Communication & Society and the Proceedings of the ACM on Human-Computer Interaction, and in chapters published by Oxford University Press and SAGE. As a scholar committed to public engagement, his work has also been featured in venues such as Logic Magazine, Medium, and Notes from Below.