The humanity of data: Lessons from data production and data governance
Human beings are at the core of the data economy. Machine learning systems and other AI technologies largely rely on data created by people. Some of these data are sourced “organically,” from people’s everyday online interactions. Other data come from “data workers,” who are tasked with producing specific datasets.
Current approaches to data governance, however, do not fully grapple with the realities of data production. At the Schwartz Reisman Institute's recent graduate workshop, "Views on Techno-Utopia," hosted as part of SRI’s inaugural annual conference Absolutely Interdisciplinary, presenters discussed the disconnect between the complex circumstances in which data used to train AI systems are sourced and the laws and regulations designed to protect people’s data rights.
According to Jamie Duncan, a PhD student at the University of Toronto's Centre for Criminology and Sociolegal Studies, “treating data primarily as an economic good neglects its humanity. Personal data about us is always created in the context of our social and technical interactions.”
One common context in which data are created is crowdsourcing platforms, where individuals perform discrete on-demand tasks, such as data annotation and verification, in exchange for small payments. Julian Posada, an SRI graduate fellow and PhD candidate at U of T's Faculty of Information, conducted an original field study to shed light on this data production process.
A case study in data production
Technology companies routinely outsource labor-intensive data production to crowdsourcing platforms. According to Posada, important questions about this practice remain open, including where this labor originates and who the workers behind such platforms actually are.
To answer these questions, Posada analyzed the internet traffic of several major crowdsourcing platforms, finding that the greatest number of data workers outside the United States were located in Venezuela.
The next step in Posada’s research was to begin a conversation with these data workers. “I talked to Venezuelan workers about why they decided to work for these platforms,” explains Posada. A recurring explanation was Venezuela’s distinct combination of economic, social, and political crises, coupled with widespread computer and internet access, which enables workers to perform crowdsourcing tasks.
Melba, a Venezuelan retiree, explained to Posada that “even though I receive a salary and a pension, and my husband as well, our paychecks don’t cover anything. I wonder, how can people survive here in Venezuela when someone like me, with a [monthly] pension worth 1,800,000 bolivares [around 1 USD], can’t buy half a dozen eggs?”
Posada's field research also uncovered new insights into who participates in data work. María, a hair stylist, described how her entire family, including her three children, participates in fulfilling freelance data labour contracts: “We distribute the tasks. For example, I can go buy groceries and the others stay working at home. We all cooperate; we are a team.”
It is unclear whether technology companies or regulators are familiar with these data work arrangements.
Challenges in data governance
In his presentation, Jamie Duncan emphasized how current approaches to data governance and policy fail to directly confront the human experiences and values underlying data production.
For example, Duncan suggests that while Canada’s Bill C-11 would grant citizens new data rights—such as a right to data erasure and the right to move personal information between data collectors—the exemptions contained in the bill are too broad, and fail to come to terms with the humanity of personal data.
“While framed as a charter of rights, I think the bill advances a market-oriented vision of digital citizenship that reflects a social contract characterized by non-negotiability more akin to corporate terms of service,” contends Duncan.
This critique applies to other data governance proposals. According to Duncan, data trusts, data aggregators, and other data rights proposals can overlook the humanity inherent in personal data. “Data trusts promise to introduce new forms of accountability and could potentially even enhance the economic negotiating position of data subjects,” notes Duncan. “But the mechanism of the trust as it exists currently is limited in its focus on managing property and not the conditions under which that property is produced.”
Duncan argues that data governance must begin by recognizing that personal data are not a raw material or commodity. Instead, data are produced through a complex web of human interactions and sociotechnical systems.
New directions
Data policy and governance remains an emerging field. It has much to learn from both rich qualitative field studies, such as Posada’s first-hand account of the experiences of data workers, and detailed considerations of the implications of policy, such as those offered by Duncan. More comprehensive, quantitative studies could also improve our understanding of the data production process—including its external sociological impacts that lie beyond the immediate pipeline of commodified data.
Future efforts to understand and regulate data production could also benefit from more clearly defined guiding principles. Duncan offers one potential direction: “Our identities and actions as citizens exist in the context of the communities we belong to. As these relationships become increasingly datafied, it is essential that we recognize that we are in this together.”