We need a 21st century framework for 21st century problems
There are many contexts in which organizations want to use data about people in order to gain aggregate-level insights rather than to learn about a particular individual. For instance, in 2018, the government of Victoria in Australia released a large dataset of transit usage information for analysis. Unfortunately, although the data was “de-identified,” researchers showed that re-identification of users was possible. (See our discussion of this example.) Because of instances such as this, researchers are creating new techniques for privacy-preserving data analysis in order to facilitate publicly beneficial uses while providing strong privacy guarantees.
An important question to ask of Canada’s newly proposed federal privacy legislation—the Consumer Privacy Protection Act (CPPA)—is whether it creates the right incentives both for the adoption of best practices for privacy-preserving data analysis and for the creation of the kinds of technical standards for these practices that will help ensure public trust in their use.
We think the answer to this question is no.
The CPPA creates two new pathways for analyzing data in the manner we just described, where the point is to gain aggregate-level insights rather than learn something about particular individuals. One pathway permits the use of personal information by an organization for internal research, analysis and development purposes (s. 21). The other pathway permits the disclosure of personal information for socially beneficial purposes (s. 39). Although there are many important details, two key features are: 1) that these uses and disclosures are permitted without the consent of the data subject, and 2) that the personal information must be “de-identified.”
It is this reliance on de-identification, rather than the consent issue, that we think is a major problem and that we will focus on in this blog post.
Why de-identified data approaches are insufficient
For the technical community of privacy researchers, de-identification is a failed paradigm. The idea behind de-identification is that you can modify a dataset by removing direct and indirect identifiers, thereby mitigating the privacy risk that anyone using the data could learn something about a specific individual. Further mitigation is possible through other measures such as data-sharing agreements that place legal limits on the use of the data.
The main reason de-identification is considered a failed paradigm is that it relies on the idea of identifiers and quasi-identifiers. But since what identifies an individual depends on what others know or will know about them, there is little hope of determining in advance what counts as an “identifier.” This problem becomes more acute as different pieces of information become available, cumulatively and systematically, through multiple channels both now and in the future.
“Instead of de-identification, the technical community has embraced a different set of approaches to privacy-preserving data analysis, such as differential privacy and homomorphic encryption… these approaches provide far superior privacy guarantees when compared with de-identification.”
Moreover, de-identification methods that focus on removing direct and indirect identifiers produce datasets that are less accurate and less suitable for contemporary data analysis techniques, such as machine learning. The drawbacks of de-identification are therefore severe: it does not protect privacy, because it is impossible to assess the risk of re-identification without making assumptions about what adversaries know or will know in the future, and it also impedes useful analyses of the data for legitimate purposes, because features removed from the data may contain signals that are useful to the analysis. For example, de-identifying healthcare data before using it to train a machine learning model would not only fail to protect patients’ data, it would also prevent leveraging that data for an application that is potentially beneficial to society.
Instead of de-identification, the technical community has embraced a different set of approaches to privacy-preserving data analysis, such as differential privacy and homomorphic encryption, which we have discussed in previous postings. Combined with other technical methodologies such as federated learning and secure computing environments, these approaches provide far superior privacy guarantees when compared with de-identification. Importantly, these approaches provide methods of quantifying privacy risks that do not depend on the kinds of contextual factors that yield so much uncertainty for the de-identification paradigm, better enabling the creation of trustworthy technical standards. For example, differential privacy quantifies the additional privacy risk incurred by a data analysis, and all differential privacy methods have a privacy loss parameter that can be tuned to achieve the desired trade-off between risk and utility.
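To make the role of this privacy loss parameter concrete, here is a purely illustrative sketch in Python (our own toy example, not language from the bill or from any standard) of the classic Laplace mechanism for releasing a noisy count:

```python
import numpy as np

def laplace_count(records, predicate, epsilon):
    """Release a differentially private count of records satisfying `predicate`.

    Adding or removing any one person's record changes the true count by at
    most 1 (the sensitivity), so Laplace noise with scale 1/epsilon yields an
    epsilon-differentially private answer.
    """
    true_count = sum(1 for r in records if predicate(r))
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# Hypothetical transit records, loosely inspired by the Victoria example above.
trips = [{"rider": i, "tapped_on_sunday": i % 7 == 0} for i in range(10_000)]

# Smaller epsilon means a stronger privacy guarantee but a noisier answer.
for epsilon in (0.01, 0.1, 1.0):
    estimate = laplace_count(trips, lambda r: r["tapped_on_sunday"], epsilon)
    print(f"epsilon={epsilon}: noisy count of Sunday riders ~ {estimate:.0f}")
```

The value of epsilon makes the trade-off between privacy risk and utility explicit and quantifiable, in contrast to the open-ended contextual judgments that the de-identification paradigm requires.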
Why the CPPA falls short on privacy protections
Evaluated against this backdrop, the CPPA’s use of de-identification is disappointing.
First, the CPPA defines “de-identify” as “to modify personal information so that an individual cannot be directly identified from it, though a risk of the individual being identified remains” (s. 2(1)). This is a very weak definition, as it focuses on the possibility of direct identification and leaves out the possibility of indirect identification. A dataset might be considered de-identified even if there is a serious possibility—or even likelihood—that an individual could be re-identified through the combination of the dataset with additional data.
Second, the ways in which the CPPA proposes to manage this remaining risk of re-identification are inadequate. The two risk mitigation provisions specifically targeting this new de-identification framework are: 1) a prohibition on re-identification (s. 75), and 2) a proportionality obligation (s. 74). We will look at both through the lens of a dataset that is de-identified according to the CPPA definition, but for which there remains a serious possibility of re-identification of individuals.
While the prohibition on re-identification would purportedly address this scenario—even if re-identification remains highly possible—this prohibition only applies to “organizations” and so leaves many potential actors outside of its scope, such as rogue employees or members of the general public who might gain access to data after a breach. Violation of this prohibition is not subject to the new administrative penalties, although if it is done “knowingly” then the organization can be found guilty of an offence and fined. Negligent re-identification is left without serious penalty.
The proportionality obligation would address this scenario by obliging the organization that de-identified the personal information to “ensure that any technical and administrative measures applied to the information are proportionate to the purpose for which the information is de-identified and the sensitivity of the personal information” (s. 74). This looks like an obligation to ensure that the risk of re-identification is commensurate with the purpose of the use/disclosure and the sensitivity of the information. However, the devil is in the details. Proportionality is an attribute of the means used to reach a goal, and so the goal remains important in determining what proportionality demands. Here, the goal is de-identification, and so the provision remains tethered to the CPPA’s definition and its weaknesses. To return to our example, it might be impossible to “directly” identify an individual in a database but quite easy to “indirectly” identify an individual. If direct identification is impossible, then wouldn’t the proportionality obligation be met?
Finally, neither of these mitigation strategies addresses the fact that the definition of “de-identify” in the CPPA is at odds with the best technical measures we have for enabling privacy-protective data analysis. As already outlined, techniques like differential privacy do not depend upon manipulating data in order to secure privacy guarantees, and so are not easily captured by the CPPA definition. This introduces uncertainty and legal risk for organizations that want to use practices that are superior to de-identification.
Furthermore, adoption of current best practices may simplify the law. For instance, the outcome of a differentially private analysis remains differentially private even if it is post-processed. This suggests that there would be no need for a clause analogous to the one prohibiting re-identification if the law had been written in a way that is compatible with differential privacy.
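To illustrate the post-processing property with a small, self-contained sketch (again our own example, not anything drawn from the legislation): any transformation applied to a differentially private output, such as rounding a noisy count or clamping it to be non-negative, consumes no additional privacy budget because it never touches the underlying personal information.

```python
def postprocess(dp_output: float) -> int:
    """Any function applied to a differentially private output is itself
    differentially private with the same privacy loss: it never touches the
    raw data, so it cannot create new privacy risk."""
    rounded = round(dp_output)   # round the noisy answer to a whole number
    return max(0, rounded)       # clamp, since a count cannot be negative

# Suppose an epsilon = 1.0 differentially private query returned -0.7.
# The cleaned-up value is still epsilon = 1.0 differentially private, so no
# separate prohibition on downstream processing is needed to keep the guarantee.
print(postprocess(-0.7))  # prints 0
```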
One final mitigation strategy is to interpret the security obligations mandated by the CPPA (which largely replicate what currently exists in PIPEDA) to require stronger measures to prevent the risk of re-identification. For example, proper access controls and data-sharing agreements must be in place to ensure that only authorized parties have access to data. However, this does not prevent authorized users from inferring more sensitive information from the data than the access was initially intended for, nor does it prevent a malicious entity from gaining access to the authorized user’s copy of the data and violating the data-sharing agreement. Stronger technical measures for preventing the risk of re-identification would still be needed, yet it is far from clear that a regulator or court would interpret these security obligations to require anything beyond what the specific provisions regarding de-identification specify. This creates a great deal of confusion and uncertainty.
“we need to stop using the language of ‘de-identification’ and the kind of definition for it that relies upon data manipulation.”
What is needed instead?
We have written about alternative reforms to Canadian privacy laws in a previous post, but two aspects are worth re-emphasizing here.
First, we need to stop using the language of “de-identification” and the kind of definition for it that relies upon data manipulation. We propose using “functionally non-identifying” as an alternative, defined as follows:
“Functionally non-identifying information” means information about persons that is managed through the use of privacy-protective techniques and safeguards in order to mitigate privacy risks, so that identifying an individual is not reasonably foreseeable in, or made significantly more likely by, the context of its processing and the availability of other information.
“Privacy-protective techniques and safeguards” mean techniques and safeguards for mitigating privacy risks that can include any of the following, including in combination:
(a) modifying personal information;
(b) creating information from personal information;
(c) privacy-protective analysis;
(d) controlled access through a trusted computing environment.
Second, the proportionality principle found in the CPPA requires a significant revision. Instead of being tied to a problematic definition of de-identification, it should be tied to the idea of managing privacy risks. Furthermore, in order to manage growing systematic privacy risks, this should not be understood narrowly in terms of the immediate risk of re-identification, but in terms of what it means to create risk across the data pipeline. The technical community understands this in terms of the risk of learning something about a specific individual even if that person is not “identifiable.” We therefore propose that legislation include a definition of privacy risk that is more aligned with the differential privacy paradigm:
Privacy risk: This term includes the risk of learning information about a specific individual that could not be learned without their data.
Proportionality: An organization that collects, uses or discloses information that has not been anonymized must ensure privacy-protective techniques and safeguards are applied, such that privacy risks are proportionate to the purpose for which the information is collected, used, or disclosed and the sensitivity of the information. A privacy risk is a risk of learning information about a specific individual.
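For readers who want the technical counterpart of this notion of privacy risk, the standard definition of differential privacy (our gloss, offered for illustration rather than as proposed statutory language) says that a randomized analysis M is ε-differentially private if, for any two datasets D and D′ that differ only in one individual’s data, and for any set S of possible outcomes,

\[
\Pr[M(D) \in S] \le e^{\varepsilon} \cdot \Pr[M(D') \in S].
\]

The parameter ε bounds how much any one person’s data can change what an analyst learns, which is precisely the kind of quantifiable, tunable privacy risk that the proportionality principle above could be anchored to.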
Because it does not move in this direction, the proposed CPPA legislation is disappointing. It keeps us within a 20th century paradigm, instead of providing the regulatory tools and incentives to promote a 21st century paradigm. The result is that our best methods for privacy-preserving data analysis will be seen as unnecessary or even legally risky. This is in nobody’s interest.
Want to learn more? Read the other commentaries in our C-27 series:
About the authors
Lisa Austin is the Chair of Law and Technology and a professor in the University of Toronto’s Faculty of Law. Austin is an associate director at the Schwartz Reisman Institute for Technology and Society, and was previously a co-founder of the IT3 Lab at the University of Toronto, which engaged in interdisciplinary research on privacy and transparency. Austin's research and teaching interests include privacy law, property law, and legal theory. She is published in such journals as Legal Theory, Law and Philosophy, Theoretical Inquiries in Law, Canadian Journal of Law and Jurisprudence, and Canadian Journal of Law and Society. She is co-editor of Private Law and the Rule of Law (Oxford University Press, 2015), in which distinguished Canadian and international scholars take on the general understanding that the rule of law is essentially only a doctrine of public law and consider whether it speaks to the nature of law more generally and thus also engages private law. Austin's privacy work has been cited numerous times by Canadian courts, including the Supreme Court of Canada. She is also active in a number of public policy debates in Canada. In 2017, Austin received a President’s Impact Award from the University of Toronto.
Aleksandar Nikolov is an associate professor of computer science at the University of Toronto, an affiliate at the Vector Institute, and a faculty affiliate at the Schwartz Reisman Institute for Technology and Society. Nikolov’s research interests are in the connections between high dimensional geometry and computer science, and the application of geometric tools to private data analysis, discrepancy theory, and experimental design.
Nicolas Papernot is an assistant professor of electrical and computer engineering at the University of Toronto, a faculty member at the Vector Institute, a Canada CIFAR AI Chair, and a faculty affiliate at the Schwartz Reisman Institute for Technology and Society. Papernot’s research interests are at the intersection of computer security, privacy, and machine learning. His work has been applied in industry and academia to evaluate and improve the robustness of machine learning models to input perturbations known as adversarial examples, as well as to deploy machine learning with privacy guarantees for training data at industry scale.
David Lie is a professor in the Edward S. Rogers Sr. Department of Electrical & Computer Engineering at the University of Toronto, and a research lead at the Schwartz Reisman Institute for Technology and Society. A pioneer in the field of trusted computing hardware, Lie has inspired the technologies that secure mobile payments and fingerprint sensors in today’s smartphones. His current research focuses on mobile and embedded devices, trustworthy computing systems, and bridging the divide between technology and policy.