To guarantee privacy, focus on the algorithms, not the data

by Aleksandar Nikolov and Nicolas Papernot

This piece is the third in a series of posts, from the blog of the Schwartz Reisman Institute for Technology and Society, on the features, implications, and controversies surrounding privacy law reforms in Canada and around the world in an increasingly digital and data-rich context.

What does it mean to protect privacy when analyzing and publishing personal information?

This is the most basic question we need to answer when proposing technical or legal methods to protect privacy, since it provides a test of whether such methods are successful. Ideally, we want a criterion of privacy protection that captures what privacy means for us, but is also precise enough so that we can check when it is satisfied. It turns out that finding such a criterion is quite subtle, and naive approaches often fail.

Bill C-11, also known as the Digital Charter Implementation Act, 2020, aims to reform legislation governing the collection, use, and disclosure of personal information for commercial activity in Canada. The question of defining privacy protection is implicit in Bill C-11.

The bill proposes a definition of an important aspect of privacy protection — preventing the identification of individuals from the release of data:

“De-identify means to modify personal information — or create information from personal information — by using technical processes to ensure that the information does not identify an individual or could not be used in reasonably foreseeable circumstances, alone or in combination with other information, to identify an individual.”

In this post, we argue that this definition is inadequate because it focuses on modifying personal information rather than the method of analysis of this information. We start by illustrating the challenges inherent in de-identification through examples of how naive approaches can fail. We argue that such failures are inherent when de-identification focuses on the data rather than the analysis procedures. We also give a high-level, non-technical overview of recent frameworks of privacy protection developed by researchers in computer science, AI, and statistics, which, unlike the definition above, are centered around algorithms rather than the data themselves.

Attacks against (possibly de-identified) data analyses

Because of the subtleties in defining de-identification, it is often instructive to start instead with failures to de-identify data. One naive attempt at de-identifying data is to simply strip the data of personally identifying information, i.e., to remove common identifiers like names, addresses, social insurance numbers, etc. There are many examples of how this approach usually fails, because seemingly non-identifying pieces of information can often be linked together to uniquely identify people. This observation goes back to pioneering work by Latanya Sweeney, who famously identified former Massachusetts Governor William Weld’s health records from his date of birth, gender, and zip code [S00]. This is known as a “reconstruction attack” (when the attacker reconstructs a data set containing personal information) or a “re-identification attack” (when the attacker can also link the personal information to identifiable individuals). To illustrate how it works, we will take a lesser-known and more recent example: the re-identification of the Myki data set by Culnane, Rubinstein, and Teague [CRT19].

Myki is a smart card-based public transit ticketing system used in the state of Victoria in Australia, not unlike the Presto card used in the Greater Toronto Area: travellers tap on when boarding a bus or train, tap off when exiting, and the system registers their trip and charges them accordingly. As part of a datathon in 2018, the Victoria government released a large data set of Myki usage information, which contestants could use to analyze the state’s public transit system. The data set contained information about trips: when and where a card was tapped on, when and where it was tapped off, and which bus or train line was used. By way of de-identification, card ID numbers were substituted with random strings, and, of course, no names or addresses of registered users were included. However, if the same card was used for several trips, then the same random string was listed for each trip.

The first step the researchers took in breaking the anonymity of this data set feels surprisingly innocuous: they found their own cards in the data set. This is simple: you know several times and places where you took the bus, for example, your commute on two consecutive work days and one weekend trip. If there is only one card in the data set that was used for all of these trips, then that card must be yours. This works most of the time because, as it turns out, only two trips are usually enough to identify a card uniquely.
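To make this matching step concrete, here is a minimal sketch of the idea in Python using pandas. The table, column names, and card IDs are all invented for illustration; the real Myki release was far larger and richer than this.

```python
import pandas as pd

# Hypothetical de-identified trips table: one row per tap-on,
# with card IDs replaced by random strings (as in the Myki release).
trips = pd.DataFrame({
    "card_id":     ["a9f3", "a9f3", "b771", "a9f3", "b771", "c0d2"],
    "tap_on_time": ["2018-05-07 08:12", "2018-05-08 08:09", "2018-05-07 08:12",
                    "2018-05-12 14:30", "2018-05-09 17:45", "2018-05-12 14:30"],
    "stop":        ["Flinders St", "Flinders St", "Flinders St",
                    "St Kilda", "Richmond", "St Kilda"],
})

# Trips I remember taking: two morning commutes and one weekend outing.
my_trips = [("2018-05-07 08:12", "Flinders St"),
            ("2018-05-08 08:09", "Flinders St"),
            ("2018-05-12 14:30", "St Kilda")]

# Keep only the cards that appear in *all* of the remembered trips.
candidates = None
for time, stop in my_trips:
    match = set(trips[(trips.tap_on_time == time) & (trips.stop == stop)].card_id)
    candidates = match if candidates is None else candidates & match

print(candidates)  # {'a9f3'}: a unique match, so this must be my card
```

The same intersection trick, applied to trips you shared with someone else, is what makes the second, more damaging step described next possible.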

While identifying your own card scarcely feels like a privacy violation, it enables a much more damaging second step: identifying the card of someone you know. For example, knowing which card in the data set is yours, you can easily identify the card of a coworker. Perhaps the coworker takes the bus home at the same time as you, and maybe you had that one work dinner together. You can check which cards were used with yours at these times, and, again, it turns out that, more often than not, there is a unique match in the data set. Having identified the card of your coworker, you can find out any other trip they have taken: weekend trips, doctor visits, or other excursions that they probably expect to be private.

The researchers also showed that such privacy breaches can be extended further by linking the data set with publicly available information. They cross-referenced the tweets of Anthony Carbines, a member of the Victorian Legislative Assembly, with the data set, and, since Carbines often proudly tweets about the Victoria public transit system when he’s boarding a train, they could identify his card as well.

Let us draw several lessons from this attack:

  • Anything is potentially personally identifying information, because combining a few innocuous pieces of information often identifies a person uniquely.

  • It matters who is doing the identification. It is likely that whoever is curious about your private information already knows a lot about you. This knowledge can be leveraged to find out even more.

  • De-identification is hard in a connected world. Twitter, LinkedIn, and other social networks and websites provide easily accessible information that an attacker can leverage to facilitate re-identifying people in a supposedly de-identified data set.

In many practical scenarios, an adversary will not have direct access to the data but instead to a by-product of the data. This could be, for instance, the output of a database query, or the prediction of a machine learning model trained on the data. One may hope that, because such by-products only contain aggregate statistical information, they are inherently privacy-preserving. This hope, unfortunately, does not bear out in reality.

As a first example, let us take reconstruction attacks from counts. Researchers and official statistics agencies often publish data summaries in the form of tables which contain counts of how many people satisfy some property: these can be summaries of surveys, or voting polls, or census data published by Statistics Canada or the US Census Bureau. Such counts feel safer, from a privacy perspective, than fine-grained data sets such as the Myki data set. They can, however, still enable an attacker to identify individuals. Imagine, for example, that someone poses two counting queries to the data set: “How many computer science professors at the University of Toronto smoke?” and “How many computer science professors at the University of Toronto who were not born in Bulgaria smoke?” Each count is (probably) a relatively large number, so neither seems like a threat in isolation, but subtracting the second number from the first immediately reveals whether one of the authors of this blog post smokes.
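This differencing trick is easy to express in code. The sketch below uses a made-up toy table (the people and attributes are entirely fictional) and a naive query interface that answers counts exactly:

```python
# Toy illustration of a differencing attack on counting queries.
# The records are invented; "target" is the only record born in Bulgaria,
# standing in for the person the attacker is curious about.
records = [
    {"name": "target", "born_in_bulgaria": True,  "smokes": False},
    {"name": "p1",     "born_in_bulgaria": False, "smokes": True},
    {"name": "p2",     "born_in_bulgaria": False, "smokes": False},
    {"name": "p3",     "born_in_bulgaria": False, "smokes": True},
]

def count(predicate):
    """Answer a counting query exactly, as a naive interface might."""
    return sum(1 for r in records if predicate(r))

q1 = count(lambda r: r["smokes"])                               # all smokers
q2 = count(lambda r: r["smokes"] and not r["born_in_bulgaria"]) # smokers not born in Bulgaria

# Each answer looks harmless on its own, but their difference is exactly the
# target's private attribute: 1 if they smoke, 0 if they do not.
print(q1 - q2)  # 0 -> the target does not smoke (in this toy data)
```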

More sophisticated versions of this kind of attack were introduced by Dinur and Nissim [DN03], who showed that, even if some noise is added to the counts, they can still be used to reconstruct private information about most individuals in a data set. Recently, researchers showed that these attacks are much more than a theoretical threat and can be successfully carried out against Diffix, a system specifically designed to answer counting queries while protecting privacy [CN20]. Moreover, the US Census Bureau has conducted its own reconstruction experiments and concluded that reconstructing sensitive microdata from the 2010 Decennial Census is possible, at least theoretically [GAM18]. At a high level, reconstruction attacks are possible whenever the attacker can get accurate enough answers to many uncorrelated counting queries, i.e., counting questions that ask about properties that do not overlap much. (A precise mathematical condition was identified by Muthukrishnan and one of the authors of this post [MN12].)

Next, we take the example of a machine learning (ML) application. Typically, a machine learning model is obtained by analyzing a set of data known as the training set. This set contains examples of inputs to the model along with the target “label” (some kind of meaningful information about the inputs) we expect the model to predict. For instance, the inputs could be information extracted from a medical record and the target label would indicate whether or not the corresponding patient would eventually be readmitted to the hospital after being discharged. Once the model is trained, we can then use it to make predictions on previously unseen test data for which we do not know the labels. That is, the model can predict whether or not the medical record of a new patient is indicative of them being at risk of later readmission to the hospital if they are discharged. We expect that if the ML model is presented with unseen test data that is close enough to the training data (in technical terms, from the same distribution), the model will generalize to this new test data and make correct predictions.

In practice, however, models are almost always imperfect and tend to perform better on their training data than on their test data. This means that, for any particular piece of data, the model is more likely to make the right prediction if that data was already analyzed during training than if the model is seeing it for the first time.

This leaks private information. Why?

Because observing the model’s predictions gives an adversary an advantage in determining the membership of a piece of data in the model’s training set. Imagine that the model comes from our running example of a medical application, so the model is trained on people who were all being treated at a hospital. Then inferring membership of a record in this training set reveals private information about the corresponding patient; namely, that they may suffer from a medical condition.

How do these attacks work? The adversary can guess that a piece of data was included in the training data if the model makes a correct prediction and, conversely, that a piece of data was not included in the training data if the model makes an incorrect prediction. By making guesses in this manner, we are more likely to be right about a piece of data’s membership in the training set than if we were to randomly flip a coin and guess membership accordingly. This is because we’ve observed that the model is more likely to make correct predictions on its training data than on test data. Our recent research demonstrates that performing this attack only requires that the adversary be able to observe the label predicted by the model [CTC20], but the attack can be performed with fewer interactions with the model if the adversary can also observe the confidence of the model as it predicts [SSS17].
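To make the attack concrete, here is a minimal sketch of this “guess member if and only if the prediction is correct” strategy. The data, model, and train/test split below are synthetic stand-ins chosen only to make the accuracy gap visible; this is not the exact attack from [CTC20] or [SSS17].

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; in the running example from the text these would be
# features of medical records and readmission labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 20))
y = (X[:, 0] + 0.5 * rng.normal(size=2000) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

# A deliberately overfit model, so that the train/test accuracy gap is visible.
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

def guess_is_member(x, label):
    """Guess 'member of the training set' exactly when the model predicts the label correctly."""
    return model.predict(x.reshape(1, -1))[0] == label

# Fraction of correct guesses on true members and on true non-members.
member_hits = np.mean([guess_is_member(x, l) for x, l in zip(X_train, y_train)])
nonmember_hits = np.mean([not guess_is_member(x, l) for x, l in zip(X_test, y_test)])
print("attack accuracy:", (member_hits + nonmember_hits) / 2)  # noticeably above 0.5
```

Because the overfit model is nearly always right on its training points and less often right on held-out points, these guesses are correct more than half the time, which is exactly the leakage described above.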

How do we move forward?

What this points to is that, rather than focusing on making a particular set of data private (e.g., through de-identification by removing personally identifying information), the scientific community has discovered that making an analysis technique (or algorithm) private provides more meaningful guarantees. Seeing a data set in isolation makes it hard, if not impossible, to decide whether the data are successfully de-identified. As we observed already, this depends on what an attacker trying to re-identify individuals knows, and what additional information may be available from other sources. Without knowing who may try to do the re-identification, and what information they possess, or which individuals or what new information they are interested in, we cannot decide if a data set is safe for publication. If we know, however, the method (i.e., the algorithm) through which the data was analyzed to produce some by-product of it, such as a table of counts or a machine learning model, then we can actually make guarantees that hold against any possible attacker, with any kind of side information. This insight was developed initially in work by Dwork, McSherry, Nissim, and Smith [DMNS06], who introduced a seminal framework for private data analysis known as differential privacy. Differential privacy has seen an increasing number of recent adoptions, both in industry (by Google, Apple, Facebook, among others) and by official statistics agencies, most notably the US Census Bureau starting with the 2020 Decennial Census.

As we mentioned, differential privacy defines privacy with respect to an algorithm used to analyze data, rather than with respect to the data themselves.

Informally, the definition can be described using a thought experiment. Imagine that we execute the algorithm on a data set, and we also execute it on the data set with the data of one individual modified. To be differentially private, the algorithm must behave almost identically in these two scenarios, and this must happen for all data sets and all modifications to the data of any one individual. For example, a differentially private algorithm must have the property that if it produces a particular machine learning model from a data set that includes your data, then it is almost as likely to produce the same model if your data were removed or changed completely. This means that an adversary observing or using the model output by this algorithm is unable to tell whether your data (or any other person’s data) were used to train the model at all. Then, whatever information an attacker may learn about you from observing the algorithm’s output could have also been learned even if your data were never seen by the algorithm.

Differentially private algorithms have the property that they reveal statistical information about the data, such as, for example, correlations between genetic factors and medical conditions, but do not reveal information about particular individuals, such as whether they have some medical condition. This is because statistical information does not depend crucially on any one person’s data, while private personal information does. We should note that the actual technical definition is quantitative, and the level of privacy protection is tunable, allowing us to trade privacy off against the accuracy of the machine learning model or of the query answers produced by the algorithm.
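For readers who want the precise statement, the quantitative guarantee from [DMNS06] can be written as follows: a randomized algorithm M is ε-differentially private if, for every pair of data sets D and D′ differing in one individual’s data, and for every set S of possible outputs,

```latex
\Pr[\,M(D) \in S\,] \;\le\; e^{\varepsilon} \cdot \Pr[\,M(D') \in S\,].
```

The smaller the parameter ε, the closer the two output distributions must be, and hence the stronger the privacy guarantee.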

Let us illustrate this with a short example of how differential privacy can be achieved. Suppose we just want to release two counts, such as our previous examples “How many computer science professors at the University of Toronto smoke?” and “How many computer science professors at the University of Toronto who were not born in Bulgaria smoke?” Then we can add some random noise to each count, for example by flipping some number of coins and adding the number of tails we get to the first count, and doing the same with new coins for the second count. Now, even if an attacker subtracts the two noisy counts from each other, the result they get will be dominated by the random noise, and they cannot tell if this blog post’s authors smoke. The more coins we flip, the better the privacy protection. At the same time, we can achieve good privacy protection with many fewer coin flips than the number of computer science professors at the University of Toronto, giving us reasonably accurate answers to the two counting questions.
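Here is a minimal sketch of the coin-flipping idea in the paragraph above. The counts are made up, and literal coin flips are only an illustration; deployed systems calibrate their noise more carefully, for example by adding Laplace noise as proposed in [DMNS06].

```python
import numpy as np

rng = np.random.default_rng()

def noisy_count(true_count, num_coins=50):
    """Add the number of tails from num_coins fair coin flips, re-centred so the
    noise has mean zero and the published answer stays unbiased."""
    tails = rng.binomial(n=num_coins, p=0.5)
    return true_count + tails - num_coins // 2

# Made-up true answers to the two counting queries; they differ by one person's data.
smokers = 40
smokers_not_born_in_bulgaria = 39

noisy_q1 = noisy_count(smokers)
noisy_q2 = noisy_count(smokers_not_born_in_bulgaria)

# The attacker's difference is now dominated by the independent noise in the two
# answers, so it no longer pins down whether that one person smokes.
print(noisy_q1, noisy_q2, noisy_q1 - noisy_q2)
```

With 50 coin flips the noise in each answer is only a few counts wide, so the published totals remain useful, while the difference of the two noisy answers varies by much more than the single count it was meant to expose.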

Conclusion

Given what the scientific community has learned about the risks of re-identification attacks, and about defining privacy protection rigorously, we advocate that de-identification be interpreted in terms of how data is analyzed. Combined with proper access control measures to ensure that the data themselves are not directly accessible, analyzing data with algorithms that provide rigorous guarantees of privacy allows us to envision a future where respectful, yet nevertheless useful, applications of data analysis are possible.

References

[CN20] Aloni Cohen, Kobbi Nissim. 2020. “Linear Program Reconstruction in Practice”. Journal of Privacy and Confidentiality 10 (1).

[CTC20] Christopher A. Choquette-Choo, Florian Tramer, Nicholas Carlini, Nicolas Papernot. 2020. Label-Only Membership Inference Attacks.

[CRT19] Chris Culnane, Benjamin Rubinstein, Vanessa Teague. 2019. Stop the Open Data Bus, We Want to Get Off.

[DMNS06] Cynthia Dwork, Frank McSherry, Kobbi Nissim, Adam Smith. 2006. Calibrating Noise to Sensitivity in Private Data Analysis. TCC 2006

[DN03] Irit Dinur, Kobbi Nissim. Revealing information while preserving privacy. PODS 2003.

[GAM18] Simson Garfinkel, John Abowd, Christian Martindale. Understanding Database Reconstruction Attacks on Public Data. ACM Queue, Vol. 16, No. 5 (September/October 2018): 28-53.

[MN12] Aleksandar Nikolov, S. Muthukrishnan. Optimal Private Halfspace Counting via Discrepancy. STOC 2012.

[S00] Latanya Sweeney. Simple demographics often identify people uniquely. Health 671, 1–34 (2000).

[SSS17] Reza Shokri, Marco Stronati, Congzheng Song, Vitaly Shmatikov. Membership Inference Attacks Against Machine Learning Models. IEEE Symposium on Security and Privacy 2017.