Your identity in anonymized databases might not be as secure as researchers once thought, according to a paper published recently in Nature Communications. In the paper, researchers shared code that can identify nearly 100% of Americans from practically any available dataset using as few as 15 attributes, Gina Kolata writes for the New York Times.
Is anonymized data really anonymous?
Throughout most of the world, anonymized data are not considered personal data and can be sold and shared without running afoul of privacy laws. However, while such data are anonymized to obscure individual identities, they still contain plenty of "attributes" about a person or household, Kolata writes.
In the paper published in Nature Communications, the researchers revealed they had developed a computer program that could identify 99.98% of Americans from nearly any available dataset with as few as 15 attributes, including gender, marital status, and ZIP code.
Yves-Alexandre de Montjoye, a computer scientist at Imperial College London and lead author of the paper, said the study shows current methods of anonymizing data are insufficient. "We need to move beyond de-identification," he said. "Anonymity is not a property of a data set, but is a property of how you use it."
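To make the idea of identifying attributes concrete, the sketch below counts how many records in a small table are the only ones with their particular combination of values. It is a simplified illustration, not the researchers' published code or statistical model; the function name `unique_fraction` and the six toy records are invented for this example.

```python
from collections import Counter


def unique_fraction(records, attributes):
    """Return the share of records whose combination of values for
    `attributes` appears exactly once in `records`."""
    if not records:
        return 0.0
    # Build one tuple of attribute values per record.
    keys = [tuple(record[name] for name in attributes) for record in records]
    counts = Counter(keys)
    # A record is "unique" if nobody else shares its combination.
    unique = sum(1 for key in keys if counts[key] == 1)
    return unique / len(records)


# Hypothetical, invented records with no names attached.
records = [
    {"gender": "F", "marital_status": "married", "zip": "10001"},
    {"gender": "F", "marital_status": "married", "zip": "10001"},
    {"gender": "M", "marital_status": "single", "zip": "10001"},
    {"gender": "F", "marital_status": "single", "zip": "94110"},
    {"gender": "M", "marital_status": "married", "zip": "94110"},
    {"gender": "M", "marital_status": "married", "zip": "60614"},
]

one_attr = unique_fraction(records, ["gender"])
three_attr = unique_fraction(records, ["gender", "marital_status", "zip"])
print(one_attr, three_attr)
```

In this toy table no record is unique on gender alone, but four of the six become unique once marital status and ZIP code are added. Adding attributes can only keep that share the same or raise it.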
To share or not to share
Typically, when researchers discover a security flaw, they report the flaw to the vendor or the government, Kolata writes. But in this case, anonymized data are held all over the world, and all of those data are at risk, de Montjoye said.
That left the researchers with a choice: Say nothing, or publish the code so data vendors can secure future data.
They decided to publish it. "This is very hard," de Montjoye said. "You have to cross your fingers that you did it properly, because once it is out there, you are never going to get it back."
Yaniv Erlich, chief scientific officer at MyHeritage, agreed with the researchers' decision. "It's always a dilemma," he said. "Should we publish or not? The consensus so far is to disclose. That is how you advance the field: Publish the code, publish the finding."
How to solve the problem
The finding raises questions about how best to protect data that are supposed to be anonymized.
One way to limit the privacy risk of anonymized data is by controlling access to the data, Kolata writes. For example, someone who wants sensitive data such as medical records would have to access them in a secure room where the data cannot be copied and everything that is done with the data is recorded.
According to Kamel Gadouche, CEO of C.A.S.D., a research center in France that uses these methods, researchers would be able to access the data remotely, but "there are very strict requirements for the room where the access point is installed."
However, this method isn't perfect, Kolata writes. If researchers want to confirm the results of a research paper for a scientific journal using those data, gaining access to them would be a challenge.
Another potential solution is what's called "secure multiparty computation," Kolata writes.
"It's a cryptographic trick," Erlich said. "Suppose you want to compute the average salary for both [of] us. I don't want to tell you my salary and you don't want to tell me yours." Instead, each party provides encrypted information, which is then decoded by a computer.
"In theory, it works great," Erlich said. However, for scientific research, the method is somewhat limited. For example, if the end result seems incorrect, "you cannot debug it, because everything is so secure you can't see the raw data," Erlich said.
Ultimately, Erlich said data gathered on people will never be entirely private. "You cannot reduce risk to zero," he said (Kolata, New York Times, 7/23).
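Erlich's salary example can be sketched with additive secret sharing, one common building block of secure multiparty computation. This is an assumption made for illustration, not necessarily the scheme Erlich had in mind; the names `make_shares` and `secure_average`, the modulus, and the salary figures are all hypothetical, and the whole exchange is simulated in one process rather than over a network.

```python
import secrets

# Assumed public modulus; it must exceed the sum of all salaries.
MODULUS = 2**61 - 1


def make_shares(value, n_parties):
    """Split `value` into `n_parties` random-looking shares that add up
    to `value` modulo MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    last = (value - sum(shares)) % MODULUS
    return shares + [last]


def secure_average(salaries):
    """Simulate parties computing their average salary while exchanging
    only shares and partial sums, never a raw salary."""
    n = len(salaries)
    # Party i splits its salary; share j is the one sent to party j.
    all_shares = [make_shares(salary, n) for salary in salaries]
    # Party j adds up the shares it received and publishes only that sum.
    partial_sums = [
        sum(all_shares[i][j] for i in range(n)) % MODULUS for j in range(n)
    ]
    # The published partial sums combine into the true total.
    total = sum(partial_sums) % MODULUS
    return total / n


# Hypothetical salaries for the two people in the quoted example.
average = secure_average([85000, 62000])
print(average)
```

With only two parties, each can work out the other's salary from the average, so the approach is more meaningful with three or more participants. The sketch also shows the debugging limit Erlich describes: the intermediate shares and partial sums look like random numbers, so a wrong result cannot be traced back to a bad input.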