Differential Privacy Explained

These days, companies use more and more of our data to improve their products and services, and that makes a lot of sense. After all, it's better to measure what your users like than to guess and build products that no one wants to use. However, this can also be dangerous. It undermines our privacy, because the collected data can be quite sensitive and could cause harm if it leaked. So companies love data because it improves their products, while we as users want to protect our privacy.

These conflicting needs can be reconciled with a technique called differential privacy.

It allows companies to collect information about their users without compromising the privacy of an individual.

Why would we go through all this trouble?

Companies can just take our data, remove our names, and call it a day. Right?
Well, not quite. First of all, this anonymization process usually happens on the servers of the companies that collect your data, so you have to trust them to really remove the identifying information. And secondly, how anonymous is anonymized data really? In 2006, Netflix started a competition called the Netflix Prize.

Competing teams had to create an algorithm that could predict how someone would rate a movie. To help with this challenge Netflix provided a dataset containing over 100 million ratings submitted by over 480,000 users for more than 17,000 movies.

Netflix, of course, anonymized this data set by removing the names of users and by replacing some ratings with fake, random ratings. Even though that sounds pretty anonymous, it actually wasn't. Two computer scientists from the University of Texas published a paper in which they showed that they had successfully identified people in this data set by combining it with data from IMDb. These types of attacks are called linkage attacks, and they happen when pieces of seemingly anonymous data can be combined to reveal real identities.

Another, creepier example is the case of the governor of Massachusetts. In the mid-1990s, the state's Group Insurance Commission decided to publish the hospital visits of state employees. They anonymized this data by removing names, addresses, and other fields that could identify people. However, computer scientist Latanya Sweeney decided to show how easy it was to reverse this.

She combined the published health records with voter registration records and simply narrowed down the list. There was only one person in the medical data who lived in the same ZIP code, had the same gender, and had the same date of birth as the governor, thus exposing his medical records. In a later paper, she noted that 87% of all Americans can be identified with only three pieces of information: ZIP code, date of birth, and gender.
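The narrowing-down step can be sketched in a few lines of Python. This is only an illustration of the idea, not a reconstruction of Sweeney's actual method: every record, name, ZIP code, and the function name link_records below is made up for the example.

```python
from collections import defaultdict

# Hypothetical "anonymized" medical records: names removed, other fields kept.
medical_records = [
    {"zip": "00001", "birth_date": "1950-01-15", "gender": "M", "diagnosis": "condition A"},
    {"zip": "00002", "birth_date": "1972-06-30", "gender": "F", "diagnosis": "condition B"},
    {"zip": "00003", "birth_date": "1985-11-02", "gender": "F", "diagnosis": "condition C"},
]

# Hypothetical public voter roll: names are present.
voter_roll = [
    {"name": "Voter One", "zip": "00001", "birth_date": "1950-01-15", "gender": "M"},
    {"name": "Voter Two", "zip": "00002", "birth_date": "1972-06-30", "gender": "F"},
    {"name": "Voter Three", "zip": "00002", "birth_date": "1972-06-30", "gender": "F"},
    {"name": "Voter Four", "zip": "00004", "birth_date": "1990-03-08", "gender": "M"},
]


def quasi_identifier(record):
    """The three fields that the two data sets have in common."""
    return (record["zip"], record["birth_date"], record["gender"])


def link_records(medical, voters):
    """Return (name, diagnosis) pairs where exactly one voter matches a record."""
    names_by_key = defaultdict(list)
    for voter in voters:
        names_by_key[quasi_identifier(voter)].append(voter["name"])

    matches = []
    for record in medical:
        names = names_by_key.get(quasi_identifier(record), [])
        if len(names) == 1:  # a unique match re-identifies the record
            matches.append((names[0], record["diagnosis"]))
    return matches


matches = link_records(medical_records, voter_roll)
print(matches)
```

In this toy data only the first medical record is re-identified: the second matches two voters, so it stays ambiguous, and the third matches nobody.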

So much for anonymity. Clearly, this technique isn't enough to protect our privacy. Differential privacy, on the other hand, neutralizes these types of attacks. To explain how it works, let's assume that we want to find out how many people do something embarrassing, for example picking their nose. To do that, we set up a survey with the question "Do you pick your nose?" and Yes and No buttons below it, and we collect all the answers on a server somewhere.

But instead of sending the real answers, we're going to introduce some noise. Let's say that Bob is a nose picker and that he clicks on the Yes button. Before we send his response to the server, our differential privacy algorithm flips a coin. If it's heads, the algorithm sends Bob's real answer to our server. If it's tails, the algorithm flips a second coin and sends Yes if that one is tails or No if it's heads. Back on our server, we see the data coming in, but because of the added noise, we can't really trust individual records.
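As a minimal sketch, the coin-toss procedure could look like this in Python. It assumes both coins are fair, and the function name randomized_response and the fixed seed are choices made for this example only.

```python
import random


def randomized_response(true_answer: bool, rng: random.Random) -> bool:
    """Return the answer that is actually sent to the server (True means Yes)."""
    # First coin: heads means the real answer is sent.
    first_is_heads = rng.random() < 0.5
    if first_is_heads:
        return true_answer
    # First coin was tails: a second coin decides the answer at random.
    second_is_tails = rng.random() < 0.5
    return second_is_tails  # tails -> Yes, heads -> No


rng = random.Random(42)  # fixed seed so the example is repeatable

# Simulate many users who, like Bob, would truthfully answer Yes.
reports = [randomized_response(True, rng) for _ in range(10_000)]
yes_share = sum(reports) / len(reports)
print(f"Share of these users recorded as Yes: {yes_share:.1%}")
```

With fair coins, roughly three quarters of these users end up recorded as Yes and the remaining quarter as No, even though all of them would truthfully answer Yes.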

Our record for Bob might say that he's a nose picker, but someone who never picks their nose still has a one in four chance of being recorded as a Yes, so Bob's answer may simply be the effect of the coin tosses that the algorithm performed. This is plausible deniability. You can't be sure of people's answers, so you can't judge them on those answers. This is particularly interesting if you're collecting data about illegal behavior, such as drug use.
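The deniability can be made concrete with a short calculation, assuming both coins are fair:

```latex
\begin{align*}
\Pr[\text{report Yes} \mid \text{truth is Yes}] &= \tfrac{1}{2} + \tfrac{1}{2} \cdot \tfrac{1}{2} = \tfrac{3}{4} \\
\Pr[\text{report Yes} \mid \text{truth is No}]  &= \tfrac{1}{2} \cdot \tfrac{1}{2} = \tfrac{1}{4} \\
\frac{\Pr[\text{report Yes} \mid \text{truth is Yes}]}{\Pr[\text{report Yes} \mid \text{truth is No}]} &= 3
\end{align*}
```

A recorded Yes is therefore only three times more likely to come from someone whose true answer is Yes than from someone whose true answer is No, which is why a single record proves little about the person behind it.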

Now, because you know how the noise is distributed, you can compensate for it and end up with a fairly accurate view of how many people are actually nose pickers. Of course, the coin-toss algorithm is just an example and a bit too simple. Real-world algorithms often use the Laplace distribution to spread data over a larger range and increase the level of anonymity. In The Algorithmic Foundations of Differential Privacy, it is noted that differential privacy promises that the outcome of a survey will stay essentially the same whether or not you participate in it.
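Here is a rough sketch of both ideas in Python, under stated assumptions. For the coin-toss scheme with fair coins, the observed share of Yes answers equals 0.5 times the true share plus 0.25, so the server can invert that formula. The Laplace part shows one common way that distribution is used, namely adding noise to a count, with a privacy parameter conventionally called epsilon. The function names, the population of 100,000 users with a 30% true rate, and the value of epsilon are all invented for the example and do not describe any particular company's implementation.

```python
import random


def randomized_response(true_answer: bool, rng: random.Random) -> bool:
    """Coin-toss scheme from the text: truth on heads, random answer on tails."""
    if rng.random() < 0.5:       # first coin is heads
        return true_answer
    return rng.random() < 0.5    # second coin: tails -> Yes, heads -> No


def estimate_true_yes_rate(reports: list[bool]) -> float:
    """Undo the known noise, using observed = 0.5 * true + 0.25."""
    observed = sum(reports) / len(reports)
    estimate = 2 * observed - 0.5
    return min(1.0, max(0.0, estimate))  # keep the estimate a valid share


def laplace_noisy_count(true_count: int, epsilon: float, rng: random.Random) -> float:
    """Add Laplace noise with scale 1/epsilon to a count (smaller epsilon, more noise)."""
    # The difference of two exponential samples with rate epsilon
    # follows a Laplace distribution with scale 1/epsilon.
    noise = rng.expovariate(epsilon) - rng.expovariate(epsilon)
    return true_count + noise


rng = random.Random(7)

# Hypothetical population: 30,000 of 100,000 users would truthfully say Yes.
true_answers = [i < 30_000 for i in range(100_000)]
reports = [randomized_response(answer, rng) for answer in true_answers]

estimated_rate = estimate_true_yes_rate(reports)
print(f"Estimated share of nose pickers: {estimated_rate:.1%}")

noisy_count = laplace_noisy_count(sum(true_answers), epsilon=0.5, rng=rng)
print(f"Noisy count of nose pickers: {noisy_count:.1f}")
```

The estimate lands close to the true 30% even though no individual report can be trusted. Note that the two techniques add noise in different places: the coin tosses run on each user's device, while the Laplace sketch assumes the server already holds the true count.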

Therefore, you don't have any reason not to participate in the survey. You don't have to fear that your data, in this case your nose-picking habits, will be exposed. All right, so now we know what differential privacy is and how it works.

Who is already using it?

Apple and Google are two of the biggest companies that currently use it. Apple started rolling out differential privacy in iOS 10 and macOS Sierra. They use it to collect data on which websites use a lot of power, which images are used in a certain context, and which words people type that aren't in the keyboard's dictionary. Apple's implementation of differential privacy is documented but not open source.

Google, on the other hand, has been developing an open-source library for this. They use it in Chrome to do studies on browser malware and in Maps to collect data about traffic in large cities. But overall, not many companies have adopted differential privacy, and those that have use it for only a small percentage of their data collection.

Why is that?

Well, for starters, differential privacy is only usable for large data sets because of the injected noise. Using it on a tiny data set will likely result in inaccurate data. And then there is also the complexity of implementing it. It's a lot more difficult to implement differential privacy than to just report the real data of users and anonymize it in the old-fashioned way.
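To get a feel for why small data sets are a problem, here is a back-of-the-envelope sketch for the coin-toss scheme described earlier. It assumes fair coins and uses the standard error of a proportion. The 30% true rate and the function name estimate_standard_error are assumptions made for the example.

```python
import math


def estimate_standard_error(true_rate: float, n: int) -> float:
    """Typical error of the de-noised estimate when n users respond."""
    # With fair coins, the share of Yes reports is 0.5 * true_rate + 0.25.
    observed_rate = 0.5 * true_rate + 0.25
    # Standard error of the observed share, doubled because the
    # estimate is computed as 2 * observed - 0.5.
    return 2 * math.sqrt(observed_rate * (1 - observed_rate) / n)


for n in (100, 10_000, 1_000_000):
    error = estimate_standard_error(0.30, n)
    print(f"{n:>9,} users: estimate is off by about {error:.2%}")
```

With 100 users the estimate is typically off by around 10 percentage points, which makes it close to useless, while with a million users the typical error drops to around a tenth of a percentage point.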

So the bottom line is that differential privacy can help companies learn more about a group of users without compromising the privacy of any individual within that group. Adoption is still limited, however, but it's clear that there is an increasing need for ways to collect data about people without compromising their privacy.
