Netflix Inc. got lots of publicity from its million-dollar Netflix prize contest, as researchers vied “to improve the accuracy of predictions about how much someone is going to enjoy a movie based on their movie preferences.”
Algorithms were developed using 480,000 records of customers and their associated ratings of movies, along with the date of each rating and the title and year of release of the rated movie. No other personal data was included; that is, it was supposedly anonymized, with customer names and other identification replaced with randomly assigned IDs.
The key word here is “supposedly.” Researchers at the University of Texas at Austin wrote a paper in which they claim that it is sometimes possible to de-anonymize data by correlating it with other datasets tied to a person’s real-world identity, such as a profile on the Internet Movie Database (IMDb).
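To make the idea of correlating datasets concrete, here is a minimal sketch in Python. It is not the researchers' actual method: the records, the names `match_score` and `best_match`, and the tolerance values are all made up for illustration. It simply counts how many entries in a public, named profile line up (roughly the same rating, roughly the same date) with each anonymized record, and reports a match only when one record clearly stands out.

```python
from datetime import date

# Toy, made-up data: anonymized ID -> {movie title: (rating, date rated)}
anonymized = {
    "id_1001": {
        "Brazil": (5, date(2005, 3, 1)),
        "Heat": (3, date(2005, 3, 4)),
        "Alien": (4, date(2005, 4, 2)),
    },
    "id_1002": {
        "Brazil": (2, date(2005, 6, 9)),
        "Fargo": (5, date(2005, 6, 10)),
    },
}

# Toy public profile tied to a real name (for example, reviews posted publicly)
public_profile = {
    "Brazil": (5, date(2005, 3, 3)),
    "Alien": (4, date(2005, 4, 1)),
}


def match_score(record, profile, max_rating_gap=1, max_days=14):
    """Count profile entries that roughly agree with an anonymized record."""
    score = 0
    for title, (rating, when) in profile.items():
        if title not in record:
            continue
        anon_rating, anon_when = record[title]
        close_rating = abs(anon_rating - rating) <= max_rating_gap
        close_date = abs((anon_when - when).days) <= max_days
        if close_rating and close_date:
            score += 1
    return score


def best_match(records, profile, min_margin=1):
    """Return the best-scoring ID, or None if no record clearly stands out."""
    scored = sorted(
        ((match_score(rec, profile), anon_id) for anon_id, rec in records.items()),
        reverse=True,
    )
    if not scored or scored[0][0] == 0:
        return None
    if len(scored) > 1 and scored[0][0] - scored[1][0] < min_margin:
        return None  # too ambiguous to call
    return scored[0][1]


print(best_match(anonymized, public_profile))
```

Even in this toy case, two loosely matching ratings are enough to single out one record, which is the crux of the concern: the rest of that record's ratings then attach to a real name.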
In response to a lawsuit and an inquiry from the Federal Trade Commission over privacy concerns, Netflix has decided not to go ahead with a sequel to the Netflix Prize. This second contest was to have provided additional data such as renters’ ages, ZIP codes, and genders.
Even without combining information from multiple sources, some types of data are fairly straightforward to tie to a person. When a team at AOL released search data in 2006, The New York Times showed how easy it was to track down one of the searchers.
What surprised me at the time wasn’t the simplicity of the unmasking, but that so many people apparently didn’t do searches that made it obvious who they were. Certainly in my case, I have no doubt that my queries would map to my “true name” with little ambiguity if only because I frequently check to see what comes up if someone searches for me online.
And, even in the absence of such personally identifiable information, just a birthdate, ZIP code, and gender are sufficient to identify something in the neighborhood of 63 to 87 percent of the US population.
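A rough way to see why that combination is so identifying is to count how many records share each combination of fields. The sketch below uses a handful of made-up records, not real population data, and the function names are hypothetical. It also shows, as an assumption about one possible mitigation, how coarsening the fields to birth year and a three-digit ZIP prefix shrinks the share of people who are unique.

```python
from collections import Counter

# Made-up records for illustration: (birthdate, ZIP code, gender)
people = [
    ("1970-02-14", "02139", "F"),
    ("1970-02-14", "02139", "F"),
    ("1970-08-30", "02141", "F"),
    ("1982-11-05", "78701", "M"),
    ("1982-01-23", "78705", "M"),
]


def exact_key(person):
    """Use the full birthdate, ZIP code, and gender as the identifying key."""
    return person


def coarse_key(person):
    """Keep only the birth year, the first three ZIP digits, and gender."""
    birthdate, zip_code, gender = person
    return (birthdate[:4], zip_code[:3], gender)


def fraction_unique(records, key):
    """Share of records whose key is shared with no other record."""
    counts = Counter(key(r) for r in records)
    unique = sum(1 for r in records if counts[key(r)] == 1)
    return unique / len(records)


print(fraction_unique(people, exact_key))   # 0.6
print(fraction_unique(people, coarse_key))  # 0.0
```

The numbers here say nothing about the real population; they only illustrate the mechanism behind the figure quoted above.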
But aggregating data sources, as in the Netflix Prize example, shows how much you can reveal even when you may not think you’re revealing anything at all. The pipl search engine gives some sense of the amount of personal information available for the mining; if you’re like me you may find the results a bit unnerving.
Clearly, common names make it harder to zoom in on a particular individual. Ongoing research may also yield ways to release data sets for research purposes that are effectively anonymized. However, it’s fair to say that as our digital footprints grow, the potential to connect the dots among different parts of that footprint grows as well.
Does it matter? Much of the online commentary seemed to take issue with the researchers, the FTC, and lawyers more than it did with Netflix. I suspect that’s because the data was going to a geekily respectable purpose, improving movie recommendations, rather than, say, to an insurance company or employer looking for reasons to deny coverage or a job.
But it’s worth noting that a federal law, the Video Privacy Protection Act, limits the disclosure of video rental information, so concern about this sort of information becoming public is hardly new or merely academic. When that law was enacted, its purpose was quite narrow -- to keep political opponents or others from using video rental history to embarrass someone. (It was passed partly in reaction to the publication of Robert Bork’s video history during his 1987 Supreme Court nomination process.)
Yet, in today’s interconnected world, such information is not just information in its own right. It’s also a potential window into other aspects of someone’s online identity.
— Gordon Haff, Senior Analyst at Illuminata Inc. on grids/supercomputing
Stopping Netflix from putting together a contest to try to give people better service, because of concerns that are already out of the proverbial barn, seems like a lose-lose situation. We are not any better protected, and we don't get better movie recommendations. Researchers lose the chance to make some money by winning a contest.
Why couldn't they just add rules like only giving birth year and a three-digit ZIP code (the first three digits)? Unless you believe astrology impacts movie choices, you don't need the date, just an age range.
Once something is on the Net it's no longer anonymous.
Much like when I went to High School. I graduated from a High School in a different country, along with barely 300+ in the 20 years the school was open. The school was an international school.
How hard would it be to find out my history there, at least as much history was available? Not very if someone wanted to search back to the mid 1960's.
At the Veteran's hospital I go to there are three people with the same name as myself and one even has the same last four digits as my Social Security number. And this is a VA that is considered small. But there's a picture of me so that's a second identifier. I wonder if the NSA or some other governmental department has my information? Especially since I had a very high security clearance during my years in the military.
I think I'll try pipl and see what it says about me.
Where will all of this lead?!? Right now, every second, someone is sharing something they shouldn't about themselves, their families, and their friends (never mind intel on their business, their government, and other truly sensitive info).
Technological convenience is far outstripping our ability to understand the ultimate impact our online interactions will have. Back in the day (cough - 10 years ago), nearly everyone hid behind nicknames and pseudonyms. Now, we are blogging and twittering, reconnecting with high school friends on Facebook and posting our resumes for anyone to read.
All of this data is immensely valuable information to all kinds of different groups (friends, co-workers, competitors, governments, crime syndicates). We are just now seeing the tip of the iceberg regarding the security (actually insecurity) of all of our tech advances...from thieves breaking into your home because you posted the fact that you were out of town to crime syndicates who have nearly unlimited information on how to steal your identity.
So, yea, this social networking craze is absolutely amazing - but we really should be worried where it will all lead because most of it is scary.
You had me at pipl search. Not only is a lot available online, it can be neatly organized for easy reading. I knew it was easy to find information online, I just wasn't aware it was that easy. The pipl search site could be used as an example to younger web users who aren't yet aware so much about them is easily accessible. Online reputation management services might have a big boom during the next several years as more people learn how easily they can be found online.
If someone wants to find out some private info about you, they will. Any site that has information about you can only protect you so much. Your friend who takes a snapshot of your private Facebook profile and posts it. The person behind you at the library who snaps a photo with his cell phone of your banking page. The pressure is not on the government to protect us, nor is it solely on the shoulders of the sites...it is our own responsibility to be smart as well. If someone is willing to go to such great lengths as to reverse engineer TCP/IP packets and go through your trash and tape together your shredded documents...well...they are going to get it.
My background has been a matter of public record since my first Security Clearance Background Investigation took place in 1976. Every friend, every group association, every address, and every police contact I have ever had is on the books. And then a second time when I applied for a Federal Firearms License. And I have given you more personal information right here than you will ever find on the internet about me.
They only make the info public in anonymized forms...the problem is that the info can be pieced back together too easily. There's a good satirical clip about government/private companies overstepping and using such info in America: From Freedom to Fascism by Aaron Russo. Funnily enough, it doesn't look like Netflix has his film :-)
A lot depends on how common your name is. My name, for example, appears to be the only Gordon Haff with any Web presence. And even if there's some ambiguity it's usually not hard to narrow down the choices based on rough location. So what? Maybe nothing but having access to all that data at least raises the possibility of tying that known identity to other, supposedly, private/anonymous information.
The ThinkerNet does not reflect the views of TechWeb. The ThinkerNet is an informal means of communication to members and visitors of the Internet Evolution site. Individual authors are chosen by Internet Evolution to blog. Neither Internet Evolution nor TechWeb assume responsibility for comments, claims, or opinions made by authors and ThinkerNet bloggers. They are no substitute for your own research and should not be relied upon for trading or any other purpose.