Remember “anonymized” data? The kind retail customers felt OK about sharing, because it couldn’t be tracked back to a specific individual? Turns out, according to the very bright folks at MIT, that anonymized is quite different from anonymous.
Researchers at MIT studied three months of credit card records for 1.1 million people and published the results today in the journal Science. For 40% of people, it took only two data points, without price information, to identify a customer. Five data points were enough to ID virtually everyone.
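To make the idea concrete, here is a minimal sketch of how one might estimate the share of customers who can be singled out from a handful of known purchases. It is not the researchers' code: the function names, the (shop, day) representation of a purchase and the toy data are all assumptions made for illustration.

```python
import random


def matching_users(traces, known_points):
    """Return the users whose purchase trace contains every known point."""
    return [user for user, trace in traces.items() if known_points <= trace]


def estimate_unicity(traces, num_points, seed=0):
    """Estimate the fraction of users pinned down by num_points random purchases.

    traces maps a (hypothetical) user ID to a set of (shop, day) tuples.
    For each user, we pretend an outsider knows num_points of their
    purchases and check whether anyone else's trace also fits.
    """
    rng = random.Random(seed)
    eligible = [u for u, t in traces.items() if len(t) >= num_points]
    if not eligible:
        return 0.0
    unique = 0
    for user in eligible:
        # sorted() gives rng.sample a sequence with a stable order
        known = set(rng.sample(sorted(traces[user]), num_points))
        if matching_users(traces, known) == [user]:
            unique += 1
    return unique / len(eligible)


# Toy data: three customers, no names attached, only shops and days.
traces = {
    "u1": {("bakery", 1), ("cafe", 2), ("grocer", 3)},
    "u2": {("bakery", 1), ("cafe", 2), ("cinema", 4)},
    "u3": {("bakery", 1), ("gym", 2), ("grocer", 3)},
}

print(matching_users(traces, {("bakery", 1)}))                 # everyone fits
print(matching_users(traces, {("cafe", 2), ("grocer", 3)}))    # only one fits
print(estimate_unicity(traces, 3))
```

In this toy set, one purchase at the bakery fits all three customers, but a second known purchase is already enough to single out one of them. That is the effect the study measured at scale.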
The ability to preserve anonymity in large data sets is important, because aggregated digital data can be a rich source of customer insights. Retailers analyzing anonymized credit-card histories could learn about customer tastes, trend information, seasonal variations and even weekly shopping rhythms. But re-identification is dangerous, and the risk of it will make customers less willing to give up any personal information and more likely to pay with cash.
What about coarsening the data, making it intentionally vague? The MIT researchers examined whether blurring the records could protect privacy while still leaving enough detail for useful analysis. The results were not encouraging. According to MIT News: “Even if the data set characterized each purchase as having taken place sometime in the span of a week at one of 150 stores in the same general areas, four purchases (with 50 percent uncertainty about price) would still be enough to identify more than 70 percent of users.” In other words, a data set’s lack of names, home addresses, phone numbers or other obvious identifiers does not make it safe to release publicly.
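A rough sketch of what coarsening looks like in practice: replace each shop with a broader area and each day with the week it falls in, then count how many customers still fit a known purchase. The helper names, the shop-to-area mapping and the toy data are hypothetical, chosen only to illustrate the idea, and do not reflect the study's actual method.

```python
def coarsen_trace(trace, shop_to_area, days_per_bin=7):
    """Blur each (shop, day) purchase into an (area, week) pair."""
    return {(shop_to_area[shop], day // days_per_bin) for shop, day in trace}


def count_matches(traces, known_points):
    """Count how many customers' traces contain every known point."""
    return sum(1 for trace in traces.values() if known_points <= trace)


# Hypothetical mapping from individual shops to general areas.
shop_to_area = {"bakery": "north", "cafe": "north", "gym": "south"}

# Toy data: customer ID -> set of (shop, day) purchases.
traces = {
    "u1": {("bakery", 1), ("gym", 9)},
    "u2": {("cafe", 3), ("gym", 12)},
    "u3": {("gym", 2), ("gym", 10)},
}

coarse = {user: coarsen_trace(t, shop_to_area) for user, t in traces.items()}

# Exact data: one bakery visit on day 1 fits a single customer.
print(count_matches(traces, {("bakery", 1)}))    # 1

# Coarsened data: "somewhere north, in week 0" now fits two customers.
print(count_matches(coarse, {("north", 0)}))     # 2

# A second coarse point narrows the field again.
print(count_matches(coarse, {("south", 0), ("south", 1)}))  # 1
```

Coarsening does widen the crowd each purchase could belong to, but every extra known purchase shrinks it again, which is why the researchers found that vagueness alone buys only limited protection.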
Credit card data has great potential for retailers and is worth gathering, but care must be taken to avoid re-identification. The study highlights the weaknesses of the standard methods many companies use to anonymize their records. And it will certainly add fuel to the fire for those concerned about the consumer-tracking processes employed by advertising software and analytics companies.