Open Data, and the Myth of Masking Identity

The history of unmasking people behind data is long and storied: forgers, spies, and fakes have been exposed time and again. Yet we keep wandering into the trap of believing that discernible clues don’t linger in small details.

Currently, the explosion of available data has led the scientific community to embrace open data (reasonable), with little thought for the consequences (unreasonable), while casually asserting that personal data can be de-identified (reckless).

In the 1990s, the governor of Massachusetts allowed anonymized voter rolls and health data to be released. Unfortunately, by matching three fields (ZIP code, birthdate, and sex) across the two datasets, researchers were able to quickly determine that the governor had a chronic health condition he’d previously kept secret.
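To make the mechanics concrete, here is a minimal sketch of that kind of linkage, assuming two small tables that share the three fields. Every record, name, ZIP code, and field name below is invented for illustration; none of it comes from the actual Massachusetts data.

```python
from collections import defaultdict

# Hypothetical, fabricated records. The "de-identified" health table has no names.
health_records = [
    {"zip": "00001", "birthdate": "1950-01-15", "sex": "M", "diagnosis": "condition A"},
    {"zip": "00001", "birthdate": "1962-07-04", "sex": "F", "diagnosis": "condition B"},
    {"zip": "00002", "birthdate": "1950-01-15", "sex": "M", "diagnosis": "condition C"},
]

# Hypothetical public voter roll: names plus the same three fields.
voter_roll = [
    {"name": "Pat Example", "zip": "00001", "birthdate": "1950-01-15", "sex": "M"},
    {"name": "Alex Doe", "zip": "00001", "birthdate": "1962-07-04", "sex": "F"},
    {"name": "Jo Roe", "zip": "00001", "birthdate": "1962-07-04", "sex": "F"},
    {"name": "Sam Sample", "zip": "00002", "birthdate": "1971-03-09", "sex": "F"},
]

QUASI_IDENTIFIERS = ("zip", "birthdate", "sex")


def key(record):
    """Build the join key from the three shared fields."""
    return tuple(record[field] for field in QUASI_IDENTIFIERS)


def link(health, voters):
    """Return (name, diagnosis) pairs where exactly one voter matches a health record."""
    names_by_key = defaultdict(list)
    for voter in voters:
        names_by_key[key(voter)].append(voter["name"])

    matches = []
    for row in health:
        names = names_by_key.get(key(row), [])
        if len(names) == 1:  # an unambiguous match is a re-identification
            matches.append((names[0], row["diagnosis"]))
    return matches


reidentified = link(health_records, voter_roll)
print(reidentified)
```

In this toy data only one health record links to a single voter. The second record matches two voters and stays ambiguous, and the third matches nobody.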

This is all brought up to the present day in an excellent essay in the most recent issue of NEJM by Joel Schwartz, who tackles the EPA’s misguided, cynical, and calculated gambit to base policy only on publicly available data.

Schwartz provides an overview of the ways in which previous attempts to mask data have failed. Studies that include geocodes are particularly vulnerable. A newspaper article about victims of Hurricane Katrina included only a map of locations of deaths, with no roads included. Researchers were able to correctly identify the actual address for most of the people who died.

Studies that control for multiple variables (common in medical and public health research) often include data that is geocoded secondarily, such as socioeconomic status, race, or housing value, so it’s possible to identify each individual’s census tract. Continuous confounders (e.g., age) only make it easier to identify individuals.
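A rough way to see why extra covariates hurt is to count how many records are the only one with their particular combination of values. The sketch below assumes a handful of fabricated records with hypothetical field names (`tract`, `income_band`, `age`); it is an illustration of the idea, not a method taken from Schwartz’s essay.

```python
from collections import Counter

# Fabricated study records; all field names and values are hypothetical.
records = [
    {"tract": "T1", "income_band": "low", "age": 34},
    {"tract": "T1", "income_band": "low", "age": 51},
    {"tract": "T1", "income_band": "high", "age": 34},
    {"tract": "T2", "income_band": "low", "age": 47},
    {"tract": "T2", "income_band": "low", "age": 62},
    {"tract": "T2", "income_band": "low", "age": 47},
]


def unique_count(rows, fields):
    """Count rows whose combination of `fields` is shared with no other row."""
    combos = Counter(tuple(row[f] for f in fields) for row in rows)
    return sum(1 for row in rows if combos[tuple(row[f] for f in fields)] == 1)


coarse = unique_count(records, ("tract", "income_band"))
with_age = unique_count(records, ("tract", "income_band", "age"))
print(coarse, with_age)
```

With only the two categorical fields, one of the six records is unique. Adding age raises that to four of six, which is the sense in which a continuous covariate makes individuals easier to single out.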

An NAS study reported on an “experiment to discover whether confidentiality could be preserved while opening . . . data for public review.” The experiment found that even after everything not needed for other scientists to replicate a study’s basic findings had been deleted, investigators could still identify the participants.

HIPAA is the gold standard of privacy in medicine and health care information, yet a study examining the identifiability of records from Northern California found that even de-identified HIPAA-compliant data was identifiable more than 25% of the time.

Schwartz’s essay is worth reading, as it’s clear the EPA is mimicking a tobacco industry gambit to undermine research by pushing for “transparency.” Given GDPR and Canadian privacy laws, as well as researchers’ prudence about releasing personally identifiable data, this tactic vastly shrinks the pool of data available for making policy, which is exactly what anti-science groups want.

As for reanalysis, Schwartz smartly writes:

. . . the “gold standard” of science is not reanalysis, but replication. . . . Of what value . . . is a reanalysis of a minimal subset of covariates from any given study — particularly if it can’t control for important covariates?

The longer we casually or actively promulgate the myth that open data is a pure good, and that nefarious uses of the rhetoric or the actual data are neither occurring nor all too feasible, the more unnecessary risk we court.

Open science advocates may have their hearts in the right place. It’s time to get their heads in the game now.