Data Privacy via Integration of
Differential Privacy and Data Synthesis

My Three Minute Thesis presentation, which won the People's Choice award

Lay Abstract: As the era of information and technology continues to expand, big data offers tremendous benefits for education, economics, medical research, national security, and other areas through data-driven decision making, insight discovery, and process optimization. However, analyzing big data – data with high volume, high velocity, and/or high variety that requires new forms of processing – poses significant challenges. One crucial concern is the extreme risk of exposing the personal information of the individuals who contribute to the data when it is shared among collaborators or released publicly. An intruder could identify a participant by linking his or her record to other available information within and beyond the dataset. For instance, in 2006, AOL released over 20 million Web search queries of 657,000 "anonymous" users to the scientific research community. Within a few days of the release, The New York Times was able to identify one of the users (Gehrke, 2012). Knowing that this kind of invasion is possible, some research participants provide inaccurate information about themselves, or none at all. Furthermore, government agencies such as the U.S. Department of Labor are required to release statistical material, including education and health data, while keeping it fully confidential. While big data has big rewards, there is great concern over whether personal information can ever be kept private.

Statistical disclosure limitation (SDL) has become a popular approach to addressing confidentiality issues. SDL protects data privacy with well-founded statistical techniques while minimizing the information loss caused by data perturbation, so that valid and integrated statistical analysis remains possible. Simple provisional SDL approaches to data confidentiality include deleting identifiers, top-coding, and aggregation. However, these techniques either do not provide enough protection or distort the raw data, especially for big data, resulting in invalid statistical inferences. For example, Washington State sells patient-level health data for $50 that includes hospitalizations occurring in the state during a particular year, patient demographics, diagnoses, procedures, billing information, and more. The state removed all identifying personal information such as names and addresses. Sweeney (2013) nevertheless linked information from newspaper articles and public records to the healthcare database and correctly identified dozens of patients.

Beyond linking two or more public databases to identify people, the management and analysis of big data create high demands on storage and computation and impose unique data confidentiality challenges. For example, protecting relationship data (such as Facebook and Twitter social data) is arguably harder than protecting ordinary data, since the sensitive information of a target person might be obtained through his or her relations with other people. Another complication is the risk of publishing sequential data over time – a common feature of big data. Unauthorized third parties may link individuals across several releases and infer sensitive information that they could not obtain from any single release.
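To make one of the simple SDL techniques above concrete, here is a minimal sketch of top-coding: extreme values are capped at a threshold so that outliers cannot single out an individual. The variable names, income values, and the cap of 150,000 are illustrative assumptions, not figures from any real release.

```python
def top_code(values, cap):
    """Top-coding: replace any value above the cap with the cap itself,
    so unusually large (and therefore identifying) values are hidden."""
    return [min(v, cap) for v in values]

# A hypothetical income column; the last record is an identifying outlier.
incomes = [32000, 48000, 55000, 1200000]
print(top_code(incomes, cap=150000))  # -> [32000, 48000, 55000, 150000]
```

The trade-off discussed in the paragraph is visible here: the outlier is protected, but any statistic computed on the capped column (the mean, for instance) is now distorted relative to the raw data.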

In combination with SDL, I use differential privacy to address these issues. The idea is "plausible deniability": an SDL method based on differential privacy creates a sanitized, or altered, dataset that contains either a particular person or a pretend version of that person, allowing anyone in the real dataset to claim that he or she is not part of it. Any analysis of the sanitized data should yield nearly the same results as an analysis of the original data, so researchers get almost the same answers as if they had access to the original dataset. While this approach resolves several big data issues, the challenge – and my research – is to find an SDL algorithm that satisfies differential privacy, balancing a high level of privacy protection for participants with the statistical integrity of the dataset for researchers.
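The best-known mechanism satisfying differential privacy is the Laplace mechanism of Dwork et al. (2006): a numeric query answer is perturbed with Laplace noise scaled to the query's sensitivity divided by the privacy parameter epsilon. The sketch below illustrates that idea on a toy count query; the function name, the toy data, and the parameter choices are assumptions for illustration, not my thesis algorithm.

```python
import numpy as np

def laplace_mechanism(true_answer, sensitivity, epsilon, rng=None):
    """Return a differentially private version of a numeric query answer.

    Laplace noise with scale = sensitivity / epsilon is added: a smaller
    epsilon gives stronger privacy but a noisier answer, which is exactly
    the privacy-vs-utility balance described above.
    """
    rng = rng or np.random.default_rng()
    return true_answer + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

# Toy example: privately release "how many people are 40 or older?"
ages = [23, 35, 47, 52, 61, 29]
true_count = sum(a >= 40 for a in ages)  # 3 people aged 40+
# A count changes by at most 1 when one person is added or removed,
# so its sensitivity is 1.
private_count = laplace_mechanism(true_count, sensitivity=1, epsilon=1.0)
print(private_count)
```

Because the noise is random, the released count differs from the true count, yet on average it stays close to it, giving each participant plausible deniability about whether his or her record was present.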

References:
1. Gehrke, J. (2012), "Quo vadis, data privacy?"
2. Sweeney, L. (2013), "Matching Known Patients to Health Records in Washington State Data"
3. Dwork, C., et al. (2006), "Calibrating Noise to Sensitivity in Private Data Analysis"