1. What is Tabular Synthetic Data:
Tabular synthetic data is artificially generated data that reproduces the statistical properties of real data; it is created by training and sampling from advanced machine learning models such as GANs or LLMs. Because it mimics real-world records and preserves data utility without revealing any Personally Identifiable Information (PII), it is a safer and more useful alternative to traditional data anonymization techniques, which carry re-identification risks and limit data utility. However, many business executives are still sceptical of incorporating synthetic data into their enterprise systems. Some common questions we have heard in our meetings with top-tier banks, MNCs, and government organizations are:
- How do you ensure that the synthetic data retains its utility without compromising privacy?
- Can you demonstrate the privacy risks in your synthetic data generation process?
- What steps do you take to comply with privacy regulations like the PDPA, GDPR, etc.?
- How do you ensure that individuals in the original dataset cannot be re-identified from the synthetic data?
- What privacy metrics do you use to evaluate the risk of privacy breaches in synthetic data?
And so on. A clear theme is the concern for protecting individual and data privacy, and the need to be certain that synthetic data protects privacy while maintaining data utility. We will cover data utility in another blog; this one is dedicated to understanding how synthetic data protects data privacy and how we can empirically prove its effectiveness using privacy scoring metrics.
2. How Does Synthetic Data Protect Data Privacy:
As you know, tabular synthetic data is artificial data, commonly generated by GANs or LLMs. These models are first trained on real data to learn its statistical properties, such as distributions, correlations, covariances, outliers, anomalies, patterns, and structures. Once trained, the model generates real-looking synthetic data with the same statistical patterns but with no PII linking back to any real individual, making the data far easier, faster, and safer to use and share. This helps enterprises protect individual privacy, greatly reducing the exposure that comes with data breaches and hacks, and with it the risk of legal action and hefty fines.
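To make this concrete, here is a minimal sketch of the idea using a simple Gaussian copula rather than a GAN or LLM: fit the marginal distributions and correlations of a numeric real table, then sample entirely new rows from the fitted model. The column names and data are illustrative assumptions, and real generators handle categorical columns, constraints, and much more.

```python
# A minimal sketch (not a production generator): learn marginals and
# correlations from a numeric real table, then sample brand-new rows.
import numpy as np
import pandas as pd
from scipy import stats

def fit_and_sample(real: pd.DataFrame, n_rows: int, seed: int = 0) -> pd.DataFrame:
    rng = np.random.default_rng(seed)
    cols = list(real.columns)
    # 1. Map each column to (0, 1) via its empirical CDF (average ranks).
    ranks = real.rank(method="average") / (len(real) + 1)
    # 2. Transform to standard-normal space and estimate the correlation matrix.
    z = stats.norm.ppf(ranks.to_numpy())
    corr = np.corrcoef(z, rowvar=False)
    # 3. Sample new latent rows that follow the learned correlation structure.
    z_new = rng.multivariate_normal(np.zeros(len(cols)), corr, size=n_rows)
    u_new = stats.norm.cdf(z_new)
    # 4. Map back to each column's original scale via its empirical quantiles.
    synthetic = {c: np.quantile(real[c].to_numpy(), u_new[:, i]) for i, c in enumerate(cols)}
    return pd.DataFrame(synthetic)

rng = np.random.default_rng(42)
real = pd.DataFrame({"age": rng.normal(40, 10, 500), "income": rng.lognormal(10, 0.5, 500)})
synthetic = fit_and_sample(real, n_rows=500)
print(synthetic.describe())  # similar marginals and correlations, but no copied rows
```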
3. Assessing Privacy Risks:
Assessing the privacy risk of a dataset involves evaluating the likelihood that individuals within the dataset can be identified or that sensitive information could be exposed. A key aspect of this assessment is distinguishing between population-level risks, which refer to generic information about the dataset as a whole, and individual-level risks, which focus on the potential exposure of specific individuals in the dataset. A typical assessment covers:
- Data content: whether the dataset contains personally identifiable information (PII) or sensitive data such as health or financial records.
- Indirect identifiers: attributes like birthdates or zip codes that pose a re-identification risk if they can be combined with other datasets.
- Record uniqueness and external data: how unique individual records are, and whether external data is available that could be linked with the dataset.
- Data sharing and access controls: who can access the data, and whether proper data use agreements are in place.
- Privacy risk quantification: models such as differential privacy that can provide mathematical guarantees against re-identification.
- Data minimization and retention: collecting and processing only the data that is necessary, and not retaining it longer than needed.
- Privacy Impact Assessment (PIA): formally documenting potential risks and outlining mitigation measures.
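As a starting point, a quick automated screen of the table's contents can flag obvious direct identifiers and highly unique columns for closer review. This is an illustrative sketch only: the keyword list and column names are assumptions, and it does not replace a proper assessment.

```python
# A minimal screen of a table's data content: flag columns whose names look
# like direct identifiers and report per-column uniqueness (a rough signal
# for quasi-identifiers). Keyword list and columns are illustrative.
import pandas as pd

DIRECT_ID_KEYWORDS = ["name", "email", "phone", "nric", "ssn", "passport"]

def screen_columns(df: pd.DataFrame) -> pd.DataFrame:
    rows = []
    for col in df.columns:
        uniqueness = df[col].nunique(dropna=True) / max(len(df), 1)
        rows.append({
            "column": col,
            "looks_like_direct_id": any(k in col.lower() for k in DIRECT_ID_KEYWORDS),
            "uniqueness_ratio": round(uniqueness, 3),  # near 1.0 -> potential identifier
        })
    return pd.DataFrame(rows)

df = pd.DataFrame({
    "full_name": ["Alice Tan", "Bob Lee", "Carol Lim"],
    "zip_code": ["039393", "059413", "039393"],
    "diagnosis": ["flu", "asthma", "flu"],
})
print(screen_columns(df))
```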
4. Key Privacy Risks in Tabular Synthetic Data:
a. Singling Out Risk:
Singling out risk is the possibility of an individual being uniquely identified or distinguished within a dataset, even when direct identifiers (such as names or social security numbers) have been removed. Generally, this happens by cross-referencing indirect but related attributes (quasi-identifiers) in the primary dataset, such as ZIP code, age, or occupation, with other data sources.
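One simple way to quantify this risk is to measure how many records are unique on a chosen set of quasi-identifiers, i.e. sit in a group of size k = 1. The sketch below is illustrative; the columns and the quasi-identifier choice are assumptions.

```python
# A minimal singling-out / uniqueness check on quasi-identifiers.
import pandas as pd

def singling_out_rate(df: pd.DataFrame, quasi_identifiers: list[str]) -> float:
    """Share of records that are unique (group size k = 1) on the quasi-identifiers."""
    group_sizes = df.groupby(quasi_identifiers, dropna=False).size()
    unique_groups = int((group_sizes == 1).sum())
    return unique_groups / len(df)

df = pd.DataFrame({
    "zip_code": ["039393", "039393", "059413", "068812"],
    "age": [34, 34, 51, 29],
    "occupation": ["teacher", "teacher", "nurse", "engineer"],
})
print(singling_out_rate(df, ["zip_code", "age", "occupation"]))  # 0.5: half the records are unique
```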
b. Linkability Risk:
Not to be confused with singling out risk, which uses quasi-identifiers within a single dataset (cross-referenced against other sources) to uniquely identify an individual, linkability risk arises from connecting or linking different datasets to identify an individual. For example, linking two datasets, one containing non-protected records such as patient addresses, ZIP codes, and ID numbers, and the other containing protected data about medical conditions, can help build a complete profile of an individual, even if they are not directly identified in any single dataset.
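The toy example below shows how joining two individually "harmless" tables on shared quasi-identifiers rebuilds a sensitive profile; all names and values are made up.

```python
# A toy illustration of linkability: joining two separately "safe" datasets on
# shared quasi-identifiers re-attaches identities to sensitive attributes.
import pandas as pd

demographics = pd.DataFrame({          # e.g. a public or weakly protected file
    "zip_code": ["039393", "059413"],
    "birth_year": [1990, 1972],
    "name": ["Alice Tan", "Bob Lee"],
})
medical = pd.DataFrame({               # e.g. a "de-identified" clinical extract
    "zip_code": ["039393", "059413"],
    "birth_year": [1990, 1972],
    "diagnosis": ["asthma", "diabetes"],
})

# Linking on quasi-identifiers rebuilds a name-to-diagnosis profile.
linked = demographics.merge(medical, on=["zip_code", "birth_year"])
print(linked[["name", "diagnosis"]])
```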
c. Inference Attacks:
Inference attacks are privacy attacks where someone deduces or infers sensitive information by exploiting patterns, relationships, or other publicly available data, even in synthetic datasets (especially partially synthetic datasets). These attacks use indirect exploitation, linking non-sensitive data to infer private details. Inference attacks include attribute inference (deducing unknown attributes like medical conditions from demographics), record linkage (linking anonymized records to other datasets), homogeneity attacks (inferring sensitive information from uniform data groups), background knowledge attacks (using prior knowledge to identify individuals), and differential attacks (comparing query results to reveal hidden information).
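As one concrete illustration, a homogeneity check asks whether the sensitive attribute is uniform within groups that share the same quasi-identifiers; if it is, anyone who knows a person's quasi-identifiers can infer their sensitive value exactly. The columns below are illustrative assumptions.

```python
# A minimal homogeneity-attack check: find quasi-identifier groups where the
# sensitive attribute takes only one value, so it can be inferred with certainty.
import pandas as pd

def homogeneous_groups(df: pd.DataFrame, quasi_identifiers: list[str], sensitive: str) -> pd.Series:
    diversity = df.groupby(quasi_identifiers)[sensitive].nunique()
    return diversity[diversity == 1]  # groups whose sensitive value is fully inferable

df = pd.DataFrame({
    "zip_code": ["039393", "039393", "059413", "059413"],
    "age_band": ["30-39", "30-39", "50-59", "50-59"],
    "diagnosis": ["flu", "flu", "asthma", "diabetes"],
})
print(homogeneous_groups(df, ["zip_code", "age_band"], "diagnosis"))
```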
d. Membership Inference:
Membership inference attacks occur when someone attempts to determine whether a specific individual's data was used in the training set of a machine learning model by exploiting the model's behavior, such as its confidence levels or output patterns. By querying the model with particular data points, attackers can infer whether certain data was part of the training set, especially if the model shows higher confidence for familiar data. These attacks are particularly concerning for sensitive datasets, such as medical or financial records, where merely confirming that an individual's record was in the training data is itself a privacy breach.
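The sketch below shows one simple heuristic version of such an attack (a confidence-threshold test rather than a full shadow-model attack): if a model is systematically more confident on its training records than on unseen records, an attacker can exploit that gap to guess membership. The dataset, model, and threshold are illustrative assumptions.

```python
# A minimal confidence-threshold membership inference test.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# Attacker heuristic: use the model's max predicted probability as a membership signal.
conf_members = model.predict_proba(X_train).max(axis=1)
conf_non_members = model.predict_proba(X_test).max(axis=1)

threshold = 0.9  # assumed attacker threshold
member_flag_rate = (conf_members > threshold).mean()
non_member_flag_rate = (conf_non_members > threshold).mean()
print(f"flagged as members: {member_flag_rate:.1%} of true members, "
      f"{non_member_flag_rate:.1%} of non-members")
# A large gap between the two rates means the model leaks membership information.
```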
e. Data Leakage:
Data leakage in machine learning occurs when information from outside the training dataset, such as test set data or future data, is improperly used during model training, leading to inflated performance metrics. Two common types are train-test leakage, where test data influences training, and target leakage, where predictors include information not available at the time of prediction. This results in overfitting and misleading accuracy, as the model performs well on seen data but fails in real-world scenarios due to learning patterns it would not have access to in practice.
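A classic illustration, sketched below, is performing feature selection on the full dataset before splitting: test-set information leaks into training, and a model fit on pure noise appears to perform well above chance. The data and pipeline are illustrative assumptions.

```python
# A minimal train-test leakage demo: the data is pure noise, so an honest
# pipeline should score near chance (~0.5), but leaky preprocessing inflates it.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2000))   # noise features with no real signal
y = rng.integers(0, 2, size=200)   # random labels

# Leaky: select the "best" features using ALL rows, including future test rows.
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
X_tr, X_te, y_tr, y_te = train_test_split(X_leaky, y, random_state=0)
leaky_score = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).score(X_te, y_te)

# Honest: do the selection inside the training fold only.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
honest = make_pipeline(SelectKBest(f_classif, k=20), LogisticRegression(max_iter=1000))
honest_score = honest.fit(X_tr, y_tr).score(X_te, y_te)

print(f"leaky accuracy: {leaky_score:.2f}, honest accuracy: {honest_score:.2f}")
```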
5. Privacy Scoring Metrics for Tabular Synthetic Data
a. Exact Match Score:
An exact match score is used to determine whether two datasets contain identical records for the same individuals or entities. It evaluates how often records in one dataset can be exactly matched to corresponding records in another dataset without errors. A higher exact match score often indicates a potential privacy risk.
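A minimal version of this check, sketched below with made-up data, computes the share of synthetic rows that are exact copies of some real row; the formula used by any particular tool may differ.

```python
# A minimal exact-match check: ideally 0 (or no higher than the duplicate rate
# already present in the real data).
import pandas as pd

def exact_match_rate(real: pd.DataFrame, synthetic: pd.DataFrame) -> float:
    real_rows = set(real.itertuples(index=False, name=None))
    matches = sum(row in real_rows for row in synthetic.itertuples(index=False, name=None))
    return matches / len(synthetic)

real = pd.DataFrame({"age": [34, 51, 29], "zip": ["039393", "059413", "068812"]})
synthetic = pd.DataFrame({"age": [34, 45, 30], "zip": ["039393", "059413", "068812"]})
print(exact_match_rate(real, synthetic))  # 1 of 3 synthetic rows copies a real row
```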
b. Neighbors Privacy Score:
The Neighbors Privacy Score measures the risk of identifying individuals within a dataset by comparing their records to those of their closest "neighbors" or similar entries in terms of attributes. A higher Neighbors Privacy Score indicates a lower likelihood of singling out or identifying individuals based on their attributes, making the dataset more privacy-preserving.
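One common way to build such a score, sketched below with illustrative data, is a distance-to-closest-record (DCR) check: for each synthetic row, measure the distance to its nearest real row and flag distributions concentrated near zero. The exact formula a given tool uses may differ.

```python
# A minimal nearest-neighbour distance (DCR) check between synthetic and real rows.
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

def distance_to_closest_record(real: np.ndarray, synthetic: np.ndarray) -> np.ndarray:
    scaler = StandardScaler().fit(real)          # scale so no single column dominates
    nn = NearestNeighbors(n_neighbors=1).fit(scaler.transform(real))
    distances, _ = nn.kneighbors(scaler.transform(synthetic))
    return distances.ravel()

rng = np.random.default_rng(0)
real = rng.normal(size=(500, 5))
synthetic = rng.normal(size=(500, 5))
dcr = distance_to_closest_record(real, synthetic)
print(f"median DCR: {np.median(dcr):.3f}, share below 0.01: {(dcr < 0.01).mean():.2%}")
```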
c. Membership Inference Score:
The Membership Inference Score measures the probability of determining whether a specific individual's data was used in a machine learning model's training dataset. It evaluates the model's vulnerability to attackers who exploit the model's behavior to infer whether certain data points were used during training. A higher Membership Inference Score indicates a greater risk of privacy breaches, as the model is more likely to expose information about its training data, compromising the confidentiality of individuals in the dataset.
d. Leakage Score:
Leakage Score quantifies the extent to which sensitive information from a dataset has been exposed or leaked through a machine learning model or system. It reflects how much data leakage, such as train-test leakage or target leakage, has occurred, where information that should not be available during model training is inadvertently used, leading to unrealistic model performance. A higher Leakage Score indicates greater information leakage, posing significant privacy risks as it can allow hackers to extract sensitive data or make unauthorized inferences.
e. Proximity Score:
The Proximity Score in data privacy assesses how close an individual's data is to other data records based on key attributes. It measures the risk of re-identification by evaluating how distinct or similar a data point is to its neighbors. A lower Proximity Score suggests that an individual's data is unique, increasing the risk of re-identification, while a higher score indicates that the data is more common or similar to others.
6. Differential Privacy in Tabular Synthetic Data:
Differential privacy was developed to address the failures of traditional de-identification methods, which, as discussed before, carry re-identification risks. Differential privacy uses randomness to perturb data through probability distributions such as the Laplace distribution (double exponential distribution), adding noise to the results of data queries. This randomness is regulated by the epsilon (ε) parameter, which controls the degree of privacy loss; a smaller epsilon ensures stronger privacy by making it harder to detect the presence or absence of any specific individual in the dataset.
Differential privacy works by comparing two nearly identical datasets that differ by just one individual's data: the outcome of any analysis is nearly the same regardless of whether that individual's data is included or excluded, maintaining privacy while still allowing statistical validity and generalizability. Differential privacy has two variants, central differential privacy (CDP), where noise is added after data collection and before analysis, and local differential privacy (LDP), where noise is added at the individual level before collection. The composition property of differential privacy allows it to handle multiple analyses by controlling cumulative privacy loss, while its future-proof guarantee ensures that privacy holds even when additional external information becomes available.
Differential privacy's use of randomness to "blur" data allows for broad generalizability while maintaining privacy: individuals are protected, yet analysts can still gain accurate insights about the population as a whole. Since the privacy of tabular synthetic data is difficult to evaluate quantitatively, differential privacy (DP) provides formal guarantees that synthetic data cannot be traced back to real data or individuals.
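To make the mechanism concrete, the sketch below applies the standard Laplace mechanism to a simple counting query; the epsilon values and data are illustrative, and production DP systems add considerably more machinery (privacy accounting, clipping, and so on).

```python
# A minimal Laplace-mechanism sketch for a counting query. For a count, the
# sensitivity is 1 (one person changes the result by at most 1), so noise drawn
# from Laplace(scale = 1/epsilon) yields an epsilon-differentially-private answer.
import numpy as np

def dp_count(values: np.ndarray, predicate, epsilon: float, seed=None) -> float:
    rng = np.random.default_rng(seed)
    true_count = int(np.sum(predicate(values)))
    sensitivity = 1.0                      # one individual changes a count by at most 1
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_count + noise

ages = np.array([34, 51, 29, 42, 67, 38, 55])
for eps in (0.1, 1.0, 10.0):               # smaller epsilon -> more noise, more privacy
    noisy = dp_count(ages, lambda a: a > 40, eps, seed=0)
    print(f"epsilon={eps:>4}: noisy count of age>40 = {noisy:.1f}")
```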
7. The Future of Privacy Protecting Tabular Synthetic Data:
Simply put, synthetic data is important for two reasons:
- Businesses need real unfiltered data to evolve.
- Businesses cannot use real unfiltered data due to increasing data privacy laws.
When access to high-utility real data is limited, businesses can adopt synthetic data as a viable alternative: high-utility, fast-moving data that looks, feels, and operates like real data. This opens up new avenues for innovation and improvement through the free flow of data for data analysis, product development, AI/ML training, and internal and external sharing. Gartner estimates that by 2030 synthetic data will surpass real data in AI/ML model training, and the trend is already becoming clear. Beyond privacy protection, synthetic data offers benefits that real data simply does not possess, such as the ability to fill in missing data, reduce biases, and create additional data for rare scenarios. Organizations like Apple, Tesla, and Amazon are already using synthetic data, and with continued improvements in generation technologies, the future of privacy-protecting tabular synthetic data looks bright.
8. Synthetic Data Guidelines by PDPC Singapore:
PDPC Singapore, in collaboration with Betterdata, BioMedical Data Architecture & Repository (DAR), MRC, A*STAR, the University of Ottawa, and the National University of Singapore, has published an extensive guide on synthetic data generation techniques and use cases. Read the complete guide here.