Dr. Uzair Javaid

Dr. Uzair Javaid is the CEO and Co-Founder of Betterdata AI, a company focused on Programmable Synthetic Data generation using Generative AI and Privacy Engineering. Betterdata’s technology helps data science and engineering teams easily access and share sensitive customer/business data while complying with global data protection and AI regulations.
Previously, Uzair worked as a Software Engineer and Business Development Executive at Merkle Science (Series A, $20M+), where he developed taint analysis techniques for blockchain wallets.

Uzair has a strong academic background in Computer Science/Engineering, with a Ph.D. from the National University of Singapore (ranked among the top 10 universities in the world). His research focused on designing and analyzing blockchain-based cybersecurity solutions for cyber-physical systems, with a specialization in data security and privacy engineering techniques.

In one of his Ph.D. projects, he reverse engineered the encryption algorithm of the Ethereum blockchain and ethically hacked 670 user wallets. He has been cited 600+ times across 15+ publications in globally reputable conferences and journals, and has received recognition for his work, including a Best Paper Award and scholarships.

In addition to his work at Betterdata AI, Uzair is also an advisor at German Entrepreneurship Asia, providing guidance and expertise to support entrepreneurship initiatives in the Asian region. He has been actively involved in paying-it-forward as well, volunteering as a peer student support group member at National University of Singapore and serving as a technical program committee member for the International Academy, Research, and Industry Association.

Balancing Data Privacy and Data Utility in Synthetic Data

Dr. Uzair Javaid
February 18, 2025

Table of Contents

1. What is Synthetic Data
2. Core Principles of Privacy in Synthetic Data
3. Betterdata PII Detection Philosophy
4. Betterdata’s PII Detection Model for LLMs
5. The Difference Between Synthetic Data and Anonymized Data
6. Balancing Data Privacy and Data Utility (Betterdata’s business-first approach)
7. Privacy Risk Assessment
8. Types of Privacy Risks
9. Synthetic Data Privacy Scoring Metrics
10. Ending Thoughts

1. What is Synthetic Data:

In contrast to real data, which is collected from real-world events through direct measurement or observation, synthetic data is artificially generated to closely replicate the statistical properties, relationships, and patterns of real-world data, allowing it to look, behave, and function like real data. In addition, it is entirely anonymous, in compliance with data protection regulations.

Synthetic data is a product of Generative Artificial Intelligence (GenAI), created using advanced Machine Learning (ML) models such as Generative Adversarial Networks (GANs), Large Language Models (LLMs), Variational Autoencoders (VAEs), and Deep Generative Models (DGMs) more broadly. These models are trained on real datasets to learn their statistical characteristics at a granular level. Once trained, they produce synthetic data that is virtually indistinguishable from the original training data in terms of structure and utility.

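Although Betterdata's own models are proprietary, the basic fit-then-sample workflow can be illustrated with the open-source SDV library. This is a minimal sketch (SDV 1.x API), not Betterdata's pipeline, and the toy dataframe is invented for illustration:

```python
import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer

# Toy "real" table standing in for sensitive customer data.
real_df = pd.DataFrame({
    "age": [34, 58, 41, 29],
    "income": [72000, 121000, 88000, 54000],
    "city": ["SG", "SG", "KL", "SG"],
})

# Describe the table's schema, then learn its distributions and correlations.
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(real_df)
synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(real_df)

# Sample brand-new rows that follow the learned statistics (not copies).
synthetic_df = synthesizer.sample(num_rows=1000)
```
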
Since it does not contain any Personally Identifiable Information (PII) or direct links to real individuals, synthetic data ensures complete anonymity. This makes it easier for organizations to balance data utility and fidelity with stringent privacy laws, offering a robust alternative to traditional data anonymization techniques.

Synthetic data can exist in various forms, including:

  • Tabular data: Rows and columns akin to structured datasets, such as spreadsheets. 
  • Relational data: Multiple datasets that are joined together with Primary Key (PK) and Foreign Key (FK) relationships.
  • Time-series data: Sequential data points that capture trends over time.
  • Text data: Human-like natural language sentences or structured textual information.
  • Image and video data: Pixel-perfect replicas or new creations of visual content.
  • Audio data: Sounds or speech signals designed to replicate or augment existing recordings.

At Betterdata, we have developed state-of-the-art (SOTA) models for tabular and relational data, while models for sequential and text data generation are under development (reach out if you would like to collaborate and/or give us feedback). These models are designed to generate high-quality synthetic data fit for enterprise AI/ML applications and other data-intensive use cases. Contact us to learn more!

2. Core Principles of Privacy in Synthetic Data:

a. Data Anonymization by Design: 

Synthetic data does not simply replicate real-world user records; it creates new ones based on the statistical and behavioral patterns of real records. As such, it excludes all sensitive user or customer information, making it non-traceable both directly and indirectly. Unlike legacy data anonymization techniques such as masking, encryption, or tokenization, which are applied directly to real data, synthetic data generation produces entirely new datasets, minimizing the risk of re-identification.

b. Differential Privacy:

Differential Privacy (DP) was developed to address the shortcomings of traditional de-identification methods, which often suffer from re-identification risks. DP introduces randomness into input data and output queries through probability distributions like the Laplace distribution (double exponential distribution). This randomness is regulated by the epsilon (ε) parameter, which controls the degree of privacy loss. A smaller ε ensures stronger privacy, making it harder to detect the presence or absence of a specific individual in a dataset.

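A minimal sketch of the Laplace mechanism, assuming a counting query whose sensitivity is 1 (adding or removing one person changes the count by at most 1):

```python
import numpy as np

def laplace_mechanism(true_value: float, sensitivity: float, epsilon: float) -> float:
    """Release true_value with Laplace noise calibrated to sensitivity/epsilon."""
    scale = sensitivity / epsilon  # smaller epsilon -> larger noise -> more privacy
    return true_value + np.random.laplace(loc=0.0, scale=scale)

# Private answer to "how many records are in the dataset?" (true answer: 100).
print(laplace_mechanism(true_value=100, sensitivity=1.0, epsilon=0.5))
```
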
In general, DP compares two nearly identical datasets that differ by just one individual record. For example, if Dataset A has 100 records, then Dataset B should have either 99 or 101 records for DP to hold. This ensures that the outcome of any analysis is nearly the same regardless of whether a specific individual's data is included or excluded. This approach maintains privacy while still allowing statistical validity and generalizability. DP is typically implemented in two ways:

  • Central Differential Privacy (CDP): Noise is added to the data after collection but before analysis.
  • Local Differential Privacy (LDP): Noise is added at the individual level before data is even collected.

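In both variants, the underlying guarantee is the same. A randomized mechanism M is ε-differentially private if, for any two neighboring datasets D and D′ (differing in a single record) and any set of outputs S:

```latex
\Pr[\mathcal{M}(D) \in S] \le e^{\varepsilon} \cdot \Pr[\mathcal{M}(D') \in S]
```

The smaller ε is, the closer the two output distributions must be, which is exactly why an individual's presence or absence becomes harder to detect.
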
The DP composition property allows it to handle multiple queries for analysis by controlling cumulative privacy loss, ensuring robust privacy guarantees even as additional external information becomes available. Moreover, DP’s ability to anonymize data while preserving useful trends enables analysts to gain accurate insights about the population as a whole without compromising individual privacy. For synthetic data, DP ensures that no individual data can be inferred from the generated dataset.

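As a simple sketch of budget accounting under basic sequential composition, where the ε values of successive releases add up (tighter advanced-composition bounds also exist):

```python
# Track cumulative privacy loss across multiple DP query releases.
total_budget = 1.0                 # total epsilon the data owner will spend
per_query_eps = [0.25, 0.25, 0.3]  # epsilon spent on each released statistic

spent = sum(per_query_eps)         # basic sequential composition
assert spent <= total_budget, "privacy budget exceeded"
print(f"remaining budget: {total_budget - spent:.2f}")
```
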
By integrating DP into synthetic data generation, Betterdata ensures unparalleled privacy protection, marking its position as a pioneer in privacy-first synthetic data solutions and allowing enterprises to leverage sensitive datasets confidently without worrying about privacy risks.

c. No Direct Lineage to Source Data:

Synthetic data generation techniques such as GANs, VAEs, or Agent-Based Models (ABMs) ensure there is no direct lineage to the source data. GANs, for instance, involve a generator network that learns to produce realistic data by mimicking the statistical distribution of the original dataset, while the discriminator ensures that the generated data is indistinguishable from real data. On the other hand, VAEs use a latent space representation, abstracting individual data points into a compressed format before decoding them into synthetic data, which inherently breaks any link to the original records. ABMs simulate behaviors and interactions within a system, such as consumer purchasing patterns or urban traffic flow, creating datasets that reflect realistic dynamics without copying actual data. These techniques fundamentally ensure that the synthetic data does not reproduce or contain identifiable features from the original dataset, providing a strong and practical framework for privacy protection.

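To make the generator/discriminator interplay concrete, here is a toy GAN on a stand-in numeric table, written in PyTorch. It is an illustrative sketch, not a production tabular GAN (real models add per-column encodings, conditional sampling, and training stabilizers):

```python
import torch
import torch.nn as nn

# Toy GAN for a 4-column numeric table: the generator maps random noise to
# synthetic rows, the discriminator learns to tell real rows from fakes.
latent_dim, data_dim = 8, 4

generator = nn.Sequential(
    nn.Linear(latent_dim, 32), nn.ReLU(),
    nn.Linear(32, data_dim),
)
discriminator = nn.Sequential(
    nn.Linear(data_dim, 32), nn.ReLU(),
    nn.Linear(32, 1),  # raw logit: real vs. fake
)

opt_g = torch.optim.Adam(generator.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

real_data = torch.randn(256, data_dim) * 2 + 1  # stand-in for a real table

for step in range(1000):
    # Discriminator update: push real rows toward 1, generated rows toward 0.
    real = real_data[torch.randint(0, len(real_data), (64,))]
    fake = generator(torch.randn(64, latent_dim)).detach()
    loss_d = bce(discriminator(real), torch.ones(64, 1)) + \
             bce(discriminator(fake), torch.zeros(64, 1))
    opt_d.zero_grad()
    loss_d.backward()
    opt_d.step()

    # Generator update: produce rows the discriminator scores as real.
    loss_g = bce(discriminator(generator(torch.randn(64, latent_dim))),
                 torch.ones(64, 1))
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()

# Sample brand-new rows; none has a 1:1 link back to a source record.
synthetic_rows = generator(torch.randn(1000, latent_dim)).detach()
```
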
3. Betterdata PII Detection Philosophy:

Betterdata is at the forefront of privacy innovation, leveraging synthetic data and advanced privacy-preserving techniques to redefine how organizations protect PII data. 

Betterdata’s approach to privacy is to secure the end-to-end synthetic data generation pipeline. Unlike traditional anonymization methods, our synthetic data does not replicate real-world records but generates new data with the same statistics as the original dataset. Our goal is to generate realistic synthetic data while minimizing the probability of user re-identification.

It is worth noting that identifiable information can encompass a wide range of data that can be traced back to real individuals, from sensitive details like credit card numbers to seemingly trivial attributes such as the model year of your car. To effectively safeguard such sensitive features, DP is typically used as it is seen as the gold standard of privacy evaluation. By injecting noise into datasets, DP ensures that individual data points cannot be cross-referenced or linked, thereby protecting privacy while preserving the utility of the data.

4. Betterdata’s PII detection model for LLMs:

As professionals and organizations increasingly rely on public chat interfaces like ChatGPT, the risk of inadvertently leaking PII data grows. Names, dates, API keys, and other sensitive information often find their way into these conversations, creating significant privacy concerns. Addressing this challenge, Betterdata has developed a powerful PII detection model that ensures data privacy while maintaining context for downstream applications. 

Built on the Qwen2-0.5B decoder transformer model, this PII detection model can identify and mask sensitive information with class tags. It covers 29 PII classes across seven languages: English, Spanish, Swedish, German, Italian, Dutch, and French. Trained on publicly available datasets under permissive licenses as well as synthetic data, the model provides robust multilingual support for diverse use cases.

Check out our model on HuggingFace.

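As a rough illustration of how such a decoder-based tagger could be called via the HuggingFace transformers API, here is a hypothetical sketch. The model ID, input text, and output style below are placeholders for illustration, not Betterdata's published interface:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "betterdata/pii-detection"  # placeholder, not the real model ID

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)  # small enough for CPU

text = "Contact Maria Lopez at maria.lopez@example.com on 2024-03-01."
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# Intended style of output: "Contact [NAME] at [EMAIL] on [DATE]."
```
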
a. Key Features:

  • High efficiency: Optimized for low latency, the model is lightweight and runs efficiently on CPUs, making it accessible for various deployment scenarios.
  • Customizability: Users can define custom classes for PII detection, tailoring the model to their specific needs.
  • Scalability: While the Qwen2-0.5B serves as a baseline, Betterdata also houses larger, more powerful models for advanced use cases.

b. Use Cases:

Whether you’re a developer integrating third-party APIs or an organization handling large-scale user interactions with LLMs, the PII detection model is a vital tool. It enables:

  • Enhanced privacy: Mask PII to prevent data leakage.
  • Seamless context preservation: Retain critical context for processing without exposing sensitive data.
  • Ease of deployment: Run the model locally on your infrastructure using CPUs only, making it accessible for all types of environments.

c. Key Considerations:

  • Latency and accessibility: The model’s lightweight design ensures low latency, making it suitable for real-time applications.
  • Multilingual support: Coverage of seven major languages makes the model versatile across use cases and geographies.
  • Customizability: Users can define custom PII classes to tailor the model to their specific needs, ensuring adaptability across industries.

d. Limitations and Bias:

As with any AI model, there are limitations. While the PII detection model performs well on standard classes, edge cases and extended classes pose challenges. With constant updates and expanded training datasets, Betterdata is actively addressing these limitations.

e. Performance and Future Improvements:

While the model excels in detecting common PII like names, emails, and phone numbers, there is ongoing work to improve its accuracy on more complex classes such as API keys, credit card CVVs, and bank account numbers. Betterdata continues to invest in generating additional data for these classes, ensuring high-performance gains with each model update.

Betterdata’s PII model has shown promising results in out-of-distribution testing, reinforcing its reliability in real-world applications. Users can expect improved coverage and accuracy with our enterprise version models, as Betterdata’s R&D team actively works to refine and expand the model’s capabilities.

5. The Difference Between Synthetic Data and Anonymized Data:

While both synthetic data and data anonymization aim to protect data privacy, their applications and effectiveness differ significantly. Synthetic data is artificially generated and carries little to no risk of re-identification, future-proofing it against privacy concerns. In contrast, traditional data anonymization techniques rely on destroying sensitive information to protect customer data.

For a long time, data anonymization has been sufficient for enterprises. However, with the rapid advancements in AI/ML, it has become increasingly evident that anonymized data no longer meets the demands of data-first enterprises. The challenge is even bigger for enterprises handling vast amounts of private data and training ML models that require high-quality, high-dimensional datasets. Traditional data anonymization methods such as encryption, k-anonymity, and pseudonymization significantly degrade data utility and quality, rendering them inadequate for AI/ML applications.

This puts organizations in a difficult position: they must choose between protecting privacy and preserving quality, which more often than not leads to undesirable consequences. Synthetic data, however, provides a more practical solution, since it automates the balance between generating high-quality data and applying the right level of privacy protection.

Unlike anonymization, synthetic data also enables data augmentation to address issues such as bias, missing values, and imbalanced datasets, helping enterprises generate better insights and achieve higher performance in their AI/ML models.

6. Balancing Data Privacy and Data Utility (Betterdata’s business-first approach):

When it comes to balancing data privacy with utility, there is no single correct answer—nor is there a universally recognized framework. Every enterprise operates with a unique data schema, making it impractical to apply a one-size-fits-all approach to privacy and utility. What works for one organization may not be suitable for another.

At Betterdata, we take a business-first approach, prioritizing the needs of our clients by assessing their data landscape, identifying sensitivity levels (i.e., direct identifiers vs. quasi-identifiers), and then advising on the appropriate Differential Privacy (DP) strategies.

a. Understanding ε in DP: 

Recall that in DP, ε (epsilon) controls the trade-off between privacy and utility:

  • A smaller ε ensures stronger privacy, making it harder to detect the presence or absence of a specific individual.
  • An ε of 0 provides the highest level of privacy but renders the data completely useless.
  • An ε > 0 introduces some privacy loss in exchange for better data utility.
  • The maximum value of ε depends on the application, the privacy budget, and the level of acceptable privacy risk (see the numeric sketch below).

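To make the trade-off concrete, here is how the Laplace noise scale (sensitivity / ε) shrinks as ε grows, assuming a counting query with sensitivity 1:

```python
# Laplace noise scale for a sensitivity-1 query at different privacy levels.
sensitivity = 1.0
for epsilon in (0.1, 0.5, 1.0, 8.0, 100.0):
    print(f"epsilon={epsilon:>6}: noise scale = {sensitivity / epsilon:.3f}")
```
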
b. Real-world applications of ε in DP:

Different organizations adopt varying ε values based on their privacy tolerance:

  • Google (Chrome’s RAPPOR & telemetry data): ε ≈ 0.5 - 8
  • Apple (Differential Privacy in iOS & macOS): ε ≈ 2 - 14
  • U.S. Census Bureau (2018 Census data release): ε ≈ 8.9
  • Meta (Facebook DP applications): ε ≈ 0.5 - 10
  • Academic and industry research: some studies use ε > 100 to balance accuracy and privacy.

c. What is the highest ε ever used?

There is no global maximum for ε, as its value depends on the specific use case:

  • Theoretical studies have explored ε values of 100+ to evaluate accuracy trade-offs.
  • Certain industrial applications have used ε ≈ 50 - 100, prioritizing data utility over privacy.

d. What does a high ε even mean?

  • A low ε (e.g., ≤ 1) offers stronger privacy but reduces data accuracy.
  • A high ε (e.g., 10 - 100+) improves utility but weakens privacy, increasing re-identification risks.

e. Betterdata’s approach to optimizing ε:

At Betterdata, we generate synthetic data with an ε value tailored to business needs first to ensure data remains useful. Only then do we optimize ε towards the lowest possible value (without ever reaching 0). Our goal is to minimize ε while preserving data quality, ensuring enterprises benefit from high-utility synthetic data while maintaining strong privacy protection using DP (the gold standard of privacy).

This approach empowers organizations to retain data quality, mitigate privacy risks, and confidently comply with data protection regulations.

Learn more about how Betterdata does privacy scoring for synthetic data in a guide Betterdata co-authored with the Singapore Personal Data Protection Commission (Annex E: Approach 3): https://www.pdpc.gov.sg/help-and-resources/2024/07/proposed-guide-on-synthetic-data-generation

7. Privacy Risk Assessment:

Evaluating the privacy risk of a dataset involves determining the likelihood of re-identifying an individual, either by linking records within the dataset or by cross-referencing it with external data sources. A critical aspect of this assessment is distinguishing between population-level risk and individual-level risk.

  • Population-level risk: This refers to the privacy risks associated with aggregated information about the dataset as a whole. While it does not directly identify individuals, it can reveal patterns, trends, or demographic characteristics that, when combined with other data, could lead to re-identification or misuse. Examples include ZIP codes, income levels, and ethnic distributions.
  • Individual-level risk: This focuses on the likelihood that a specific individual can be re-identified within a dataset. Even when direct identifiers (e.g., name, social security number) are removed, re-identification is still possible through quasi-identifiers, i.e., indirect attributes such as birth date, gender, and location. Advanced data linkage techniques can further increase the risk, especially when multiple datasets are combined.

Additionally, organizations must always be mindful of PII data, which directly links to individuals, such as names, ID numbers, credit card details, or health records.

a. Mitigating Re-Identification Risks:

To protect privacy, data must be broken down, assessed both individually and collectively, and safeguarded accordingly. The guiding principle is: "Break the data to save it." If a dataset can be linked back to identify individuals, an attacker can exploit that same linkage.

b. Key Strategies to Reduce Re-Identification Risks Include:

  • Data minimization: Limiting data collection, sharing, and cloud storage to only what is necessary.
  • Robust data retention policies: Ensuring data is not stored longer than required.
  • Privacy impact assessments: Evaluating risks before data is processed or shared.

These measures apply whether the data is real, synthetic, or anonymized, strengthening privacy and security while enabling responsible data usage.

8. Types of Privacy Risks:

a. Singling out:

This occurs when an individual can be uniquely identified, even if direct identifiers are removed. This often happens through indirect identifiers like age, ZIP code, or occupation, especially when combined with other datasets.

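A quick way to measure singling-out risk is to count how many records share each quasi-identifier combination; a group of size 1 means that person is singled out. A minimal sketch with pandas, where the column names are illustrative:

```python
import pandas as pd

# Toy records containing quasi-identifiers only (no direct identifiers).
df = pd.DataFrame({
    "birth_year": [1980, 1980, 1991, 1991, 1975],
    "gender":     ["F",  "F",  "M",  "M",  "F"],
    "zip3":       ["940", "940", "100", "100", "112"],
})

# Group size per quasi-identifier combination; size 1 means "singled out".
group_sizes = df.groupby(["birth_year", "gender", "zip3"]).size()
singled_out = (group_sizes == 1).sum()
print(f"{singled_out} of {len(group_sizes)} quasi-identifier groups are unique")
```
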
b. Linkability risk:

This arises when multiple datasets are cross-referenced to reveal an individual’s identity. For example, connecting a dataset of patient medical conditions with another that includes ZIP codes or IDs can uncover sensitive profiles, even without direct identifiers.

c. Inference attacks:

These attacks exploit patterns and relationships in a dataset to deduce private details, using techniques such as attribute inference, background knowledge exploitation, or differential attacks. Sensitive information can be at risk this way even in synthetic datasets.

d. Membership inference:

These focus on identifying whether an individual’s data was used in training an ML model. By analyzing model behavior, such as confidence levels, attackers can infer the inclusion of specific data points, posing significant risks for sensitive datasets.

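A toy illustration of the confidence gap these attacks exploit, using an overfit classifier on invented data (a sketch of the intuition, not a full attack):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Members (training rows) often receive higher model confidence than
# non-members, which an attacker can threshold on.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 10))
y = (X[:, 0] + rng.normal(scale=0.5, size=2000) > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
conf_members = model.predict_proba(X_train).max(axis=1)
conf_nonmembers = model.predict_proba(X_test).max(axis=1)

# A persistent confidence gap is the signal membership inference exploits.
print(f"mean confidence, members:     {conf_members.mean():.3f}")
print(f"mean confidence, non-members: {conf_nonmembers.mean():.3f}")
```
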
e. Data leakage:

Data leakage occurs when information outside the intended dataset improperly influences model training, such as test data appearing in training. This compromises both privacy and model reliability, often leading to misleading performance metrics.

9. Synthetic Data Privacy Scoring Metrics:

a. Exact match score:

Evaluates how often records in a synthetic dataset exactly match records in the original dataset, indicating potential privacy risks.

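A minimal way to compute such a score with pandas (illustrative, not Betterdata's scoring implementation; the toy frames are invented):

```python
import pandas as pd

# One synthetic row copies a real record verbatim in this toy example.
real_df = pd.DataFrame({"age": [34, 58], "zip3": ["940", "100"]})
synth_df = pd.DataFrame({"age": [34, 41], "zip3": ["940", "112"]})

# Fraction of synthetic rows that exactly match some real row (lower is better).
matches = synth_df.merge(real_df.drop_duplicates(), how="inner")
exact_match_rate = len(matches) / len(synth_df)
print(f"exact match rate: {exact_match_rate:.2f}")  # 0.50 in this toy case
```
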
b. Neighbors privacy score:

Assesses how similar individual records are to others in the dataset, with higher scores signaling better privacy preservation.

c. Membership inference score:

Measures the likelihood of identifying whether a specific individual’s data was part of the training dataset.

d. Leakage score:

Quantifies how much sensitive information has leaked into model training, affecting both privacy and performance.

e. Proximity score:

Reflects the uniqueness of individual records by comparing them to neighbors in the dataset, with lower scores posing higher re-identification risks.

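One common way to compute a proximity-style score is the distance-to-closest-record (DCR): how far each synthetic record is from its nearest real record. A sketch with scikit-learn on invented arrays (illustrative, not Betterdata's implementation):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
real_X = rng.normal(size=(1000, 5))   # stand-in for normalized real records
synth_X = rng.normal(size=(500, 5))   # stand-in for normalized synthetic records

# Distance from each synthetic record to its closest real record; a spike of
# near-zero distances suggests the generator memorized training rows.
nn = NearestNeighbors(n_neighbors=1).fit(real_X)
distances, _ = nn.kneighbors(synth_X)
print(f"median distance to closest real record: {np.median(distances):.3f}")
```
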
Learn More About Privacy Scoring and Privacy Risks Here.

10. Ending Thoughts:

Enterprises need data. However, access to data is heavily protected by global data protection agencies due to growing data privacy concerns among consumers. Betterdata helps you overcome this mildly annoying but much-needed obstacle. Our state-of-the-art (SOTA) models such as ARF, TabTreeFormer, and TAEGAN, among others, have been developed to generate high-quality, realistic synthetic tabular and relational data. Models for sequential and text data are in development and will be launched later this year.

Because it is highly secure, synthetic data enables organizations to access and share data quickly and safely while preserving data utility, allowing them to double down on data-driven innovation. This makes it the most reliable and cost-effective solution for overcoming data scarcity, data quality, and data privacy challenges, especially in use cases such as enterprise AI/ML.

Dr. Uzair Javaid
Dr. Uzair Javaid is the CEO and Co-Founder of Betterdata AI, specializing in programmable synthetic data generation using Generative AI and Privacy Engineering. With a Ph.D. in Computer Science from the National University of Singapore, his research has focused on blockchain-based cybersecurity solutions. He has 15+ publications and 600+ citations, and his work in data security has earned him awards and recognition. Previously, he worked at Merkle Science, developing taint analysis techniques for blockchain wallets. Dr. Javaid also advises at German Entrepreneurship Asia, supporting entrepreneurship in the region.