Dr. Uzair Javaid

Dr. Uzair Javaid is the CEO and Co-Founder of Betterdata AI, a company focused on Programmable Synthetic Data generation using Generative AI and Privacy Engineering. Betterdata’s technology helps data science and engineering teams easily access and share sensitive customer/business data while complying with global data protection and AI regulations.
Previously, Uzair was a Software Engineer and Business Development Executive at Merkle Science (Series A $20M+), where he developed taint analysis techniques for blockchain wallets.

Uzair has a strong academic background in Computer Science/Engineering with a Ph.D. from the National University of Singapore (Top 10 in the world). His research focused on designing and analyzing blockchain-based cybersecurity solutions for cyber-physical systems, with a specialization in data security and privacy engineering techniques.

In one of his Ph.D. projects, he reverse-engineered the encryption algorithm of the Ethereum blockchain and ethically hacked 670 user wallets. He has been cited 600+ times across 15+ publications in globally reputable conferences and journals, and has received recognition for his work, including a Best Paper Award and scholarships.

In addition to his work at Betterdata AI, Uzair is an advisor at German Entrepreneurship Asia, providing guidance and expertise to support entrepreneurship initiatives in the Asian region. He has also been active in paying it forward, volunteering as a peer student support group member at the National University of Singapore and serving as a technical program committee member for the International Academy, Research, and Industry Association.

Synthetic Data vs Data Anonymization: Which is Better?

Dr. Uzair Javaid
June 3, 2024

1. Understanding Data Anonymization:

Data anonymization is a legacy method of protecting data privacy through information sanitization: encrypting, destroying, or suppressing data to erase the personally identifiable information that connects stored datasets to individuals.
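To make suppression concrete, here is a minimal sketch in Python. The field names, the salt, and the `anonymize` helper are hypothetical, and a salted one-way hash stands in for the "encryption" step, so treat this as an illustration under those assumptions rather than a production routine.

```python
import hashlib

# Hypothetical direct identifiers; a real schema will differ.
DIRECT_IDENTIFIERS = {"name", "email"}

def anonymize(record: dict, salt: str) -> dict:
    """Suppress direct identifiers and replace the customer ID with a salted hash."""
    out = {}
    for key, value in record.items():
        if key in DIRECT_IDENTIFIERS:
            continue  # suppression: the field is dropped entirely
        if key == "customer_id":
            digest = hashlib.sha256((salt + str(value)).encode("utf-8")).hexdigest()
            out[key] = digest[:16]  # one-way pseudonym (hashing, not reversible encryption)
        else:
            out[key] = value  # everything else passes through unchanged
    return out

record = {
    "customer_id": 1042, "name": "Jane Doe", "email": "jane@example.com",
    "zip": "02139", "gender": "F", "dob": "1985-07-14", "spend": 310.5,
}
masked = anonymize(record, salt="demo-salt")
```

Note that `masked` still carries ZIP code, gender, and date of birth. Dropping those too would protect privacy further, but only at the cost of the very columns analysts tend to need.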

By encrypting or suppressing data, privacy teams effectively ensure that it cannot be viewed by anyone, including the data teams that need unrestricted datasets, especially for advanced machine learning models. This makes anonymized data useless in most cases. Organizations are therefore forced to choose between preserving data privacy and preserving data utility, a choice with no good answer. This is why data teams often opt to work with highly sensitive real data, risking legal action if that data is compromised.

But this isn’t all. Data anonymization is not as effective at protecting data privacy as you might think. An estimated 87% of the U.S. population can be uniquely identified by the combination of gender, ZIP code, and date of birth, even though none of these attributes identifies a person on its own. Re-identifying anonymized data is not easy, but it is not impossible. Anonymized data is therefore not only hard for data scientists to use; it also leaves data privacy at risk.
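The effect of combining attributes can be checked on any table by counting how many rows are unique on a given set of columns. The sketch below uses a tiny made-up dataset and a hypothetical `unique_fraction` helper; it illustrates the idea and is not a measurement of any real population.

```python
from collections import Counter

def unique_fraction(records, quasi_identifiers):
    """Fraction of records whose quasi-identifier combination appears exactly once."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    counts = Counter(keys)
    unique = sum(1 for k in keys if counts[k] == 1)
    return unique / len(keys)

# Made-up rows for illustration only.
rows = [
    {"gender": "F", "zip": "02139", "dob": "1985-07-14"},
    {"gender": "F", "zip": "02139", "dob": "1990-01-02"},
    {"gender": "M", "zip": "02139", "dob": "1985-07-14"},
    {"gender": "M", "zip": "02139", "dob": "1985-07-14"},
]

gender_only = unique_fraction(rows, ["gender"])             # nobody is unique
combined = unique_fraction(rows, ["gender", "zip", "dob"])  # half the rows are unique
```

Here no row is unique on gender alone, yet two of the four rows become unique once gender, ZIP code, and date of birth are combined.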

2. Understanding Synthetic Data:

Synthetic data is artificially generated data, produced with generative AI, that looks, feels, and functions like real data. It is generated by training advanced ML models on real data; the models then produce a dataset that matches the highly sensitive production data with up to 99% accuracy. The synthetic data has nearly the same statistical properties and structure as the real data, and since it contains no real consumer records, it cannot be traced back to any individual, which makes it privacy-proof. It is therefore not bound by data privacy and protection laws in the way real data is, making it easier and safer for organizations to access, use, and share data.
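The fit-then-sample workflow can be sketched with a deliberately simple stand-in for the generative model. The example below assumes a toy table with one numeric and one categorical column, and it models each column independently (a normal distribution and observed category frequencies). Real generators learn the joint distribution, so this is only an illustration of the workflow, not of any particular product's method; `fit` and `sample` are hypothetical names.

```python
import random
import statistics
from collections import Counter

def fit(records, numeric, categorical):
    """Learn simple per-column statistics from the real data."""
    model = {}
    for col in numeric:
        values = [r[col] for r in records]
        model[col] = ("normal", statistics.mean(values), statistics.stdev(values))
    for col in categorical:
        counts = Counter(r[col] for r in records)
        model[col] = ("categorical", list(counts.keys()), list(counts.values()))
    return model

def sample(model, n, seed=0):
    """Draw n brand-new rows from the fitted statistics; no real row is copied."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        row = {}
        for col, spec in model.items():
            if spec[0] == "normal":
                row[col] = rng.gauss(spec[1], spec[2])
            else:
                row[col] = rng.choices(spec[1], weights=spec[2], k=1)[0]
        rows.append(row)
    return rows

# Made-up "real" data for illustration.
real = [
    {"age": 34, "plan": "basic"},
    {"age": 45, "plan": "premium"},
    {"age": 29, "plan": "basic"},
    {"age": 52, "plan": "basic"},
]
synthetic = sample(fit(real, numeric=["age"], categorical=["plan"]), n=1000)
```

Because each column is sampled independently, this sketch preserves per-column statistics but not correlations between columns, which is exactly the gap that more capable generative models are meant to close.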

Furthermore, synthetic data is useful beyond privacy protection. Other benefits include mitigating bias in datasets and generating additional data for rare scenarios from limited real data.

3. Key Differences Between Synthetic Data and Data Anonymization:

a. Data Realism: 

  • Synthetic data replicates real-world data with accuracy scores of up to 99%, preserving the structure and statistical properties of production data.
  • Data anonymization retains the original records but encrypts, suppresses, or destroys identifiers, which damages data quality.
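The article does not say how its accuracy score is computed. One common way to quantify how closely a synthetic column follows a real one, assumed here purely for illustration, is one minus the total variation distance between the two category distributions. The data and the `total_variation_distance` helper below are made up.

```python
from collections import Counter

def total_variation_distance(real_values, synthetic_values):
    """Half the sum of absolute differences between category frequencies (0 = identical)."""
    real_counts = Counter(real_values)
    synth_counts = Counter(synthetic_values)
    categories = set(real_counts) | set(synth_counts)
    return 0.5 * sum(
        abs(real_counts[c] / len(real_values) - synth_counts[c] / len(synthetic_values))
        for c in categories
    )

# Made-up columns: 70/30 split in the real data, 68/32 in the synthetic data.
real_plans = ["basic"] * 70 + ["premium"] * 30
synthetic_plans = ["basic"] * 68 + ["premium"] * 32

tvd = total_variation_distance(real_plans, synthetic_plans)
similarity = 1 - tvd  # roughly 0.98 for these two columns
```

A score like this covers a single column; a fuller evaluation would also compare relationships between columns and downstream model performance.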

b. Privacy Risks: 

  • Synthetic data does not contain the data of any real individual, so there is no real person's record in it to re-identify.
  • Anonymized data carries a high risk of re-identification, since the remaining identifiers can be linked together or encryption keys can be stolen.
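Linking identifiers together usually means joining the anonymized table against some other dataset that still carries names. The sketch below assumes a made-up anonymized health table and a made-up public list sharing gender, ZIP code, and date of birth; `link` is a hypothetical helper, and the example only shows the mechanics of such a join.

```python
def link(anonymized, public, quasi_identifiers):
    """Return (name, record) pairs where exactly one public entry matches a record."""
    index = {}
    for person in public:
        key = tuple(person[q] for q in quasi_identifiers)
        index.setdefault(key, []).append(person["name"])
    matches = []
    for record in anonymized:
        key = tuple(record[q] for q in quasi_identifiers)
        names = index.get(key, [])
        if len(names) == 1:  # a single candidate means the record is re-identified
            matches.append((names[0], record))
    return matches

# Made-up data: names were suppressed in the anonymized table only.
anonymized = [
    {"gender": "F", "zip": "02139", "dob": "1985-07-14", "diagnosis": "asthma"},
    {"gender": "M", "zip": "02139", "dob": "1990-03-01", "diagnosis": "diabetes"},
]
public = [
    {"name": "Jane Doe", "gender": "F", "zip": "02139", "dob": "1985-07-14"},
    {"name": "John Roe", "gender": "M", "zip": "02139", "dob": "1990-03-01"},
    {"name": "Jim Poe", "gender": "M", "zip": "02139", "dob": "1990-03-01"},
]

re_identified = link(anonymized, public, ["gender", "zip", "dob"])
```

In this toy case the first record resolves to a single name, while the second stays ambiguous because two people share the same combination.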

c. Data Utility:

  • Synthetic data maintains, and in many cases improves, data quality through bias mitigation and rebalancing, which increases data utility and coverage.
  • Because anonymized data is encrypted or suppressed, its utility for machine learning models is reduced, which hurts the accuracy of data analysis.
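Rebalancing can be illustrated with a simple interpolation scheme loosely in the style of SMOTE. This is a swapped-in, simplified technique (it picks random pairs of minority rows instead of nearest neighbours) and is not claimed to be how any specific synthetic data product works. The dataset and the `oversample_minority` helper are hypothetical.

```python
import random

def oversample_minority(rows, label_key, minority_label, target_count, seed=0):
    """Add interpolated minority rows until the minority class reaches target_count."""
    rng = random.Random(seed)
    minority = [r for r in rows if r[label_key] == minority_label]
    feature_keys = [k for k in minority[0] if k != label_key]  # numeric features assumed
    synthetic = []
    while len(minority) + len(synthetic) < target_count:
        a, b = rng.sample(minority, 2)  # needs at least two real minority rows
        t = rng.random()                # position on the line segment between a and b
        new_row = {k: a[k] + t * (b[k] - a[k]) for k in feature_keys}
        new_row[label_key] = minority_label
        synthetic.append(new_row)
    return rows + synthetic

# Made-up transactions: six legitimate rows and only two fraud rows.
rows = [{"amount": 20.0 + i, "hour": 9 + i, "label": "ok"} for i in range(6)]
rows += [
    {"amount": 900.0, "hour": 2, "label": "fraud"},
    {"amount": 1200.0, "hour": 4, "label": "fraud"},
]

balanced = oversample_minority(rows, "label", "fraud", target_count=6)
```

After the call, both classes have six rows, and every added fraud row lies between the two real fraud examples rather than duplicating either one.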

d. Testing and Development:

  • Synthetic data can be engineered to cover rare scenarios not present in the original dataset, enhancing model robustness and performance, and allowing for extensive data testing. The generated data can also be monetized without any risk of data privacy breach.
  • Anonymized data is needed only in instances where real-world data is crucial for testing and analysis.

e. Regulatory Compliance:

  • Synthetic data generally falls outside the scope of data protection regulation, since it is artificial data containing no identifiable markers that link back to real individuals.
  • Anonymized data is regulated, and compliance is achieved by protecting any identifiable markers that could be used to obtain real user information. Because this data belongs to real individuals, different data protection controls have to be implemented depending on the applicable data privacy laws.

f. Application Scope:

  • Synthetic data has a wide variety of use cases across scenarios and industries. Because it is artificially generated, it can be used for rare-scenario coverage, privacy protection, data monetization, and machine learning.
  • Anonymized data is limited to specific, often small-scale scenarios with limited data utility, and is generally unsuitable for AI/ML training.

In summary, data anonymization techniques, which remove information to protect privacy, have been shown to fall short of fully protecting data privacy: anonymized data can be re-identified, and suppression reduces data utility. Synthetic data, by contrast, does not directly link to real users, which keeps it secure. In addition, synthetic data can be used to train more robust ML models through enhancement, bias mitigation, and data rebalancing.
