Dr. Uzair Javaid

Dr. Uzair Javaid is the CEO and Co-Founder of Betterdata AI, a company focused on Programmable Synthetic Data generation using Generative AI and Privacy Engineering. Betterdata’s technology helps data science and engineering teams easily access and share sensitive customer/business data while complying with global data protection and AI regulations.
Previously, Uzair was a Software Engineer and Business Development Executive at Merkle Science (Series A, $20M+ raised), where he developed taint analysis techniques for blockchain wallets.

Uzair has a strong academic background in Computer Science/Engineering, with a Ph.D. from the National University of Singapore (ranked among the top 10 universities worldwide). His research focused on designing and analyzing blockchain-based cybersecurity solutions for cyber-physical systems, with a specialization in data security and privacy engineering techniques.

In one of his Ph.D. projects, he reverse engineered the encryption algorithm of the Ethereum blockchain and ethically hacked 670 user wallets. He has been cited 600+ times across 15+ publications in globally reputable conferences and journals, and has received recognition for his work, including a Best Paper Award and scholarships.

In addition to his work at Betterdata AI, Uzair is an advisor at German Entrepreneurship Asia, providing guidance and expertise to support entrepreneurship initiatives across Asia. He has also been actively involved in paying it forward, volunteering as a peer student support group member at the National University of Singapore and serving as a technical program committee member for the International Academy, Research, and Industry Association.

Generate High-Utility Tabular Synthetic Data with ARF

Dr. Uzair Javaid
February 22, 2025

Data without utility is just noise. To cut through this noise, Betterdata has developed a tabular synthetic data generation model which we are sure data teams will appreciate. We call it Adversarial Random Forests (ARF). 

The purpose?

Empowering enterprises to generate high-utility synthetic data, particularly for applications like fraud detection, healthcare, and finance, where balancing data privacy and utility is critical. Enabling data teams to train high-impact models that just work, whether for data analysis, forecasting, customer behavior analysis, or other tasks.

What is ARF?

Adversarial Random Forests (ARF) is a technique for generating and augmenting high-utility tabular synthetic data that closely mimics real-world datasets. By leveraging Random Forests in an adversarial setting, ARF creates statistically similar synthetic data, making it ideal for applications like fraud detection, healthcare diagnostics, and financial modeling. ARF also enables data augmentation, improving machine learning model performance by addressing data scarcity and imbalance. While ARF itself is primarily focused on utility, at Betterdata we implement Differential Privacy across the entire synthetic data pipeline, ensuring data privacy and compliance specific to your use case and legal framework.

How Does ARF Work in Practice?

The ARF model operates through a series of iterative steps that involve training Random Forests and sampling synthetic data. Here’s a step-by-step breakdown of the process:

Step 1: Data Permutation and Labeling

  • Suppose you start with an original dataset of 200 rows.
  • Permute the data column by column to create a new set of 200 rows. This breaks the column correlations, effectively creating a "false" dataset.
  • Label the original data as true and the permuted data as false.
  • Combine these 400 rows (200 true + 200 false) to train the first Random Forest model, RF1.
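The permutation-and-labeling step above can be sketched in a few lines of Python. This is a minimal illustration using scikit-learn and a made-up two-column toy table (the column names and sizes are assumptions for the example, not Betterdata's production code):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Toy "original" dataset of 200 rows (stand-in for your real table).
real = pd.DataFrame({
    "age": rng.integers(18, 80, 200),
    "amount": rng.normal(100.0, 30.0, 200),
})

# Permute each column independently: every marginal distribution is
# preserved exactly, but the correlations between columns are destroyed.
fake = real.apply(lambda col: rng.permutation(col.to_numpy()))

# Label original rows as true (1) and permuted rows as false (0),
# then train the first discriminator forest, RF1, on all 400 rows.
X = pd.concat([real, fake], ignore_index=True)
y = np.array([1] * len(real) + [0] * len(fake))
rf1 = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
```

Note that each column must be shuffled with its own permutation; shuffling all columns with the same permutation would simply reorder the rows and leave the correlations intact.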

Step 2: Sampling Synthetic Data from RF1

  • Sample 200 new rows of data from RF1’s leaves. Here’s how:
      ◦ Randomly select one tree from RF1.
      ◦ Choose a leaf weighted by its data coverage (e.g., if leaves hold 100, 80, and 20 data points, the selection probabilities are 0.5, 0.4, and 0.1, respectively).
      ◦ Extract all "true"-labeled data from the chosen leaf.
      ◦ For each column, randomly draw a value:
          ▪ For categorical columns, choose a category according to its distribution in the leaf.
          ▪ For continuous columns, estimate a Truncated Normal Distribution (TND) from the available values and sample from it.
  • Label the newly sampled 200 rows as false.
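The leaf-sampling procedure can be sketched as follows. This is a simplified illustration (a toy correlated dataset, a basic per-leaf truncated normal, and equal weight per leaf member are all assumptions of the sketch, not the exact ARF implementation):

```python
import numpy as np
import pandas as pd
from scipy.stats import truncnorm
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Toy 200-row "true" dataset with a strong x-y correlation.
real = pd.DataFrame({"x": rng.normal(0, 1, 200)})
real["y"] = 2 * real["x"] + rng.normal(0, 0.1, 200)

# RF1: discriminator trained on true vs. column-permuted rows (Step 1).
fake = real.apply(lambda c: rng.permutation(c.to_numpy()))
X = pd.concat([real, fake], ignore_index=True)
y = np.array([1] * 200 + [0] * 200)
rf1 = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)

def sample_rows(rf, true_df, n, rng):
    """Leaf-based sampling sketch: pick a random tree, pick a leaf
    weighted by how many true rows it covers, then draw each column
    independently from the true rows that fall in that leaf."""
    rows = []
    for _ in range(n):
        tree = rf.estimators_[rng.integers(len(rf.estimators_))]
        leaf_ids = tree.apply(true_df.to_numpy())       # leaf of each true row
        leaves, counts = np.unique(leaf_ids, return_counts=True)
        leaf = rng.choice(leaves, p=counts / counts.sum())
        members = true_df[leaf_ids == leaf]
        row = {}
        for col in true_df.columns:
            vals = members[col].to_numpy()
            if true_df[col].dtype == object:
                # Categorical: empirical draw from the leaf's categories.
                row[col] = rng.choice(vals)
            elif vals.min() == vals.max():
                row[col] = vals[0]                       # degenerate leaf
            else:
                # Continuous: truncated normal fitted to the leaf's values.
                mu, sd = vals.mean(), vals.std()
                a, b = (vals.min() - mu) / sd, (vals.max() - mu) / sd
                row[col] = truncnorm.rvs(a, b, loc=mu, scale=sd,
                                         random_state=int(rng.integers(2**31)))
        rows.append(row)
    return pd.DataFrame(rows)

synthetic = sample_rows(rf1, real, 200, rng)  # 200 new "false" rows
```

Because each draw is truncated to the range of values inside the chosen leaf, the synthetic rows always stay within the bounds of the original data.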

Step 3: Training RF2 and Iterating

  • Combine the original 200 true rows with the newly sampled 200 false rows to train a second Random Forest, RF2.
  • Repeat the sampling process to generate another 200 rows of synthetic data.
  • Continue this iterative process until the Random Forest can no longer distinguish between true and false data (i.e., prediction accuracy drops to 0.5).
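The full loop across Steps 1-3, including the 0.5-accuracy stopping rule, can be sketched like this. To keep the example self-contained, the "sample from the forest's leaves" step is replaced by a simple bootstrap of the real rows (a placeholder, clearly not the real ARF sampler), and out-of-bag accuracy stands in for the discriminator's accuracy:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Correlated toy data so the first discriminator has something to detect.
real = pd.DataFrame({"x": rng.normal(0, 1, 200)})
real["y"] = 2 * real["x"] + rng.normal(0, 0.1, 200)

# Round 0 "fake" data: column-wise permutation (breaks the x-y link).
fake = real.apply(lambda c: rng.permutation(c.to_numpy()))

for round_idx in range(10):
    X = pd.concat([real, fake], ignore_index=True)
    y = np.array([1] * len(real) + [0] * len(fake))
    rf = RandomForestClassifier(n_estimators=50, oob_score=True,
                                random_state=round_idx).fit(X, y)
    acc = rf.oob_score_  # out-of-bag accuracy of the true/false discriminator
    if acc <= 0.55:      # close enough to a coin flip: stop iterating
        break
    # In full ARF the next batch of fake rows is sampled from rf's leaves
    # (see the Step 2 sketch); here we bootstrap the real rows as a
    # stand-in, since both classes then share one distribution.
    fake = real.sample(n=len(real), replace=True,
                       random_state=round_idx).reset_index(drop=True)
```

In round 0 the discriminator separates true from permuted rows easily (the broken x-y correlation gives it away); once the fake rows match the joint distribution, accuracy collapses toward 0.5 and the loop stops.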

When Does the ARF Algorithm Stop?

The ARF algorithm can stop at any point after training RF1. However, the ideal stopping criterion is when the Random Forest’s prediction accuracy for distinguishing true and false data reaches 0.5. At this point, the synthetic data is statistically indistinguishable from the real data, ensuring high utility.

ARF as a Generator Model

In the ARF framework, each Random Forest (RF1, RF2, etc.) acts as a generator model. The process of sampling false data is essentially the process of generating synthetic data. This makes ARF a powerful tool for creating datasets that can be used for machine learning, testing, and analysis without compromising privacy.

Real-World Application: IEEE Fraud Detection Dataset

To demonstrate the scalability and efficiency of ARF, consider its application on the IEEE Fraud Detection Dataset:

  • Training + density estimation time: 3 hours 18 minutes
  • Sampling 10 million rows: 8 hours 19 minutes
  • Total time: 11 hours 37 minutes

This showcases ARF’s ability to handle large-scale datasets efficiently, making it a practical solution for real-world problems.

Key Benefits of Using ARF for Synthetic Data Generation

  • High Utility: ARF-generated data closely resembles the original dataset, ensuring its usefulness for analysis and modeling.
  • Scalability: ARF can handle large datasets, as demonstrated by its performance on the IEEE Fraud Detection Dataset.
  • Flexibility: It works seamlessly with both categorical and continuous data.

Why Synthetic Data?

Synthetic data is artificially generated. This means no personally identifiable information (PII) is present in the synthetic dataset.

No PII = No Privacy Loss

This makes it significantly easier for enterprises to use and share data freely and without compromise, accelerating the pace of innovation and empowering organizations to compete in the global tech and AI race.

Betterdata’s programmatic synthetic data generation models are designed to:

  1. Closely mimic real data statistics and distributions
  2. Augment data for scarce datasets or edge cases
  3. Allow seamless integration and scalability depending on the organization’s data needs
  4. Provide quantifiable privacy guarantees

This opens up previously closed doors for organizations that can now safely and effectively utilize data for AI/ML training, data analysis, digital transformation, innovation, and so on.
