Dr. Uzair Javaid

Dr. Uzair Javaid is the CEO and Co-Founder of Betterdata AI, a company focused on Programmable Synthetic Data generation using Generative AI and Privacy Engineering. Betterdata’s technology helps data science and engineering teams easily access and share sensitive customer/business data while complying with global data protection and AI regulations.
Previously, Uzair was a Software Engineer and Business Development Executive at Merkle Science (Series A $20M+), where he developed taint analysis techniques for blockchain wallets.

Uzair has a strong academic background in Computer Science/Engineering with a Ph.D. from the National University of Singapore (Top 10 in the world). His research focused on designing and analyzing blockchain-based cybersecurity solutions for cyber-physical systems, with a specialization in data security and privacy engineering techniques.

In one of his Ph.D. projects, he reverse-engineered the encryption algorithm of the Ethereum blockchain and ethically hacked 670 user wallets. He has been cited 600+ times across 15+ publications in globally reputable conferences and journals, and has received recognition for his work, including a Best Paper Award and scholarships.

In addition to his work at Betterdata AI, Uzair is an advisor at German Entrepreneurship Asia, providing guidance and expertise to support entrepreneurship initiatives in the Asian region. He has also been actively involved in paying it forward, volunteering as a peer student support group member at the National University of Singapore and serving as a technical program committee member for the International Academy, Research, and Industry Association.

High-Quality Synthetic Tabular Data Augmentation with TAEGAN

Dr. Uzair Javaid
February 14, 2025

Interest in synthetic tabular data generation has exploded over the past couple of years, and rightly so. The ability to generate high-quality synthetic tabular data has improved greatly, with many startups, research labs, and organizations building high-performing generative models. This has enabled organizations to quickly access realistic data that is both private and reliable, and to use it for machine learning, AI development, data analysis, data augmentation, software testing, and more.

However, most research on synthetic data generation has focused on larger datasets and has evaluated performance on metrics such as column-wise statistical distributions and inter-feature correlations.

The drawback?

The utility of synthetic data for smaller datasets, especially ones where data is scarce, is often overlooked.

What is TAEGAN? 

Large language models (LLMs) have set the benchmark for synthetic tabular data generation, but their massive scale and complexity are excessive for smaller datasets. Tabular Auto-Encoder Generative Adversarial Network (TAEGAN) is a new GAN-based framework, developed by the AI and research team at Betterdata, designed to generate high-quality tabular data efficiently and precisely.

What sets TAEGAN apart is its use of a masked auto-encoder as the generator, a groundbreaking approach that brings the power of self-supervised pre-training to tabular data generation. TAEGAN effectively exposes the network to more information, enabling it to produce synthetic data that is both realistic and reliable without relying on larger, more complex models.

By “seeing” more information during training, TAEGAN balances performance and complexity, making it especially effective for small or scarce datasets while still delivering top-tier synthetic data.
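To make the idea of a masked auto-encoder more concrete, here is a minimal sketch of the data-side masking step only: some cells of a row are hidden, and the hidden values become the targets a generator would have to reconstruct. This is a generic illustration, not TAEGAN's actual implementation. The `MASK` placeholder, the `mask_prob` parameter, the function names, and the example row are all assumptions made for this sketch.

```python
import random

MASK = None  # hypothetical placeholder marking a hidden cell


def mask_row(row, mask_prob, rng):
    """Hide each cell independently with probability mask_prob.

    Returns (masked_row, mask), where mask[i] is True if cell i was hidden.
    """
    mask = [rng.random() < mask_prob for _ in row]
    masked = [MASK if hidden else value for value, hidden in zip(row, mask)]
    return masked, mask


def reconstruction_targets(row, mask):
    """Map each hidden column index to the original value to be recovered."""
    return {i: value for i, (value, hidden) in enumerate(zip(row, mask)) if hidden}


# Made-up example row: age, occupation, income, country.
rng = random.Random(0)
row = [34, "engineer", 72000.0, "SG"]
masked, mask = mask_row(row, mask_prob=0.5, rng=rng)
targets = reconstruction_targets(row, mask)
```

Because a different subset of cells is hidden each time, the same row can yield many distinct training examples, which is one plausible reading of the network "seeing" more information.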

Architecture of TAEGAN network

TAEGAN Excels in Synthetic Data Augmentation:

When it comes to synthetic tabular data generation for small datasets, TAEGAN has proven to be exceptional, especially for data augmentation.

Evaluating Data Augmentation:

To assess the effectiveness of synthetic data augmentation, we conducted experiments using a combination of original and synthetic data for training. The synthetic data matched the size of the original dataset, while the test set remained untouched and composed solely of real data to ensure no data leakage. We used XGBoost as the machine learning model and tested on 8 small benchmark tabular datasets (with fewer than 2,000 rows) from OpenML, which are ideal candidates for augmentation.
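The augmentation protocol described above can be sketched as follows. This is an illustration under stated assumptions, not the code used in the experiments: a toy nearest-centroid classifier stands in for XGBoost so the example needs only the standard library, and the function names and the tiny dataset are made up for this sketch.

```python
from collections import Counter


def fit_centroids(X, y):
    """Toy nearest-centroid classifier (a stand-in for XGBoost)."""
    sums, counts = {}, Counter(y)
    for features, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, v in enumerate(features):
            acc[i] += v
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict(centroids, X):
    """Assign each row to the label of its closest centroid."""
    def nearest(features):
        return min(
            centroids,
            key=lambda label: sum((a - b) ** 2 for a, b in zip(features, centroids[label])),
        )
    return [nearest(f) for f in X]


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def augmentation_gain(real_train, synthetic, real_test):
    """Each argument is an (X, y) pair. Returns (baseline_acc, augmented_acc)."""
    (Xr, yr), (Xs, ys), (Xt, yt) = real_train, synthetic, real_test
    if len(Xs) != len(Xr):
        raise ValueError("synthetic set should match the real training set in size")
    # Baseline: train on real data only.
    baseline = accuracy(yt, predict(fit_centroids(Xr, yr), Xt))
    # Augmented: train on real + synthetic; the test set stays purely real.
    augmented = accuracy(yt, predict(fit_centroids(Xr + Xs, yr + ys), Xt))
    return baseline, augmented


# Made-up toy data, chosen only to exercise the code path.
real_train = ([[0.0, 0.0], [4.0, 4.0]], [0, 1])
synthetic = ([[1.0, 1.0], [5.0, 5.0]], [0, 1])
real_test = ([[0.5, 0.5], [4.5, 4.5], [2.4, 2.4]], [0, 1, 0])
baseline, augmented = augmentation_gain(real_train, synthetic, real_test)
```

The key design point is that synthetic rows only ever enter the training set. A positive difference between `augmented` and `baseline` is what the article refers to as a positive augmentation effect.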

We also evaluated synthetic data quality using a "train-on-synthetic, test-on-real" framework, extending the evaluation to include 2 larger datasets. TAEGAN was compared against several leading models, including:

  • ARF: A non-neural network method.
  • CTAB-GAN+: A GAN-based approach.
  • TVAE: A variational autoencoder (VAE) method.
  • TabDDPM: A diffusion model.
  • REaLTabFormer: An LLM-based model.
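The "train-on-synthetic, test-on-real" evaluation mentioned before the model list can be sketched in a few lines. This is an illustrative sketch only: a 1-nearest-neighbour rule stands in for the actual machine learning model, and the helper names and toy numbers are assumptions, not values from the experiments.

```python
def one_nn_predict(train_X, train_y, X):
    """1-nearest-neighbour prediction, a toy stand-in for a real model."""
    def nearest_label(x):
        best = min(
            range(len(train_X)),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(x, train_X[i])),
        )
        return train_y[best]
    return [nearest_label(x) for x in X]


def train_test_accuracy(train, test):
    """Fit on `train`, score on `test`; both are (X, y) pairs."""
    (Xtr, ytr), (Xte, yte) = train, test
    preds = one_nn_predict(Xtr, ytr, Xte)
    return sum(p == t for p, t in zip(preds, yte)) / len(yte)


# Made-up one-feature toy data.
real_train = ([[0.1], [0.9]], [0, 1])
synthetic = ([[0.2], [0.45]], [0, 1])
real_test = ([[0.0], [1.0], [0.4]], [0, 1, 0])

# Train on synthetic only, test on real: the closer this score is to the
# real-data reference, the more useful the synthetic data is for learning.
tstr = train_test_accuracy(synthetic, real_test)
reference = train_test_accuracy(real_train, real_test)
```

Comparing `tstr` with `reference` for each generator gives one number per model and dataset, which is the kind of comparison this evaluation relies on.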

Not all synthetic data generation models guarantee a positive data augmentation effect. However, TAEGAN consistently improved machine learning performance across all tested datasets, achieving the best augmentation results on 7 out of the 8 small datasets. This highlights TAEGAN’s ability to effectively augment small datasets, making it a powerful tool for scenarios where data scarcity is a challenge.

While TAEGAN excels in data augmentation, its performance on other synthetic data quality metrics, such as machine learning efficacy, doesn’t always match that of LLM-based methods. This suggests that while GANs like TAEGAN may not capture internal data relationships as effectively as LLMs, they avoid being overly complex. With a model size of only 2-5% of that of LLM-based models, TAEGAN balances performance and efficiency, making it ideal for small datasets.

Augmentation performances of different synthetic tabular data generation models

Evaluating Data Quality:

In terms of synthetic data quality, TAEGAN outperformed all other deep learning methods across a variety of datasets and metrics. However, when compared to LLM-based models like REaLTabFormer, TAEGAN doesn’t always come out on top, especially for larger datasets. This is expected, as LLMs thrive on abundant data and have significantly more parameters (around 40M) compared to TAEGAN’s compact 1-2M parameter network.

That said, LLM-based methods like REaLTabFormer come with their own challenges, particularly around privacy. These models risk reproducing data records identical to real data, raising concerns about data security. TAEGAN, on the other hand, avoids these concerns while still delivering exceptional results.
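One simple way to check for the copying risk described above is to count how many synthetic rows are exact duplicates of real rows. The sketch below is an assumption about how such a check could look. The article does not say how privacy was measured, and the function name and example records are hypothetical.

```python
def exact_copy_rate(real_rows, synthetic_rows):
    """Fraction of synthetic rows that are identical to some real row."""
    if not synthetic_rows:
        return 0.0
    real_set = {tuple(row) for row in real_rows}
    copies = sum(tuple(row) in real_set for row in synthetic_rows)
    return copies / len(synthetic_rows)


# Made-up records: (age, occupation, income).
real = [(34, "engineer", 72000), (51, "nurse", 58000)]
synthetic = [
    (34, "engineer", 72000),  # identical to a real record
    (40, "teacher", 61000),
    (29, "nurse", 55000),
    (51, "nurse", 58000),     # identical to a real record
]
rate = exact_copy_rate(real, synthetic)  # 2 of the 4 synthetic rows are copies
```

An exact-match check like this catches only verbatim copies. Near-duplicates would need a distance-based measure, which is outside the scope of this sketch.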

Data quality by machine learning performance of train-on-synthetic-test-on-real strategy

Ending Thoughts:

TAEGAN’s success lies in its ability to generate high-quality synthetic data without the complexity and resources demanded by LLMs. It typically delivered superior performance while maintaining a leaner network, often just a fraction of the parameter count of LLM-based models.

On every small dataset tested, TAEGAN’s synthetic data improved machine learning outcomes, achieving the best augmentation results on 7 of the 8 datasets and showing a clear advantage over other synthetic data generation models. Beyond data augmentation, TAEGAN also excelled in overall data quality, as measured by training models on synthetic data and testing on real data.

In short, when you need efficient, high-quality synthetic data augmentation for smaller tabular datasets, you now know who to call.

Read the full research paper here: TAEGAN: GENERATING SYNTHETIC TABULAR DATA FOR DATA AUGMENTATION
