Interest in synthetic tabular data generation has exploded in the past couple of years, and rightly so. The ability to generate high-quality synthetic tabular data has improved greatly, with many startups, research labs, and organizations building high-performing generative models. This has enabled organizations to quickly access realistic data that is both private and reliable and use it for machine learning, AI development, data analysis, data augmentation, software testing, and more.
However, most research on synthetic data generation has focused on larger datasets, evaluating performance on metrics such as column-wise statistical distributions and inter-feature correlations.
The drawback?
The utility of synthetic data for smaller datasets, especially ones where data is scarce, is often overlooked.
What is TAEGAN?
Large language models (LLMs) have set the benchmark for synthetic tabular data generation, but their massive scale and complexity are excessive for smaller datasets. The Tabular Auto-Encoder Generative Adversarial Network (TAEGAN), developed by the AI and research team at Betterdata, is a new and one-of-a-kind GAN-based framework designed to generate high-quality tabular data with efficiency and precision.
What sets TAEGAN apart is its use of a masked auto-encoder as the generator, a groundbreaking approach that brings the power of self-supervised pre-training to tabular data generation. TAEGAN effectively exposes the network to more information, enabling it to produce synthetic data that is both realistic and reliable without relying on larger, more complex models.
By “seeing” more information during training, TAEGAN balances performance and complexity, making it especially effective for small or scarce datasets while still delivering top-tier synthetic data.
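To make the masked auto-encoder idea concrete, here is a minimal, illustrative sketch of masked reconstruction on tabular rows. This is not TAEGAN's actual architecture; the toy MLP encoder/decoder, hidden size, and masking ratio are all assumptions for illustration.

```python
import torch
import torch.nn as nn

class MaskedTabularAE(nn.Module):
    def __init__(self, n_features: int, hidden: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, hidden), nn.ReLU())
        self.decoder = nn.Linear(hidden, n_features)

    def forward(self, x, mask_ratio: float = 0.3):
        # Hide a random subset of feature values; the network must
        # reconstruct them from the visible features, which forces it to
        # learn cross-feature dependencies (the self-supervised signal).
        mask = torch.rand_like(x) < mask_ratio
        x_masked = x.masked_fill(mask, 0.0)
        return self.decoder(self.encoder(x_masked)), mask

model = MaskedTabularAE(n_features=10)
x = torch.randn(32, 10)                  # a toy batch of 32 tabular rows
recon, mask = model(x)
loss = ((recon - x)[mask] ** 2).mean()   # loss only on masked positions
loss.backward()
```

Because the reconstruction loss is computed only on the masked positions, the network cannot simply copy its input; it has to model how each feature relates to the others.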
TAEGAN Excels in Synthetic Data Augmentation:
When it comes to synthetic tabular data generation for small datasets, TAEGAN has proven to be exceptional, especially for data augmentation.
Evaluating Data Augmentation:
To assess the effectiveness of synthetic data augmentation, we conducted experiments using a combination of original and synthetic data for training. The synthetic data matched the size of the original dataset, while the test set remained untouched and composed solely of real data to ensure no data leakage. We used XGBoost as the machine learning model and tested on 8 small benchmark tabular datasets (with fewer than 2,000 rows) from OpenML, which are ideal candidates for augmentation.
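The protocol is straightforward to reproduce in outline. The sketch below uses toy data in place of an actual OpenML dataset and a trained generator, but follows the same recipe: concatenate real and synthetic training rows, train XGBoost, and score on a purely real held-out test set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from xgboost import XGBClassifier

# Toy "real" dataset standing in for a small OpenML benchmark.
X, y = make_classification(n_samples=1500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Stand-in for generator output: a synthetic set matching the training size.
# In the actual experiments this would come from TAEGAN's sampler.
rng = np.random.default_rng(0)
X_syn = X_train + rng.normal(0.0, 0.1, X_train.shape)
y_syn = y_train.copy()

baseline = XGBClassifier().fit(X_train, y_train)
augmented = XGBClassifier().fit(np.vstack([X_train, X_syn]),
                                np.concatenate([y_train, y_syn]))

for name, clf in [("real only", baseline), ("real + synthetic", augmented)]:
    print(name, f1_score(y_test, clf.predict(X_test), average="macro"))
```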
We also evaluated synthetic data quality using a "train-on-synthetic, test-on-real" (TSTR) framework, extending the evaluation to include 2 larger datasets (a code sketch of TSTR follows the list below). TAEGAN was compared against several leading models, including:
- ARF: A non-neural network method.
- CTAB-GAN+: A GAN-based approach.
- TVAE: A variational autoencoder (VAE) method.
- TabDDPM: A diffusion model.
- REaLTabFormer: An LLM-based model.
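Continuing the toy setup from the previous sketch, the TSTR check trains the classifier on synthetic rows alone and scores it on the untouched real test set; the gap to the real-data baseline reflects how faithfully the synthetic data preserves the learnable signal.

```python
# Train on synthetic rows only, test on the purely real test set (TSTR).
tstr = XGBClassifier().fit(X_syn, y_syn)
print("TSTR F1:", f1_score(y_test, tstr.predict(X_test), average="macro"))
```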
Not all synthetic data generation models guarantee a positive data augmentation effect. However, TAEGAN consistently improved machine learning performance across all tested datasets, achieving the best augmentation results on 7 out of the 8 small datasets. This highlights TAEGAN’s ability to effectively augment small datasets, making it a powerful tool for scenarios where data scarcity is a challenge.
While TAEGAN excels in data augmentation, it’s worth noting that its performance on other synthetic data quality metrics, such as machine learning efficacy, doesn’t always match that of LLM-based methods. This suggests that while GANs like TAEGAN may not capture internal data relationships as effectively as LLMs, they avoid being overly complex. With model sizes only 2-5% of those of LLM-based methods, TAEGAN balances performance and efficiency, making it ideal for small datasets.
Evaluating Data Quality:
In terms of synthetic data quality, TAEGAN outperformed all other deep learning methods across a variety of datasets and metrics. However, when compared to LLM-based models like REaLTabFormer, TAEGAN doesn’t always come out on top, especially for larger datasets. This is expected, as LLMs thrive on abundant data and have significantly more parameters (around 40M) compared to TAEGAN’s compact 1-2M parameter network.
That said, LLM-based methods like REaLTabFormer come with their own challenges, particularly around privacy. These models risk reproducing records identical to the real data, raising data security concerns. TAEGAN, on the other hand, avoids these risks while still delivering exceptional results.
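One simple way to probe the memorization risk described above (again continuing the toy setup) is to count synthetic rows that exactly reproduce a real record; stricter privacy audits also measure the distance from each synthetic row to its closest real record.

```python
import pandas as pd

real_df = pd.DataFrame(X_train)
syn_df = pd.DataFrame(X_syn)

# Inner-join on all columns: any surviving row is a synthetic record that
# exactly duplicates a real one.
copies = syn_df.merge(real_df.drop_duplicates(), how="inner")
print(f"exact copies: {len(copies)} of {len(syn_df)} synthetic rows")
```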
Ending Thoughts:
TAEGAN’s success lies in its ability to generate high-quality synthetic data without the complexity and resources demanded by LLMs. It typically delivered superior performance while maintaining a leaner network, often just a fraction of the parameter count of an LLM.
In every dataset tested, TAEGAN’s synthetic data improved machine learning outcomes, achieving the best results on 7 of the 8 small datasets, showcasing a clear advantage over other synthetic data generation models. Beyond data augmentation, TAEGAN also excelled in overall data quality, as measured by training models on synthetic data and testing on real data.
In short, for synthetic data generation when you need efficient, high-quality augmentation for smaller tabular datasets, you now know who to call.
Read the full research paper here: TAEGAN: Generating Synthetic Tabular Data for Data Augmentation