Data without utility is just noise. To cut through this noise, Betterdata has developed a tabular synthetic data generation model that we are sure data teams will appreciate. We call it Adversarial Random Forests (ARF).
The purpose?
Empowering enterprises to generate high-utility synthetic data, particularly for applications like fraud detection, healthcare, and finance, where balancing data privacy and utility is critical. Enabling data teams to train high-impact models that just work, whether the task is data analysis, forecasting, customer behavior analysis, or something else entirely.
What is ARF?
Adversarial Random Forests (ARF) is a cutting-edge technique for generating and augmenting high-utility tabular synthetic data that closely mimics real-world datasets. By leveraging Random Forests in an adversarial setting, ARF creates statistically similar synthetic data, making it ideal for applications like fraud detection, healthcare diagnostics, and financial modeling. Additionally, ARF enables data augmentation, improving machine learning model performance by addressing data scarcity and imbalance. While ARF is primarily focused on utility, at Betterdata we implement Differential Privacy across the entire synthetic data pipeline, ensuring data privacy and compliance specific to your use case and legal framework.
How Does ARF Work in Practice?
The ARF model operates through a series of iterative steps that involve training Random Forests and sampling synthetic data. Here’s a step-by-step breakdown of the process:
Step 1: Data Permutation and Labeling
- Suppose you start with an original dataset of 200 rows.
- Permute the data column by column to create a new set of 200 rows. This breaks the column correlations, effectively creating a "false" dataset.
- Label the original data as true and the permuted data as false.
- Combine these 400 rows (200 true + 200 false) to train the first Random Forest model, RF1.
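Step 1 can be sketched in a few lines of Python. This is a minimal illustration, not Betterdata's implementation: the 200-row `real_df` with an `age`/`income` dependency is a made-up stand-in for your original data.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Hypothetical 200-row "original" dataset with a strong cross-column dependency.
age = rng.integers(18, 80, size=200)
real_df = pd.DataFrame({
    "age": age,
    "income": 1_000 * age + rng.normal(0, 5_000, size=200),
})

# Permute each column independently: the marginals are preserved, but the
# age-income correlation is destroyed, yielding the "false" dataset.
fake_df = real_df.apply(lambda col: rng.permutation(col.to_numpy()))

# Label original rows 1 ("true") and permuted rows 0 ("false"),
# then train RF1 on all 400 rows.
X = pd.concat([real_df, fake_df], ignore_index=True)
y = np.array([1] * len(real_df) + [0] * len(fake_df))
rf1 = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
```

Because the permutation only breaks correlations, RF1 is forced to learn the joint structure of the real data rather than its marginals.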
Step 2: Sampling Synthetic Data from RF1
- Sample 200 new rows of data from RF1’s leaves. Here’s how:
- Randomly select one tree from RF1.
- Choose a leaf based on the data coverage (e.g., if leaves have 100, 80, and 20 data points, the selection probabilities are 0.5, 0.4, and 0.1, respectively).
- Extract all "true" labeled data from the chosen leaf.
- For each column, randomly select a value:
- For categorical columns, choose a category based on its distribution.
- For continuous columns, estimate a Truncated Normal Distribution (TND) using the available values and sample from it.
- Label the newly sampled 200 rows as false.
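The leaf-sampling procedure above can be sketched as follows. This is a simplified illustration, not Betterdata's implementation: the toy dataset is an assumption, the leaf weights are computed from the real rows' coverage only, and categorical columns (which would be drawn by their within-leaf frequencies) are omitted for brevity, so only the truncated-normal branch is shown.

```python
import numpy as np
import pandas as pd
from scipy.stats import truncnorm
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)

# Hypothetical 200-row "real" dataset and an RF1 trained as in Step 1.
age = rng.integers(18, 80, size=200)
X_real = pd.DataFrame({"age": age.astype(float),
                       "income": 1_000 * age + rng.normal(0, 5_000, size=200)})
X_fake = X_real.apply(lambda c: rng.permutation(c.to_numpy()))
X = pd.concat([X_real, X_fake], ignore_index=True)
y = np.array([1] * 200 + [0] * 200)
rf1 = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

def sample_row(rf, real):
    # Randomly select one tree from the forest.
    tree = rf.estimators_[rng.integers(len(rf.estimators_))]
    # Find which leaf each real ("true") row lands in, then pick a leaf with
    # probability proportional to its coverage of real rows.
    leaves = tree.apply(real.to_numpy())
    ids, counts = np.unique(leaves, return_counts=True)
    leaf = rng.choice(ids, p=counts / counts.sum())
    in_leaf = real[leaves == leaf]
    # Sample each continuous column from a truncated normal fitted to the
    # values observed in the chosen leaf.
    row = {}
    for col in real.columns:
        vals = in_leaf[col].to_numpy()
        if np.ptp(vals) < 1e-12:          # degenerate leaf: only one value seen
            row[col] = float(vals[0])
            continue
        mu, sd = vals.mean(), vals.std()
        a, b = (vals.min() - mu) / sd, (vals.max() - mu) / sd
        row[col] = float(truncnorm.rvs(a, b, loc=mu, scale=sd, random_state=rng))
    return row

synthetic = pd.DataFrame([sample_row(rf1, X_real) for _ in range(5)])
```

In a full run you would draw 200 such rows, label them false, and feed them into the next round of training.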
Step 3: Training RF2 and Iterating
- Combine the original 200 true rows with the newly sampled 200 false rows to train a second Random Forest, RF2.
- Repeat the sampling process to generate another 200 rows of synthetic data.
- Continue this iterative process until the Random Forest can no longer distinguish between true and false data (i.e., prediction accuracy drops to 0.5).
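The outer train/sample/retrain loop can be sketched as follows. To keep the sketch short and self-contained, the per-round sampler below is a deliberately simplified stand-in (a column-wise bootstrap of the real data) rather than the leaf-based sampler of Step 2; the 0.05 tolerance and the cap of 10 rounds are also illustrative choices, not part of the algorithm.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Hypothetical correlated real data (200 rows), as in Step 1.
x = rng.normal(0, 1, 200)
real = pd.DataFrame({"x": x, "y": 2 * x + rng.normal(0, 0.5, 200)})

# Round 0: the "false" data is a column-wise permutation of the real data.
fake = real.apply(lambda c: rng.permutation(c.to_numpy()))

acc = 1.0
for round_no in range(10):
    X = pd.concat([real, fake], ignore_index=True)
    y = np.array([1] * len(real) + [0] * len(fake))
    clf = RandomForestClassifier(n_estimators=50, random_state=round_no)

    # Stopping criterion: cross-validated accuracy near 0.5 means the forest
    # can no longer tell real rows from synthetic ones.
    acc = cross_val_score(clf, X, y, cv=5).mean()
    if abs(acc - 0.5) < 0.05:
        break

    clf.fit(X, y)
    # Stand-in sampler: column-wise bootstrap of the real data. The actual
    # algorithm draws rows from the fitted forest's leaves instead (Step 2),
    # which is what lets the accuracy fall toward 0.5 over rounds.
    fake = real.apply(lambda c: rng.choice(c.to_numpy(), size=len(c)))
```

Each round's forest doubles as the generator for the next round's false data, which is what gives the method its adversarial character.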
When Does the ARF Algorithm Stop?
The ARF algorithm can stop at any point after training RF1. However, the ideal stopping criterion is when the Random Forest’s prediction accuracy for distinguishing true and false data reaches 0.5. At this point, the synthetic data is statistically indistinguishable from the real data, ensuring high utility.
ARF as a Generator Model
In the ARF framework, each Random Forest (RF1, RF2, etc.) acts as a generator model. The process of sampling false data is essentially the process of generating synthetic data. This makes ARF a powerful tool for creating datasets that can be used for machine learning, testing, and analysis without compromising privacy.
Real-World Application: IEEE Fraud Detection Dataset
To demonstrate the scalability and efficiency of ARF, consider its application on the IEEE Fraud Detection Dataset:
- Training + density estimation time: 3 hours 18 minutes
- Sampling 10 million rows: 8 hours 19 minutes
- Total time: 11 hours 37 minutes
This showcases ARF’s ability to handle large-scale datasets efficiently, making it a practical solution for real-world problems.
Key Benefits of Using ARF for Synthetic Data Generation
- High Utility: ARF-generated data closely resembles the original dataset, ensuring its usefulness for analysis and modeling.
- Scalability: ARF can handle large datasets, as demonstrated by its performance on the IEEE Fraud Detection Dataset.
- Flexibility: It works seamlessly with both categorical and continuous data.
Why Synthetic Data?
Synthetic data is artificially generated, which means no personally identifiable information (PII) is present in the synthetic dataset.
This makes it dramatically easier for enterprises to use and share data freely and without compromise, accelerating the pace of innovation and empowering organizations to compete in the global tech and AI race.
Betterdata’s programmatic synthetic data generation models are designed to:
- Closely mimic real data statistics and distributions
- Augment data for scarce datasets or edge cases
- Allow seamless integration and scalability depending on the organization’s data needs
- Provide quantifiable privacy guarantees
This opens up previously closed doors for organizations that can now safely and effectively utilize data for AI/ML training, data analysis, digital transformation, innovation, and so on.