Introducing the Incremental Relational Generator (IRG), a deep learning approach to synthetic relational data generation that first learns the structure of a relational database and then produces synthetic data accurately and at scale while preserving database integrity.
IRG is the latest addition to our collection of state-of-the-art synthetic data generation and augmentation models, which also includes ARF, TabTreeFormer, and TAEGAN. To understand how they apply to your specific use cases, request a call with our data team.
1. The Challenge with Relational Synthetic Data:
Data does not work in a silo. To be meaningful, a data point has to be connected to other data points, and those connections form relations. Relational databases are the building blocks of modern analytical and management systems, from e-commerce platforms analyzing orders to healthcare systems tracking patient records. These databases rely on precise relationships between tables, governed by constraints such as primary keys (PKs) and foreign keys (FKs), to ensure data consistency.
Generating relational synthetic data that accurately mirrors these interconnected structures, however, has long been a challenge. Traditional methods struggle with critical schema complexities: composite keys (e.g., multi-column identifiers), nullable foreign keys (e.g., optional references between tables), and sequential dependencies (e.g., time-series logs). The result is datasets that violate schema constraints, which makes them unusable for real-world applications. IRG addresses this by leveraging deep learning to preserve relational schema integrity, so that synthetic databases satisfy even the most complex constraints while capturing nuanced table relationships.
2. How Incremental Relational Generator Works:
Incremental Table Generation:
IRG generates tables step-by-step, with each new table informed by previously created ones. This ensures dependencies (like FK constraints) are honored. For example, a "customer orders" table is generated only after its parent "customers" table exists.
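The ordering idea can be illustrated with a topological sort over the FK graph. This is a minimal sketch and not IRG's actual scheduler: the `schema` dictionary and the `generation_order` helper are hypothetical names, and the sketch assumes an acyclic schema (the standard-library sorter raises `CycleError` otherwise).

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical schema: each table maps to the set of parent tables
# it references through foreign keys.
schema = {
    "customers": set(),
    "products": set(),
    "orders": {"customers"},
    "order_items": {"orders", "products"},
}

def generation_order(parents_by_table):
    """Return the tables ordered so that every parent precedes its children."""
    return list(TopologicalSorter(parents_by_table).static_order())

order = generation_order(schema)
print(order)
```

Any order in which "customers" comes before "orders", and both "orders" and "products" come before "order_items", is acceptable; the exact order among unrelated tables is not significant.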
Context-Aware Deep Learning:
Using a modified conditional GAN framework, IRG generates synthetic data for each table while conditioning on relevant parent rows. This means a "product reviews" table is generated in context with its linked "products" and "users" tables, preserving relationships.
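To make "conditioning on parent rows" concrete, the sketch below pairs each child row with the features of the parent rows it references. It only illustrates how a conditioning context could be assembled; it does not reproduce IRG's network or its feature encoding, and the tables, column names, and the `conditioning_pairs` helper are all hypothetical.

```python
# Hypothetical parent tables, indexed by primary key.
products = {1: {"category": "books", "price": 12.5},
            2: {"category": "games", "price": 40.0}}
users = {10: {"age": 31}, 11: {"age": 45}}

# Hypothetical child table with two foreign keys.
reviews = [
    {"product_id": 1, "user_id": 10, "rating": 5},
    {"product_id": 2, "user_id": 11, "rating": 3},
]

def conditioning_pairs(child_rows, parents):
    """Build (context, target) pairs for a conditional generator.

    `parents` maps an FK column name to (prefix, parent lookup by PK).
    The context holds the referenced parents' features; the target holds
    the child's own non-key columns.
    """
    pairs = []
    for row in child_rows:
        context = {}
        for fk_col, (prefix, lookup) in parents.items():
            parent = lookup[row[fk_col]]
            context.update({f"{prefix}.{k}": v for k, v in parent.items()})
        target = {k: v for k, v in row.items() if k not in parents}
        pairs.append((context, target))
    return pairs

pairs = conditioning_pairs(
    reviews,
    {"product_id": ("product", products), "user_id": ("user", users)},
)
print(pairs[0])
```

A conditional model would then be trained to produce the target given the context, so that generated reviews stay consistent with the product and user they point to.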
Sequential Dependency Modeling:
For time-series data (e.g., payment logs), IRG uses a conditional time-series model to capture patterns like order timestamps or user activity sequences.
Composite and Nullable Key Support:
Unlike prior methods, IRG handles overlapping composite keys (e.g., a PK made of two FKs) and nullable FKs (e.g., an "assisted_by" column that can be empty).
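For readers less familiar with these two schema features, here is a small illustration using Python's built-in `sqlite3` module. The table and column names are hypothetical (only `assisted_by` comes from the example above), and the snippet shows what the constraints mean in a database, not how IRG generates data that satisfies them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
conn.executescript("""
CREATE TABLE games   (game_id   INTEGER PRIMARY KEY);
CREATE TABLE players (player_id INTEGER PRIMARY KEY);
-- Composite PK made of two FKs.
CREATE TABLE appearances (
    game_id   INTEGER NOT NULL REFERENCES games(game_id),
    player_id INTEGER NOT NULL REFERENCES players(player_id),
    PRIMARY KEY (game_id, player_id)
);
-- assisted_by is a nullable FK: empty is allowed, a dangling ID is not.
CREATE TABLE shots (
    shot_id     INTEGER PRIMARY KEY,
    shooter_id  INTEGER NOT NULL REFERENCES players(player_id),
    assisted_by INTEGER REFERENCES players(player_id)
);
""")

conn.execute("INSERT INTO games VALUES (1)")
conn.execute("INSERT INTO players VALUES (7)")
conn.execute("INSERT INTO appearances VALUES (1, 7)")
conn.execute("INSERT INTO shots VALUES (100, 7, NULL)")  # null FK is valid

def rejected(sql):
    """Return True if the database refuses the statement as a constraint violation."""
    try:
        conn.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True

duplicate_pk = rejected("INSERT INTO appearances VALUES (1, 7)")  # repeated composite key
broken_fk = rejected("INSERT INTO shots VALUES (101, 7, 999)")    # player 999 does not exist
print(duplicate_pk, broken_fk)
```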
Guaranteed Constraint Accuracy:
IRG guarantees 100% validity for PKs and FKs: no duplicate keys and no broken links. This is critical for analytics-ready synthetic data.
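One way to audit a generated database against this guarantee is a pair of simple checks like the ones below. This is a verification sketch under stated assumptions, not part of IRG: the function names and the sample rows are hypothetical, and tables are represented as plain lists of dictionaries.

```python
def primary_key_is_valid(rows, key_cols):
    """True if no key column is null and every (possibly composite) key is unique."""
    seen = set()
    for row in rows:
        key = tuple(row[c] for c in key_cols)
        if None in key or key in seen:
            return False
        seen.add(key)
    return True

def foreign_key_is_valid(child_rows, fk_col, parent_rows, parent_pk, nullable=False):
    """True if every non-null FK value exists in the parent; nulls pass only if nullable."""
    parent_keys = {row[parent_pk] for row in parent_rows}
    for row in child_rows:
        value = row[fk_col]
        if value is None:
            if not nullable:
                return False
        elif value not in parent_keys:
            return False
    return True

# Hypothetical synthetic output to audit.
customers = [{"customer_id": 1}, {"customer_id": 2}]
orders = [
    {"order_id": 10, "item_number": 1, "customer_id": 1, "referred_by": None},
    {"order_id": 10, "item_number": 2, "customer_id": 1, "referred_by": 2},
]

pk_ok = primary_key_is_valid(orders, ["order_id", "item_number"])
fk_ok = foreign_key_is_valid(orders, "customer_id", customers, "customer_id")
nullable_ok = foreign_key_is_valid(orders, "referred_by", customers, "customer_id", nullable=True)
print(pk_ok, fk_ok, nullable_ok)
```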
3. Advantages Over Existing Methods:
Handles Complex Schemas:
Supports composite keys, cyclic dependencies, and tables with multiple parents (e.g., a "step-sibling" table sharing a parent).
Scalability:
By decomposing tasks into smaller subtasks, IRG efficiently scales to databases with millions of rows.
Sequential Data:
Captures time-series patterns, like purchase histories, without manual tuning.
Privacy Preservation:
Avoids overfitting to exact key distributions, reducing leakage risks.
4. Experimental Results:
We ran experiments on three relational databases of varying scale and domain, each with a sufficiently complex schema and a mix of categorical and continuous columns beyond IDs, comparing IRG with leading models such as IND, HMA, and RC-TGAN. In all three experiments, IRG outperformed the other models in relational synthetic data generation.
Football Database:
Challenge: Composite PKs (e.g., game_id + player_id in appearances) and nullable FKs (e.g., assister_id in shots).
IRG’s Solution: Enforced 100% uniqueness for composite keys and valid nullable references.
Outcome: Generated 5,000 games without errors; outperformed competitors in 8/10 analytical queries.
Brazilian E-Commerce Database:
Challenge: Composite PKs with serial IDs (e.g., order_id + item_number).
IRG’s Solution: Guaranteed 100% valid serial IDs, avoiding duplicates.
Outcome: Outperformed IND on metrics such as order-item distributions, while the other competitors crashed during training.
Beatport Tracks Database:
Challenge: Million-row tables with composite keys (e.g., artist_id + track_id).
IRG’s Solution: Scaled seamlessly, avoiding memory bottlenecks.
Outcome: Achieved 100% constraint compliance; statistical metrics (K-S < 0.2, Wasserstein < 0.05) mirrored real data.
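For reference, the two statistics named above can be computed for a single numeric column as follows. This is a plain-Python sketch of the standard two-sample Kolmogorov-Smirnov statistic and the one-dimensional Wasserstein-1 distance; the sample values and function names are hypothetical, and how the reported figures were computed (for example, whether columns were normalized before measuring the Wasserstein distance) is not stated here.

```python
from bisect import bisect_right

def _ecdf(sorted_values, x):
    """Empirical CDF: fraction of values less than or equal to x."""
    return bisect_right(sorted_values, x) / len(sorted_values)

def ks_statistic(real, synthetic):
    """Largest absolute gap between the two empirical CDFs."""
    a, b = sorted(real), sorted(synthetic)
    points = sorted(set(a) | set(b))
    return max(abs(_ecdf(a, x) - _ecdf(b, x)) for x in points)

def wasserstein_1d(real, synthetic):
    """Area between the two empirical CDFs (Wasserstein-1 distance in 1D)."""
    a, b = sorted(real), sorted(synthetic)
    points = sorted(set(a) | set(b))
    total = 0.0
    for left, right in zip(points, points[1:]):
        total += abs(_ecdf(a, left) - _ecdf(b, left)) * (right - left)
    return total

# Hypothetical column values from a real and a synthetic table.
real_col = [0.0, 1.0, 2.0, 3.0]
synth_col = [0.0, 1.0, 2.0, 4.0]

ks = ks_statistic(real_col, synth_col)
wd = wasserstein_1d(real_col, synth_col)
print(ks, wd)
```

Both values are 0 for identical samples and grow as the distributions diverge. The K-S statistic is bounded by 1, whereas the Wasserstein distance is expressed in the units of the column, so a threshold on it is only meaningful relative to the column's scale.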
5. Conclusion:
IRG represents a paradigm shift in relational synthetic data generation. By using deep learning to understand complex database schemas, IRG generates relational synthetic data at scale, so enterprises can innovate with synthetic data that is not just realistic but also relational, scalable, and constraint-perfect.
To read the complete research paper, click here.