Dr. Uzair Javaid

Dr. Uzair Javaid is the CEO and Co-Founder of Betterdata AI, a company focused on Programmable Synthetic Data generation using Generative AI and Privacy Engineering. Betterdata’s technology helps data science and engineering teams easily access and share sensitive customer/business data while complying with global data protection and AI regulations.
Previously, Uzair was a Software Engineer and Business Development Executive at Merkle Science (Series A $20M+), where he developed taint analysis techniques for blockchain wallets.

Uzair has a strong academic background in Computer Science/Engineering, with a Ph.D. from the National University of Singapore (Top 10 in the world). His research focused on designing and analyzing blockchain-based cybersecurity solutions for cyber-physical systems, with a specialization in data security and privacy engineering techniques.

In one of his Ph.D. projects, he reverse engineered the encryption algorithm of the Ethereum blockchain and ethically hacked 670 user wallets. He has been cited 600+ times across 15+ publications in globally reputable conferences and journals, and his work has received recognition including a Best Paper Award and scholarships.

In addition to his work at Betterdata AI, Uzair is an advisor at German Entrepreneurship Asia, providing guidance and expertise to support entrepreneurship initiatives in the Asian region. He has also been actively paying it forward, volunteering as a peer student support group member at the National University of Singapore and serving as a technical program committee member for the International Academy, Research, and Industry Association.

Using Incremental Relational Generator to Generate Synthetic Data from Relational Databases

Dr. Uzair Javaid
April 22, 2025


Introducing the Incremental Relational Generator (IRG): an advanced approach to synthetic relational data generation that uses deep learning to first understand the structure of a relational database and then produce synthetic data accurately and at scale while preserving database integrity.

IRG is the latest addition to our collection of state-of-the-art synthetic data generation and augmentation models, joining ARF, TabTreeFormer, and TAEGAN. To understand how they apply to your specific use cases, request a call with our data team.

1. The Challenge with Relational Synthetic Data:

Data does not work in a silo. To make sense, a data point has to be connected to other data points, and those connections form relations. Relational databases, built on such relations, are the building blocks of modern analytical and management systems, from e-commerce platforms analyzing orders to healthcare systems tracking patient records. They rely on precise relationships between tables, governed by constraints such as primary keys (PKs) and foreign keys (FKs) that ensure data consistency.

Generating relational synthetic data that accurately mirrors these interconnected structures, however, has long been a challenge. Traditional methods struggle with critical schema complexities: composite keys (e.g., multi-column identifiers), nullable foreign keys (e.g., optional references between tables), and sequential dependencies (e.g., time-series logs). The result is datasets that violate schema constraints, which makes them unusable for real-world applications. IRG addresses this by leveraging deep learning to preserve relational schema integrity, so that synthetic databases satisfy even the most complex constraints while capturing nuanced table relationships.

2. How Incremental Relational Generator Works:

Incremental Table Generation:

IRG generates tables step-by-step, with each new table informed by previously created ones. This ensures dependencies (like FK constraints) are honored. For example, a "customer orders" table is generated only after its parent "customers" table exists.
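The ordering idea can be illustrated with a topological sort over foreign-key dependencies. This is a minimal sketch, not IRG's actual implementation: the schema, the table names, and the `generation_order` helper are hypothetical, and the sketch assumes the dependency graph has no cycles.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical schema: each table maps to the parent tables it references via FKs.
foreign_key_parents = {
    "customers": set(),
    "products": set(),
    "orders": {"customers"},
    "order_items": {"orders", "products"},
}

def generation_order(parents_by_table):
    """Return tables ordered so that every parent precedes its children."""
    return list(TopologicalSorter(parents_by_table).static_order())

order = generation_order(foreign_key_parents)
print(order)
```

Generating tables in this order means every FK column can draw from parent keys that already exist.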

Context-Aware Deep Learning:

Using a modified conditional GAN framework, IRG generates synthetic data for each table while conditioning on relevant parent rows. This means a "product reviews" table is generated in context with its linked "products" and "users" tables, preserving relationships.
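To make "conditioning on relevant parent rows" concrete, the sketch below only shows how a conditioning record for one child row might be assembled from its two parents. It does not reproduce the GAN itself; the table contents, column names, and the `build_context` helper are assumptions made for illustration.

```python
# Hypothetical parent tables, keyed by primary key.
products = {101: {"category": "audio", "price": 59.0}}
users = {7: {"country": "SG", "age": 31}}

def build_context(review_keys, products, users):
    """Merge the attributes of both parent rows into one conditioning record."""
    product = products[review_keys["product_id"]]
    user = users[review_keys["user_id"]]
    context = {f"product_{name}": value for name, value in product.items()}
    context.update({f"user_{name}": value for name, value in user.items()})
    return context

# A review row that references product 101 and user 7.
context = build_context({"product_id": 101, "user_id": 7}, products, users)
print(context)
```

A conditional generator would take a record like this as input alongside random noise, so that the generated review attributes depend on the linked product and user.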

Sequential Dependency Modeling:

For time-series data (e.g., payment logs), IRG uses a conditional time-series model to capture patterns like order timestamps or user activity sequences.

Composite and Nullable Key Support:

Unlike prior methods, IRG handles overlapping composite keys (e.g., a PK made of two FKs) and nullable FKs (e.g., an "assisted_by" column that can be empty).
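The two key types can be illustrated with a small sampler. This is a sketch under stated assumptions: uniform random sampling stands in for the learned model, the helper names and the null rate are hypothetical, and enumerating all pairs is only practical for small key sets.

```python
import random

def sample_nullable_fk(parent_ids, null_rate, rng):
    """Return None with probability null_rate, otherwise an existing parent ID."""
    if rng.random() < null_rate:
        return None
    return rng.choice(parent_ids)

def sample_composite_keys(left_ids, right_ids, count, rng):
    """Draw `count` distinct (left, right) pairs so the composite PK stays unique."""
    all_pairs = [(left, right) for left in left_ids for right in right_ids]
    if count > len(all_pairs):
        raise ValueError("not enough distinct key combinations")
    return rng.sample(all_pairs, count)

rng = random.Random(0)
game_ids = [1, 2, 3]
player_ids = [10, 11, 12, 13]

# Composite PK made of two FKs: (game_id, player_id).
appearances = sample_composite_keys(game_ids, player_ids, 6, rng)
# Nullable FK: an assister that may be absent.
assisters = [sample_nullable_fk(player_ids, 0.4, rng) for _ in range(8)]
```

Sampling pairs without replacement keeps the composite key unique, and every non-null reference points at an existing parent row.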

Guaranteed Constraint Accuracy:

IRG guarantees 100% validity for PKs and FKs: no duplicate keys and no broken links. This is critical for analytics-ready synthetic data.
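A check of this kind can be written independently of any generator. The following sketch counts duplicate primary keys and dangling foreign keys for one parent/child pair; the `check_constraints` helper and the sample rows are hypothetical, and null FKs are treated as valid.

```python
def check_constraints(parent_rows, child_rows, pk, fk):
    """Return (duplicate_pk_count, broken_fk_count) for a parent/child table pair."""
    parent_keys = [row[pk] for row in parent_rows]
    duplicate_pks = len(parent_keys) - len(set(parent_keys))
    valid_keys = set(parent_keys)
    broken_fks = sum(
        1
        for row in child_rows
        if row[fk] is not None and row[fk] not in valid_keys
    )
    return duplicate_pks, broken_fks

customers = [{"customer_id": 1}, {"customer_id": 2}]
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 2},
    {"order_id": 12, "customer_id": None},  # a null FK is allowed
]

result = check_constraints(customers, orders, "customer_id", "customer_id")
print(result)  # (0, 0) means no duplicates and no broken links
```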

3. Advantages Over Existing Methods:

Handles Complex Schemas:

Supports composite keys, cyclic dependencies, and tables with multiple parents (e.g., a "step-sibling" table sharing a parent).

Scalability:

By decomposing tasks into smaller subtasks, IRG efficiently scales to databases with millions of rows.

Sequential Data:

Captures time-series patterns, like purchase histories, without manual tuning.

Privacy Preservation:

Avoids overfitting to exact key distributions, reducing leakage risks.

4. Experimental Results: 

We ran experiments on three relational databases of varying scale and domain. Each has a sufficiently complex schema and a mix of categorical and continuous columns in addition to IDs. We compared IRG with leading models such as IND, HMA, and RC-TGAN, and IRG outperformed them on relational synthetic data generation in all three experiments.

Football Database:

Challenge: Composite PKs (e.g., game_id + player_id in appearances) and nullable FKs (e.g., assister_id in shots).

IRG’s Solution: Enforced 100% uniqueness for composite keys and valid nullable references.

Outcome: Generated 5,000 games without errors; outperformed competitors in 8/10 analytical queries.

Brazilian E-Commerce Database:

Challenge: Composite PKs with serial IDs (e.g., order_id + item_number).

IRG’s Solution: Guaranteed 100% valid serial IDs, avoiding duplicates.

Outcome: Outperformed IND in metrics like order-item distributions while other competitors crashed during training.
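One simple way to see why serial IDs can be made valid by construction is to number items within each order as they are emitted. This is an illustrative sketch only, not a description of IRG's internals; the `assign_item_numbers` helper is hypothetical.

```python
def assign_item_numbers(order_ids):
    """Pair each order_id occurrence with a 1-based running item number."""
    next_number = {}
    keys = []
    for order_id in order_ids:
        next_number[order_id] = next_number.get(order_id, 0) + 1
        keys.append((order_id, next_number[order_id]))
    return keys

keys = assign_item_numbers(["A", "A", "B", "A", "B"])
print(keys)  # [('A', 1), ('A', 2), ('B', 1), ('A', 3), ('B', 2)]
```

Because each counter only moves forward, every (order_id, item_number) pair is unique and the numbers within an order have no gaps.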

Beatport Tracks Database:

Challenge: Million-row tables with composite keys (e.g., artist_id + track_id).

IRG’s Solution: Scaled seamlessly, avoiding memory bottlenecks.

Outcome: Achieved 100% constraint compliance; statistical metrics (K-S < 0.2, Wasserstein < 0.05) mirrored real data.
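For readers unfamiliar with the two metrics, the sketch below computes them for a single numeric column using only the standard library. It assumes equal sample sizes for the Wasserstein distance and applies no normalization; whether the reported figures were normalized is not stated here, so treat the helpers as illustrative rather than as the evaluation code.

```python
from bisect import bisect_right

def ks_statistic(real, synthetic):
    """Two-sample K-S statistic: the largest gap between the two empirical CDFs."""
    real_sorted = sorted(real)
    synth_sorted = sorted(synthetic)
    gap = 0.0
    # The largest gap always occurs at one of the observed values.
    for value in real_sorted + synth_sorted:
        cdf_real = bisect_right(real_sorted, value) / len(real_sorted)
        cdf_synth = bisect_right(synth_sorted, value) / len(synth_sorted)
        gap = max(gap, abs(cdf_real - cdf_synth))
    return gap

def wasserstein_equal_size(real, synthetic):
    """1-D Wasserstein-1 distance for two samples of equal size."""
    if len(real) != len(synthetic):
        raise ValueError("this sketch assumes equal sample sizes")
    pairs = zip(sorted(real), sorted(synthetic))
    return sum(abs(a - b) for a, b in pairs) / len(real)

real_column = [1.0, 2.0, 3.0, 4.0]
synthetic_column = [1.1, 2.1, 2.9, 4.2]
print(ks_statistic(real_column, synthetic_column))
print(wasserstein_equal_size(real_column, synthetic_column))
```

Lower values on both metrics mean the synthetic column's distribution is closer to the real one.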

5. Conclusion:

IRG represents a shift in how relational synthetic data is generated. By using deep learning to understand complex database schemas, it generates realistic relational synthetic data at scale, giving enterprises synthetic data that is not only realistic but also relational, scalable, and constraint-perfect.

To read the complete research paper, click here.
