Synthetic data is a branch of Generative AI (GenAI) used as a substitute for real data when training Machine Learning (ML) models. A generative model is trained on real data, from which it produces untraceable datasets that mimic the statistical properties of real-world data. Because synthetic data contains no information about real individuals, it avoids privacy-leakage implications, allowing data consumers to access high-quality, privacy-preserving data quickly and safely.
ML models require massive amounts of data to operate. But data is hard to come by, and good-quality data takes time to organize and clean. Real-world data is protected by data-protection regulations such as the Personal Data Protection Act (PDPA) in Singapore and the General Data Protection Regulation (GDPR), which impose heavy fines on companies found in breach of data-privacy laws. Beyond compliance, real data poses another key challenge: it often reflects human biases, tends to be imbalanced, and is prone to data drift over time.
An analysis of more than 5,000 images created with Stable Diffusion found that it takes racial and gender disparities to extremes — worse than those found in the real world. - Bloomberg
While all of these are valid problems, bias in data can render ML models completely useless. In this article, we will look at what bias is and how synthetic data can help you remove it from ML.
Read Also: Improving Machine Learning Models with Synthetic Data
Undersampling occurs when certain classes or groups within the dataset are underrepresented. This can lead to models that perform poorly on these minority classes because they have not been adequately learned during training.
Labeling errors refer to instances where the data has been incorrectly labeled. This can introduce noise into the dataset and adversely affect model performance.
User-generated bias occurs when the actions of data analysts or engineers unintentionally introduce bias during data processing and model training.
Skewed samples occur when certain features or groups are disproportionately represented in the dataset, leading to biased models.
Limited features in training sets refer to insufficient or incomplete data attributes used to train ML models, which prevents the model from learning the full picture.
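To make the undersampling problem concrete, here is a minimal sketch (using a hypothetical toy dataset, not real data) that detects class imbalance and fixes it with the simplest remedy, random oversampling of the minority class:

```python
import random
from collections import Counter

random.seed(0)

# Toy dataset: each sample is (feature_vector, label); class "B" is rare.
data = [([i, i + 1], "A") for i in range(95)] + [([i, i - 1], "B") for i in range(5)]
counts = Counter(label for _, label in data)
target = max(counts.values())

# Random oversampling: resample minority-class rows (with replacement)
# until every class reaches the majority-class count.
balanced = list(data)
for cls, n in counts.items():
    pool = [row for row in data if row[1] == cls]
    balanced += random.choices(pool, k=target - n)

print(Counter(label for _, label in balanced))
```

Note that duplicating rows like this only reweights the minority class; it adds no new information, which is exactly the gap synthetic data generation aims to fill.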
The impact of bias on AI is profound and multifaceted, affecting the accuracy, fairness, and societal trust of AI systems. Bias can arise from skewed training data, biased algorithms, or prejudiced design decisions, producing models that systematically favor certain groups over others. In 2023, the US Equal Employment Opportunity Commission (EEOC) settled a suit for $365,000 against iTutorGroup, which was accused of using AI-powered recruiting software that automatically rejected female applicants aged 55 or older and male applicants aged 60 or older. Bias can manifest in many other ways, such as discriminatory hiring practices, biased credit scoring, and unfair judicial outcomes, with AI systems perpetuating and even exacerbating existing social inequalities.

Technically, biased models exhibit reduced generalization: they perform well on overrepresented groups while failing on underrepresented ones. This compromises the robustness and reliability of AI applications, leading to poor decision-making and suboptimal outcomes. Biased AI systems also erode public trust, as users become wary of automated decisions they perceive as unfair or discriminatory. A 2019 study, for example, found that a widely used healthcare algorithm assigned Black patients lower emergency risk scores than equally sick patients with lighter skin tones.

Addressing bias in AI is thus crucial for ethical, equitable, and effective deployment, and it requires comprehensive strategies: diverse and representative training data, bias detection and mitigation techniques, and continuous monitoring and evaluation.
Synthetic data can play a crucial role in removing AI bias by providing balanced and representative datasets that mitigate the limitations of real-world data. Synthetic data is artificially generated rather than collected from real-world events, allowing for the creation of datasets that can be controlled and tailored to include diverse and equitable representations of various demographic groups. This helps address issues like underrepresentation and limited features that often lead to biased AI models. By supplementing or replacing biased real-world data, synthetic data can ensure that machine learning models are trained on a more comprehensive and unbiased dataset.

Technically, synthetic data generation techniques such as Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs) can simulate realistic and varied data points that reflect the full spectrum of potential scenarios and populations. This enhances the generalization capabilities of AI models, ensuring they perform well across different groups and conditions. Moreover, synthetic data can be used to test and validate AI systems, identifying and correcting biases before deployment. By integrating synthetic data into the training process, developers can create fairer, more robust, and trustworthy AI systems that better serve diverse populations.
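The core idea behind these generators can be sketched with a much simpler model. Below, a multivariate Gaussian fitted to a hypothetical minority-class sample stands in for a GAN or VAE: it learns the statistics of the real data, then emits new, untraceable rows with matching properties. This is a toy illustration, not a production generator.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical "real" minority-class data: 2-D features for 50 people.
real = rng.normal(loc=[2.0, -1.0], scale=[0.5, 0.3], size=(50, 2))

# "Train" a simple generative model: estimate the mean and covariance.
# (A GAN or VAE learns a far richer representation of the distribution.)
mean = real.mean(axis=0)
cov = np.cov(real, rowvar=False)

# Sample as many new synthetic rows as needed; none corresponds
# to any real individual, yet the statistics are preserved.
synthetic = rng.multivariate_normal(mean, cov, size=500)

print(np.round(mean, 2), np.round(synthetic.mean(axis=0), 2))
```

Because the generator can be sampled arbitrarily, the minority group can be boosted to any desired share of the training set without duplicating real records.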
Many synthetic data companies have tried to address this bias simply by increasing the number of synthetic samples of the underrepresented group. However, Betterdata's ML team's findings resonate with those of the ICML 2023 paper, which shows that adding synthetic data without considering the downstream ML model does not improve performance: the algorithm will still perform poorly on the underrepresented group. Instead, Betterdata's programmable synthetic data platform directly targets the weaknesses of the downstream ML task, identifying the samples it predicts wrongly in underrepresented classes and generating synthetic data that improves performance on them. Through this approach, Betterdata's technology can consistently improve precision (the accuracy of positive predictions) by up to 5%.
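Betterdata's actual method is proprietary; the following is only a hypothetical sketch of the general idea, using a toy nearest-centroid classifier and Gaussian jitter in place of a learned generator: find the minority-class samples the downstream model gets wrong, then generate synthetic points near those hard samples so retraining concentrates on the weak region.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-class data; class 1 is underrepresented (hypothetical).
X0 = rng.normal([0.0, 0.0], 1.0, size=(200, 2))
X1 = rng.normal([2.0, 2.0], 1.0, size=(20, 2))
X = np.vstack([X0, X1])
y = np.array([0] * 200 + [1] * 20)

def nearest_centroid_predict(X_train, y_train, X_eval):
    """Predict the class whose mean (centroid) is closest to each point."""
    centroids = np.array([X_train[y_train == c].mean(axis=0) for c in (0, 1)])
    dists = np.linalg.norm(X_eval[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

# Step 1: find the minority samples the current model misclassifies.
preds = nearest_centroid_predict(X, y, X)
hard = X[(y == 1) & (preds != y)]

# Step 2: generate synthetic points near those hard samples
# (small Gaussian jitter stands in for a learned generator).
synth = np.repeat(hard, 10, axis=0) + rng.normal(0, 0.2, size=(len(hard) * 10, 2))

# Step 3: retrain on the augmented dataset, focused on the weak region.
X_aug = np.vstack([X, synth])
y_aug = np.concatenate([y, np.ones(len(synth), dtype=int)])

# Compare minority-class recall before and after augmentation
# (improvement is expected in this setup, though not guaranteed in general).
before = (nearest_centroid_predict(X, y, X1) == 1).mean()
after = (nearest_centroid_predict(X_aug, y_aug, X1) == 1).mean()
print(before, after)
```

The key difference from blind oversampling is Step 1: synthetic data is steered toward the samples the downstream model actually fails on, rather than the underrepresented class as a whole.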