4 Training Data Strategies to Avoid Bias

Audrey Boguchwal

May 20, 2019

4 minute read

According to a new McKinsey Global Survey, adoption of artificial intelligence continues to advance, but the same foundational barriers remain when it comes to creating value from AI at scale.

Among the challenges to adopting AI successfully is bias in data and algorithms. Audrey Boguchwal, Senior Product Manager at Samasource, shares four practical approaches to training data strategy that can help AI teams avoid the effects of training data bias.

The Impact of Biased Data

Models trained on biased data are less accurate and generalize poorly, undermining the value of your algorithm. Recent studies have shown that biased data can cause problems with facial recognition used in identification, surveillance, and law enforcement. Biased data can also perpetuate historical, negative stereotypes across race and gender.

Ensuring that reality is represented in your data is a constructive way to minimize the impact of data bias. However, a clear training data strategy for legally and ethically sourcing the data AI requires is fundamental to developing smarter models.

Avoid Data Bias with a Practical Training Data Strategy

  1. Clearly articulate your end training goal and know what data is needed to reach it. When you start with the end goal in mind, you're primed to think through the skill set, tool set, and milestones needed to achieve your training goal. For example, for object classifiers, your training data strategy might include preprocessing data to detect or offset dataset bias. Recognize before you begin data collection that images may look similar, and plan transformations (e.g., flipping or automatically cropping images) so they vary.
  2. Map out ways bias can enter data and proactively source data to avoid it. Keep in mind that humans are inherently biased, so eliminating all forms of bias is near impossible. Rigorously examine your own biases and the biases of those providing data/information to you.
  3. Avoid selection bias by varying search terms and data sources, and avoid negative set bias by varying data (e.g., collect data that contains background scenes in addition to objects of interest). Test on a wide range of data, both before and after training. If, in the end, you find your model has low variance and high bias, use cross-validation to tune your model's flexibility. Methods like cross-dataset generalization can also help determine how reliant your model is on its "native" dataset compared to other representative datasets.
  4. Ensure data represents reality for your training goal in quantity and diversity, and replenish it often. High-tech, automotive, retail: there isn't a single industry adopting AI that shows signs of stagnating growth. Refresh data often to stay ahead of trends, and use more than one training set, especially if one is a stock set. Success comes from being iterative: source and label new data as the world changes.
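The image transformations mentioned in step 1 can be sketched with plain NumPy. This is a minimal illustration rather than production augmentation code; the image array and crop sizes are hypothetical stand-ins for a real dataset:

```python
import numpy as np

def flip_horizontal(image):
    # Mirror the image left-to-right so orientation does not
    # accidentally correlate with a class label.
    return np.fliplr(image)

def random_crop(image, crop_h, crop_w, rng):
    # Take a random window so object position varies across examples.
    h, w = image.shape[:2]
    top = rng.integers(0, h - crop_h + 1)
    left = rng.integers(0, w - crop_w + 1)
    return image[top:top + crop_h, left:left + crop_w]

rng = np.random.default_rng(0)
img = rng.random((64, 64, 3))  # hypothetical stand-in for a collected photo
variants = [flip_horizontal(img), random_crop(img, 48, 48, rng)]
```

In practice you would apply such transformations across the whole dataset (or on the fly during training) so that near-duplicate images still present varied inputs to the model.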
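The cross-validation check from step 3 can be sketched by hand with NumPy. The nearest-centroid classifier and toy two-class data below are illustrative assumptions, standing in for whatever model and dataset you are actually diagnosing:

```python
import numpy as np

def kfold_accuracy(X, y, k, fit, predict):
    # Simple k-fold cross-validation: hold out each fold once,
    # train on the rest, and average held-out accuracy.
    idx = np.random.default_rng(0).permutation(len(y))
    folds = np.array_split(idx, k)
    accs = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train], y[train])
        accs.append(np.mean(predict(model, X[test]) == y[test]))
    return float(np.mean(accs))

# Toy nearest-centroid classifier, used only to demonstrate the loop.
def fit_centroids(X, y):
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def predict_centroids(model, X):
    classes = np.array(list(model))
    dists = np.stack([np.linalg.norm(X - model[c], axis=1) for c in classes])
    return classes[dists.argmin(axis=0)]

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.repeat([0, 1], 50)
acc = kfold_accuracy(X, y, 5, fit_centroids, predict_centroids)
```

A held-out accuracy that stays flat as you add model flexibility is one signal of the high-bias (underfitting) regime the step describes.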

An effective training data strategy can help you determine ways to mitigate unwanted bias. Audrey Boguchwal, Senior Product Manager at Samasource, will present "Practical Approaches to Training Data Strategy: Bias, Legal and Ethical Considerations" at the 2019 Embedded Vision Summit.

Audrey's presentation will expand on the four strategies in this post by exploring use cases that show how unintended bias can creep into datasets, sharing tests to detect dataset bias, and outlining legal and ethical data sourcing considerations.

If you'll be attending Embedded Vision Summit, stop by booth #621 to discuss your training data needs with the Samasource team, or click below to request a demo of our cloud-based data annotation platform, SamaHub.

Request Demo

Audrey Boguchwal

Currently a Senior Product Manager at Samasource, Audrey guides cross-functional teams to create thoughtful product solutions. She has guided teams of designers and engineers at HUGE Inc. and NBCUniversal, and monitored user analytics at the Wall Street Journal. With a BA in history from Harvard, an MA in anthropology from Columbia, and an MBA from UNC Chapel Hill KFBS, Audrey is passionate about using technology and data analytics to facilitate social impact and environmental solutions.