Solving the Data Challenge for Autonomous Driving

The road to fully autonomous driving has its share of challenges, but today’s commitment to quality and accuracy ultimately leads to the safe, connected vehicles of tomorrow.


Section 1

Developing High-Performance Automated Driving Technology

Self-driving by 2020 was the common goal among top automakers, but the self-driving car conversation has shifted noticeably since late 2016. A recent article reports that while 11 of the world’s largest automakers agree self-driving is inevitable, delivering on the promise of self-driving cars will take more time than expected.

Performance and safety validation is top of mind for teams who have achieved Level 3 and Level 4 autonomy. Three common pitfalls contribute to the delay in autonomous vehicle development: the complexity of sensor package data, the volume and diversity of data required, and the level of accuracy needed for high-performing vision systems.


Complexity of Sensor Packages

There’s still some debate over which sensors best capture how autonomous vehicles “see” the world around them.

For Level 1 and 2 vehicles, cameras and radar sensors remain essential for eliminating blind spots and measuring the distance to other vehicles and obstacles. For later-stage vehicle development, automakers struggle with the cost and processing power required to combine camera, radar and lidar data coherently, i.e., sensor fusion.


Regardless of development phase, a primary challenge in automotive software model development is understanding the full range of edge cases. How do you deal with light coming into the camera? What if there’s a small object you need to detect on a highway, e.g., a brick? How do you manage sensor redundancy in case one sensor goes out? How do you deal with non-sensor issues, e.g., a large metal object that different sensors perceive differently?

Sensor fusion helps the vehicle precisely and comprehensively understand the environment around it, but each sensor has its advantages and disadvantages. For example, an ultrasonic sensor can estimate the position of a static vehicle, but it has a slow refresh rate and can only cover a range of a few meters. When ultrasonic data is blended with lidar, the longer maximum range and faster refresh rate improve the vehicle’s proximity sensing capability.
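
One simple way to picture blending two range sensors is an inverse-variance weighted average, where the less noisy sensor gets more say in the fused estimate. The sketch below is only an illustration of that idea, not a description of any production fusion stack; the RangeReading type, the fuse function and the noise values are all assumptions made up for this example.

```python
from dataclasses import dataclass


@dataclass
class RangeReading:
    distance_m: float  # measured distance to the obstacle, in meters
    variance: float    # assumed measurement noise variance, in m^2


def fuse(readings):
    """Inverse-variance weighted average of independent range readings."""
    weights = [1.0 / r.variance for r in readings]
    total = sum(weights)
    distance = sum(w * r.distance_m for w, r in zip(weights, readings)) / total
    # The fused variance is smaller than that of any single input reading.
    return RangeReading(distance_m=distance, variance=1.0 / total)


# Hypothetical readings of the same parked vehicle (illustrative numbers only).
ultrasonic = RangeReading(distance_m=2.10, variance=0.04)
lidar = RangeReading(distance_m=2.02, variance=0.01)

fused = fuse([ultrasonic, lidar])
print(f"fused distance: {fused.distance_m:.3f} m, variance: {fused.variance:.4f}")
```

The fused estimate sits closer to the lidar reading because lidar is assumed to be the less noisy sensor here. Real systems typically also have to handle timing differences, calibration and sensor dropouts, which this sketch ignores.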

Volume and Diversity of Training Data Required 

Autonomous vehicles and systems rely on training data to learn about conditions they are likely to encounter and how to react to them. Because of this, the more high-quality data the machine learning model has to train on, the better. 

McKinsey Global Institute found that one in three AI systems requires a model refresh at least monthly, and sometimes daily. While acquiring data can be a challenge, a well-scoped technology roadmap can help you maintain focus while leaving room to be agile.

A practical training data strategy should identify when an existing dataset makes the most sense, and when it’s necessary to gather real-world experience. For example, training the vehicle to anticipate inherent human behavior, like a pedestrian crossing outside of the crosswalk or someone making a last-minute turn without signaling, comes from real-world exposure to such scenarios. In the same respect, self-driving cars have been known to perceive steam or exhaust as another vehicle, causing the car to make sudden, unnecessary stops.


Having worked with companies like Ford and GM to deliver turnkey, high-quality training data and model validation for self-driving cars, we bring industry expertise that includes recommending annotation best practices and helping clients identify gaps in instructions. We’ve found that the best way to gain a full understanding of the volume and diversity of data needed is to learn as you go. This includes getting a better understanding of the data needed to address edge cases.

High Accuracy Needed for High-Performing Vision Systems

Clean data is the foundation for trusted machine learning models, and autonomous vehicles are no exception. 

Amnon Shashua, CEO of Mobileye, has said that a self-driving car’s perception system should fail, at an absolute maximum, once in every 10 million hours of driving. Currently, however, the best self-driving assistance systems incorrectly perceive something in their environment once every tens of thousands of hours. This level of accuracy is a step toward building safe, autonomous vehicles, but it’s still not enough to perceive the road better than the best human driver.
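
To get a feel for the size of that gap, the arithmetic can be sketched in a few lines. The 10 million hour target comes from the text above; "tens of thousands of hours" is not a precise figure, so the three current values below are assumptions chosen only to span that range.

```python
TARGET_HOURS_PER_FAILURE = 10_000_000  # target cited in the text


def improvement_factor(current_hours_per_failure):
    """How many times more reliable perception must become to hit the target."""
    return TARGET_HOURS_PER_FAILURE / current_hours_per_failure


# Assumed sample points within "tens of thousands of hours".
for hours in (10_000, 50_000, 90_000):
    factor = improvement_factor(hours)
    print(f"{hours:>6} h per failure -> roughly {factor:.0f}x improvement needed")
```

Under these assumptions, the required improvement is on the order of a hundred to a thousand times.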

At 95 percent accuracy, you may encounter very few nuances in your data, but as you move toward 99 percent, maintaining high accuracy only gets more complex. Reevaluating your workflows at each quality threshold can assist in managing data quality at scale. Consider the following when working to improve the accuracy of your model:

  • Identify gaps in instructions and adjust data labeling requirements as needed.
  • Consult with a trusted data annotation partner on data labeling best practices.
  • Make focused shifts to address edge cases as they occur.
  • Adjust taxonomy and quality rubrics during the project to reflect current training goals.

In cases where your model is weak, treat the occurrence like a bug that needs to be fixed and continuously evolve your model from there. This agile approach to the ML lifecycle has helped us achieve upwards of 99.6% quality SLAs for automotive use cases.


Section 2

How to Build a World-Class Training Data Pipeline

From a lack of diverse data to understanding which sensors to use and when, developing high-performance self-driving systems is an uphill climb. A key success factor in helping driverless cars understand the unspoken rules of human driving behavior is the quality of your training data.


The first step in building a training data pipeline for autonomous driving is to select relevant data. This includes accounting for edge cases, as well as ensuring you have enough data to validate new features before proceeding to test solutions in the real world.

Duncan Curtis, Samasource VP of Product, estimates that approximately 1,000 to 10,000 images are needed to deploy a solution onto a vehicle, and that approximately 10,000 to 100,000 images are needed per ADAS feature to launch a product publicly.

Not all data collected will be relevant to the training dataset needed for your model, and for completely autonomous vehicles, the amount of data needed for edge case detection is unknown. Take into account the types of sensors you’re working with and what the sensors are being used for, to make informed decisions that keep your data pipeline full.

Next, it’s important to precisely define labels of interest per image, so it’s clear what needs to be annotated. AV algorithms rely on comprehensive and actionable instructions to interpret the world around us, and this leaves little room for inconsistencies in taxonomy, meta-data labels, annotation tools, etc. 


Your taxonomy should identify edge cases such as bicycles versus tricycles, as well as identify sub-attributes of interest. Relevant meta-data labels help contextualize and categorize your dataset, making it easier to identify gaps and quickly refresh datasets for new and unique driving conditions.
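
As a rough illustration, a taxonomy with sub-attributes and required meta-data labels can be written down as plain data and checked automatically. Everything in this sketch is hypothetical: the class names beyond bicycle and tricycle, the attribute names, the meta-data keys and the validate function are assumptions for the example, not an actual annotation schema.

```python
# Hypothetical taxonomy: class name -> sub-attribute -> allowed values.
TAXONOMY = {
    "bicycle": {"occluded": {"yes", "no"}, "rider_present": {"yes", "no"}},
    "tricycle": {"occluded": {"yes", "no"}, "rider_present": {"yes", "no"}},
}

# Hypothetical meta-data labels every image must carry.
REQUIRED_METADATA = {"weather", "time_of_day"}


def validate(annotation, metadata):
    """Return a list of problems; an empty list means the record is valid."""
    problems = []
    cls = annotation.get("class")
    if cls not in TAXONOMY:
        problems.append(f"unknown class: {cls}")
    else:
        for attr, value in annotation.get("attributes", {}).items():
            allowed = TAXONOMY[cls].get(attr)
            if allowed is None:
                problems.append(f"unknown attribute for {cls}: {attr}")
            elif value not in allowed:
                problems.append(f"invalid value for {attr}: {value}")
    for key in sorted(REQUIRED_METADATA - set(metadata)):
        problems.append(f"missing metadata: {key}")
    return problems


good = validate(
    {"class": "tricycle", "attributes": {"occluded": "no"}},
    {"weather": "rain", "time_of_day": "dusk"},
)
bad = validate({"class": "unicycle"}, {"weather": "clear"})
print(good)
print(bad)
```

Keeping the taxonomy as data makes it easier to adjust classes and attributes during a project and to spot records that no longer match the current definition.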

Just as sensor types need to be taken into account when collecting data, choosing the right annotation tool can help reliably produce more accurate labels. For example, polygons allow you to define boundaries more precisely than 2D boxes.
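
The difference between a polygon and a 2D box can be made concrete by comparing areas. The sketch below uses the shoelace formula for the polygon and the axis-aligned bounding box around the same points; the triangular outline is a made-up shape, and the function names are assumptions for this example.

```python
def polygon_area(points):
    """Area of a simple polygon via the shoelace formula."""
    area = 0.0
    n = len(points)
    for i in range(n):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % n]  # wrap around to close the polygon
        area += x1 * y2 - x2 * y1
    return abs(area) / 2.0


def bounding_box_area(points):
    """Area of the axis-aligned 2D box that encloses the same points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (max(xs) - min(xs)) * (max(ys) - min(ys))


# Hypothetical object outline in pixel coordinates (a triangle).
outline = [(0, 0), (10, 0), (5, 8)]

poly = polygon_area(outline)
box = bounding_box_area(outline)
print(f"polygon: {poly} px^2, box: {box} px^2, object fills {poly / box:.0%} of the box")
```

For this shape, half of the box is background rather than object, which is the kind of imprecision a polygon annotation avoids.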

Designing your labeling workflows to maximize accuracy might include reducing the number of labels per image to minimize errors, or breaking up complex workflows into several different steps and then merging the results at the end.

Developing a shared, quantified understanding of data quality in your organization is essential to building a world-class training data pipeline, and a robust quality rubric can help keep everyone on the same page. A few things to include in your quality rubric are:

  • Definition of a critical or minor error (e.g., wrong class vs. pixel not included)
  • Defined pixel tolerance to ensure your data is labeled as accurately as needed, including minimum pixel threshold
  • Process to evaluate quality across workflows and tasks
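
A rubric like this can be captured in a small, shared definition so that everyone scores errors the same way. The following is a minimal sketch under assumed values: the Rubric fields, the thresholds and the classify_error function are hypothetical, and a real rubric would be agreed on per project.

```python
from dataclasses import dataclass


@dataclass
class Rubric:
    pixel_tolerance: int = 3      # assumed max allowed boundary offset, in pixels
    min_object_pixels: int = 100  # assumed minimum object size that gets scored


def classify_error(rubric, expected_class, labeled_class, edge_offset_px, object_pixels):
    """Classify one labeled object as 'not_scored', 'critical', 'minor' or 'ok'."""
    if object_pixels < rubric.min_object_pixels:
        return "not_scored"  # below the minimum pixel threshold
    if labeled_class != expected_class:
        return "critical"    # wrong class
    if edge_offset_px > rubric.pixel_tolerance:
        return "minor"       # boundary outside the pixel tolerance
    return "ok"


rubric = Rubric()
print(classify_error(rubric, "bicycle", "tricycle", 1, 2500))  # wrong class
print(classify_error(rubric, "bicycle", "bicycle", 5, 2500))   # loose boundary
print(classify_error(rubric, "bicycle", "bicycle", 2, 2500))   # within tolerance
print(classify_error(rubric, "bicycle", "tricycle", 1, 40))    # too small to score
```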

In addition to a quality rubric, establishing a robust quality assurance process can assist in identifying and quickly addressing gaps in instructions and edge cases. A few best practices we’ve put in place for our clients include automated QA checks, advanced quality analytics reports and gold tasks to meet and exceed quality SLAs.
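
Gold tasks are items with a known correct answer that are mixed into normal work so label quality can be measured. The sketch below shows one simple way such a check might be scored; the task IDs, labels, function name and SLA threshold are assumptions for illustration and do not describe an actual QA system.

```python
def gold_task_accuracy(submitted, gold):
    """Share of gold tasks labeled correctly, or None if none were attempted."""
    scored = [task_id for task_id in gold if task_id in submitted]
    if not scored:
        return None
    correct = sum(1 for task_id in scored if submitted[task_id] == gold[task_id])
    return correct / len(scored)


# Hypothetical gold answers and one annotator's submissions.
gold = {"t1": "car", "t2": "bicycle", "t3": "pedestrian", "t4": "car"}
submitted = {"t1": "car", "t2": "tricycle", "t3": "pedestrian", "t4": "car", "t9": "truck"}

SLA_THRESHOLD = 0.996  # assumed quality threshold for this example

accuracy = gold_task_accuracy(submitted, gold)
meets_sla = accuracy is not None and accuracy >= SLA_THRESHOLD
print(f"gold accuracy: {accuracy:.2%}, meets SLA: {meets_sla}")
```

A handful of gold tasks gives only a coarse estimate, so in practice the sample would need to be large enough to support the threshold being checked.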

Having a dedicated team for data labeling also helps, given the cumulative expertise gained on specific subject matter and tasks during a project. Our industry-leading QA process has helped clients identify gaps in instructions for some of their most complex edge cases, as well as achieve full productivity in as little as two to five days.


Section 3

Bringing Safe, Self-Driving Vehicles to Market

“When you’re working on the large scale deployment of mission critical safety systems, the mindset of ‘move fast and break things’ certainly doesn’t cut it.”

- Dan Ammann, CEO, Cruise


Agility, accuracy and trust are key to developing automated driving technology. As a data annotation partner with over a decade of experience and deep expertise in the autonomous transportation industry, we offer the following recommendations for bringing safe, self-driving cars to market.

  1. Don’t underestimate the complexity of your data. There are varying aspects of complexity in model development, from the region of the world where your data is captured to the number and types of labels for 2D, 3D and video annotation. All of these factors are essential to account for.
  2. Develop and refine in parallel. Starting with fundamentals like staying between lane lines, changing lanes or merging onto a highway doesn’t mean you can’t simultaneously make each feature more capable. Build on basic training scenarios. For example, focus on a human-like trajectory and speed for lane changing with no other cars around, then practice lane changing with other cars around until the vehicle can nudge its way into a lane much as a human driver would.
  3. Be steadfast in testing and validation. Your machine learning algorithm needs to study a lot of examples to get to 90 percent, 95 percent, 99 percent quality and so on. At a certain point, getting on the road to validate what the model has learned is the only way to improve self-driving systems.


Ground Truth Data for Autonomous Transportation

Quickly develop high-performing automated driving technology with high-quality training data from Samasource. 

Contact us for help with training data strategy, model validation, workflow configuration, data enrichment and more.