12 Open Source Datasets for Machine Learning

Sharon L. Hadden

June 7, 2019


Artificial intelligence is a double-edged sword. On the one hand, homes are smarter, health tech is advancing at a rapid pace, and driverless vans will soon deliver your groceries.

On the other hand, privacy violations, discrimination and a whole host of effects not yet known or experienced give us pause about these technologies.

Confronting the risks of AI begins with facing your data difficulties, including ingesting high-quality data before sorting, linking and programming even begin.

Here are nine open source datasets for machine learning and three dataset finders, including one dataset that will be featured in the Fine-Grained Visual Categorization (FGVC) workshop at CVPR on June 17.

Machine Learning Datasets

  1. Google Open Images: Google AI introduced over 9 million images spanning 6,000 categories, “enough to train a deep neural network from scratch.”
  2. ImageNet: If you’re looking for an image database organized according to the WordNet hierarchy, give ImageNet a try.
  3. iMaterialist-Fashion: Samasource and Cornell Tech announced the iMaterialist-Fashion dataset in May 2019, with over 50K clothing images labeled for fine-grained segmentation. The dataset is being used in the FGVC workshop at CVPR, which is co-sponsored by Google AI.
  4. Waymo Open Dataset: Waymo released one of the largest, most diverse autonomous driving datasets to date. All you need is a Gmail account, and you can access the dataset.
  5. Visual Genome: Visual Genome is the product of nine technology professionals with a goal of connecting structured image concepts to language.
  6. UCI Machine Learning Repository: The University of California, Irvine (UCI) maintains 474 datasets as a service to the machine learning community.
  7. Pew Research Center: Gain access to raw data from survey research via Pew Research Center. An account is required to access their datasets, but registration is easy.
  8. LabelMe: Use the LabelMe MATLAB toolbox to access a large dataset of annotated images.
  9. Labeled Faces in the Wild (LFW): Develop your facial recognition application using LFW, a collection of over 13,000 face photographs collected from around the web.
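
If the idea of an image database "organized according to the WordNet hierarchy" (the ImageNet entry above) feels abstract, here is a minimal Python sketch of what that kind of organization makes possible. It is a toy under stated assumptions, not ImageNet's actual format or API: the category names, the file names, and the helper functions `ancestors` and `images_under` are all invented for illustration.

```python
# Toy child -> parent links standing in for a WordNet-style hierarchy.
# These names are made up for illustration; they are not real ImageNet IDs.
PARENT = {
    "dog": "animal",
    "cat": "animal",
    "beagle": "dog",
    "poodle": "dog",
    "tabby": "cat",
}

# Hypothetical image files, each labeled with its most specific category.
IMAGES = {
    "img_001.jpg": "beagle",
    "img_002.jpg": "poodle",
    "img_003.jpg": "tabby",
    "img_004.jpg": "dog",
}


def ancestors(category):
    """Return the chain of parents from a category up to the root."""
    chain = []
    while category in PARENT:
        category = PARENT[category]
        chain.append(category)
    return chain


def images_under(category):
    """Collect images labeled with a category or any of its descendants."""
    return sorted(
        name
        for name, label in IMAGES.items()
        if label == category or category in ancestors(label)
    )


if __name__ == "__main__":
    print(ancestors("beagle"))
    print(images_under("dog"))
```

The payoff of a hierarchy is that a single query for a broad concept such as "dog" also picks up images labeled with narrower concepts such as "beagle", so you can choose how coarse or fine your training labels are.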

Dataset Finders

  1. Kaggle: Data scientists and machine learners can find and publish datasets on Kaggle, an online community that was acquired by Google in 2017. Kaggle’s master list of datasets boasts a wide range of niche data sources.
  2. Amazon Web Services (AWS): With over 110 datasets and counting, you’ll find a web crawl of billions of web pages, NASA satellite imagery and more on the Registry of Open Data for AWS. If you want to add to the registry, of course there’s an AWS Labs GitHub repository for that.
  3. Google Dataset Search: Google Dataset Search indexes datasets from digital libraries, personal websites and publisher pages, so you can find them when you need them. It’s currently in beta, but the predictive interface makes it easy to see what datasets are available on your selected topic at a glance.

This is just a small sample of the free, open source datasets that are available for machine learning use cases. If you have a dataset or dataset finder you’d like to add, leave a comment and let us know.

Sharon L. Hadden

Sharon is the Content Marketing Manager at Samasource, where she's responsible for telling the story behind the company's impact sourcing mission and human-powered training data solutions. Sharon holds an MS in Integrated Marketing Communications and is passionate about helping social enterprises transform abstract concepts into results-driven marketing.