14 Open Source Datasets and Dataset Finders for Machine Learning

Sharon L. Hadden

June 7, 2019


 

Artificial intelligence is a double-edged sword. On one hand, homes are smarter, health tech is advancing at a rapid pace, and driverless vans will soon deliver your groceries.

On the other hand, privacy violations, discrimination and a whole host of effects not yet known or experienced give us reason to pause before embracing these technologies.

Confronting the risks of AI begins with facing your data difficulties, including ingesting high-quality data before sorting, linking and programming even occur.


Here are eleven open source datasets for machine learning and three dataset finders, including one dataset that was featured in the Fine-Grained Visual Categorization (FGVC) workshop at CVPR 2019 on June 17.

Machine Learning Datasets

  1. COVID-19 Open Research Dataset: The Allen Institute for AI partnered with leading research groups to prepare this research dataset of over 45,000 scholarly articles about COVID-19 and the coronavirus family of viruses.
  2. Google Open Images: Google AI introduced over 9 million images spanning 6,000 categories, “enough to train a deep neural network from scratch.”
  3. Waymo Open Dataset: Waymo released one of the largest, most diverse autonomous driving datasets to date. All you need to access the dataset is a Gmail account.
  4. ImageNet: If you’re looking for an image database organized according to the WordNet hierarchy, give ImageNet a try.
  5. iMaterialist-Fashion: Samasource and Cornell Tech announced the iMaterialist-Fashion dataset in May 2019, with over 50K clothing images labeled for fine-grained segmentation. The dataset was used in the FGVC workshop at CVPR, co-sponsored by Google AI.
  6. Fishnet.AI: Working together with Samasource, The Nature Conservancy released Fishnet.AI, an AI training dataset for fisheries. This dataset of approximately 35,000 images, with an average of 5 bounding boxes per image, was collected from on-board monitoring cameras during longline tuna fishing activity in the Western and Central Pacific.
  7. Visual Genome: Visual Genome is the product of 9 technology professionals with a goal of connecting structured image concepts to language.
  8. UCI Machine Learning Repository: The University of California, Irvine (UCI) maintains 474 datasets as a service to the machine learning community.
  9. Pew Research Center: Gain access to raw data from survey research via Pew Research Center. An account is required to access their datasets, but registration is easy.
  10. LabelMe: Use the LabelMe MATLAB toolbox to access a large dataset of annotated images.
  11. Labeled Faces in the Wild (LFW): Develop your facial recognition application using LFW, a collection of over 13,000 face photographs collected from around the web.
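
Whichever annotated image dataset you choose, a quick sanity check before training can catch low-quality labels early. Here is a minimal Python sketch of that idea: it computes the average number of bounding boxes per image (the kind of figure quoted for Fishnet.AI above) and flags images whose boxes are empty or fall outside the frame. The `Box` and `ImageRecord` layouts, the `summarize` helper and the sample values are all hypothetical and for illustration only; they are not the actual format of Fishnet.AI or any other dataset listed here.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in pixel coordinates (hypothetical layout)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    def is_valid(self, width: int, height: int) -> bool:
        # A usable box has positive area and lies fully inside the image.
        return (
            0 <= self.x_min < self.x_max <= width
            and 0 <= self.y_min < self.y_max <= height
        )


@dataclass(frozen=True)
class ImageRecord:
    """One annotated image: its size plus the boxes drawn on it."""
    image_id: str
    width: int
    height: int
    boxes: Tuple[Box, ...]


def summarize(records: List[ImageRecord]) -> Dict[str, object]:
    """Return the image count, average boxes per image, and ids of images with bad boxes."""
    if not records:
        return {"images": 0, "avg_boxes": 0.0, "bad_images": []}
    bad = [
        r.image_id
        for r in records
        if any(not b.is_valid(r.width, r.height) for b in r.boxes)
    ]
    return {
        "images": len(records),
        "avg_boxes": mean(len(r.boxes) for r in records),
        "bad_images": bad,
    }


# Made-up sample records, not taken from any real dataset.
SAMPLE = [
    ImageRecord("img_001", 640, 480, (Box(10, 20, 110, 220), Box(300, 50, 420, 200))),
    ImageRecord("img_002", 640, 480, (Box(0, 0, 640, 480),)),
    # This box spills past the right edge of a 640-pixel-wide image.
    ImageRecord("img_003", 640, 480, (Box(600, 400, 700, 470),)),
]

if __name__ == "__main__":
    print(summarize(SAMPLE))
```

In practice you would build the records from whatever annotation files a dataset ships with, then review or drop the flagged images before training.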

Dataset Finders

  1. Kaggle: Data scientists and machine learners can find and publish datasets on Kaggle, an online community that was acquired by Google in 2017. Kaggle’s master list of datasets boasts a wide range of niche data sources.
  2. Amazon Web Services (AWS): With over 110 datasets and counting, the Registry of Open Data for AWS includes a web crawl of billions of web pages, NASA satellite imagery and more. If you want to add to the registry, of course there’s an AWS Labs GitHub repository for that.
  3. Google Dataset Search: Google Dataset Search indexes datasets from digital libraries, personal websites and publisher pages, so you can find them when you need them. It’s currently in beta, but the predictive interface makes it easy to see at a glance what datasets are available on your selected topic.

This is just a small sample of the free, open source datasets that are available for machine learning use cases. If you have a dataset or dataset finder you’d like to add, leave a comment and let us know.


Sharon L. Hadden

Sharon is the Content Marketing Manager at Samasource, where she's responsible for telling the story behind the company's impact sourcing mission and human-powered training data solutions. Sharon holds an MS in Integrated Marketing Communications and is passionate about helping social enterprises transform abstract concepts into results-driven marketing.