8 min • 22 March, 2021
Data sets help you find information about anything and everything. They are a gold mine for data scientists looking to master their craft. But you don't have to be a data scientist to use data sets. Non-techies can use them too, although it does take a certain level of understanding to make sense of all the data at your disposal.
The internet continues to bless us with tons of information. Yes, you guessed it. You can find data sets on the internet. Gone are the days when you had to go into humongous public libraries to find data on specific things from news journals and public records gathered across decades. It is simpler today, thanks to data sets on the web. One more reason why we love the internet.
Let us look at a range of dataset websites where you can browse information about almost anything. Some of the sites have a focus. For example, NASA's Earthdata focuses on data about the Earth, with related NASA archives covering the exploration of other planetary bodies. Data.gov focuses on public data assembled by the US government. Okay, let's dive in.
Searching for data sets in Google Dataset Search is as easy as searching for anything in Google Search! You enter the subject of the dataset you need to find and then click "search". For example, if you want to find a data set on coronavirus, type "coronavirus" and hit search. Google Dataset Search indexes data sets from many different sources, so you can search and find data in one place.
Another source of data from Google is Google Trends. Google provides readily accessible data sets on search trends, and you can customize the parameters to easily find whatever it is you're interested in. We recommend exporting the dataset and running it through [GYANA](https://gyana.com/features) for one-click visualizations and advanced analysis.
Most of the dataset websites on our list offer open datasets. Some, like the next one, are a bit more technical than others. It is ideal for data science and data scientists.
Kaggle is more than a dataset website. It is a platform for data scientists that includes competitions, short courses, repositories, and project data sets. This is a great resource for people who want to expand their knowledge and polish up their skills. Kaggle provides a customizable Jupyter notebook environment with no setup required. It offers access to free GPUs and a huge repository of data and code published by its community of data scientists.
Kaggle allows users to find and publish data sets, and to explore and build models in a web-based data science environment. It offers a way to collaborate with other data scientists and machine learning engineers, and to participate in competitions to solve data science problems.
A typical Kaggle competition lasts three months, with a prize package of between 25,000 and 100,000 USD. It attracts approximately 1,000 experts, and at least 10% of these experts are top talents. Each competition is independent. Although competing on Kaggle differs from typical day-to-day data science, it can still be an excellent learning tool.
On Kaggle, you can fine-tune your data science skills. It is a data science community for savvy data analysts. However, if you are looking specifically for machine learning datasets, you can try the UCI Machine Learning Repository, which holds machine learning data compiled by the University of California, Irvine.
It is a useful resource for any machine learning community. So, if you are looking to work on some machine learning projects, hopefully this site will prove useful.
The next source on our list is an open data portal provided by the US government. You may already be familiar with data.gov, the comprehensive catalog of datasets published by the government of the United States.
Data.gov is a repository of the US government's public data. It offers data, tools, and resources for conducting research. Here you will also find data for developing web and mobile applications, tools and data for designing data visualizations, and more. Data.gov has more than 200,000 data files covering a wide range of categories. The platform is powered by CKAN, developed by the founders of DataHub, Adam Kariv and Rufus Pollock.
The US government updated Data.gov with a new version in February 2021. The new catalog automatically harvests data from more than 1,000 different federal, state, and local open data sources. With this latest version, data.gov also runs on an updated version of CKAN, which has improved the process of automatically updating Data.gov with the latest datasets.
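Because Data.gov runs on CKAN, its catalog can also be searched from code through CKAN's Action API. The snippet below is a minimal sketch in Python, not an official client: it assumes the catalog exposes the standard `package_search` action at `https://catalog.data.gov/api/3/action/package_search`, and the helper names (`build_search_url`, `summarize`, `search`) are our own, chosen for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: Data.gov's catalog serves the standard CKAN Action API at this URL.
CATALOG_API = "https://catalog.data.gov/api/3/action/package_search"


def build_search_url(query, rows=5, base=CATALOG_API):
    """Build a CKAN package_search URL for a free-text query."""
    return base + "?" + urlencode({"q": query, "rows": rows})


def summarize(payload):
    """Return (title, sorted resource formats) pairs from a package_search response."""
    if not payload.get("success"):
        raise ValueError("CKAN reported an unsuccessful request")
    summary = []
    for dataset in payload["result"]["results"]:
        # Collect the distinct, non-empty file formats attached to this dataset.
        formats = sorted(
            {r.get("format") for r in dataset.get("resources", []) if r.get("format")}
        )
        summary.append((dataset.get("title", ""), formats))
    return summary


def search(query, rows=5):
    """Query the live catalog (hypothetical helper; needs network access)."""
    with urlopen(build_search_url(query, rows), timeout=30) as response:
        return summarize(json.load(response))
```

Calling `search("air quality")` should return a short list of dataset titles with the file formats each one offers, provided the endpoint is reachable. Since the same API ships with every CKAN installation, passing a different `base` to `build_search_url` should work for other CKAN-powered portals too.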
Adam Kariv and Rufus Pollock are the founders of DataHub. They also built CKAN, which powers DataHub. CKAN is the same software that powers data.gov and data.gov.uk, the open data portals of the US and UK governments.
DataHub offers various solutions for publishing and deploying data powerfully and effortlessly. It describes itself as the fastest way for individuals, teams, and organizations to publish, deploy, and share data. On their website, they say that DataHub is simply a place where people can find tons of high-quality data, store their own data, and share it with colleagues and others.
The founders say DataHub was originally a project initiated by Datopian and Open Knowledge International, and it has now grown to what it is today. DataHub represents the vision of data management and automation. It is a tool that can help you improve your ability to create and use high-quality data, thereby greatly improving convenience, speed, and reliability.
As kids, most of us dreamt of exploring space. Oh, how we wished our tales of space exploration, a trip to the moon, intergalactic travel, dining with aliens, and playing galaxy games like Groot and co. in Guardians of the Galaxy would come true someday.
Those dreams can feel closer when searching on earthdata.nasa.gov, a repository of NASA's satellite observation data. It holds datasets about the Earth, such as weather and climate measurements, atmospheric observations, ocean temperatures, vegetation mapping, and more.
But it goes further. You can also get data from NASA's Planetary Data System, which offers data from interplanetary missions such as InSight, OSIRIS-REx, New Horizons, MSL - Curiosity, MAVEN, Dawn, Juno, LADEE, and more. So, if your dreams of space travel never get to see the light of day, you can always visit Earthdata to satisfy your space-exploration cravings.
Want to test your ability to use highly complex datasets? There is only one place to visit. Go to the CERN Open Data Portal. It holds more than two petabytes of data, including datasets from the Large Hadron Collider. That may sound intimidating, but this amount of data shouldn't scare you. It is especially worth a look if you are a particle physics enthusiast.
Although the names of these datasets can be complex, each entry has a useful breakdown of what it contains, the related datasets, and how they are analyzed. In many cases, they even provide sample code to help you on your journey.
In December 2020, the four main collaborations at the Large Hadron Collider (ALICE, ATLAS, CMS, and LHCb) unanimously approved a new open data policy for scientific experiments at the LHC. The policy addresses the public release of so-called Level 3 scientific data collected by the LHC experiments, which is the type of data required for scientific research. They will publish the data approximately five years after collection, with the aim of making all data sets public by the end of the experiment. This policy reflects the growing open science movement and aims to make scientific research more replicable, accessible, and cooperative.
ProPublica, probably best known for their award-winning investigative journalism, collects data pertaining to various aspects of the US. They have both free and premium datasets, but don't be discouraged.
The number of free datasets is notable, and we thoroughly recommend checking out ProPublica, particularly if you are looking for US-related datasets about Health, Criminal Justice, Education, Politics, Business, Transportation, Military, Environment, Finance, or Religion.
The CDC (Centers for Disease Control and Prevention) collects an abundance of health data from US government research and sources, including data and research on alcohol, life expectancy, obesity, and chronic diseases. This is a great resource if you are interested in analysing and understanding public health.
The ONS (Office for National Statistics) is a centralized repository holding data related to the United Kingdom. It has datasets on crime, the economy, public health, and policy, and it's all readily available.
Johns Hopkins, a renowned private research university, provides extensive data on many facets of COVID-19 statistics and research. Its data is freely available to the public and has been widely cited by researchers, journalists, and governments as a reliable, independent source.
You now have a comprehensive list of some interesting dataset websites. We hope the list helps you on your dataset journey. Of course, if you don't find what you are after, there are more dataset websites on the internet to try out. Some are quite specific, while others supply a general catalog of datasets. Get lost in your search, and thanks for checking this list out.