More than ever, people need data to support their opinions and decisions. Luckily, there are tons of free datasets online. Google is your friend for almost any question, but many of the highest-quality data sources don't show up at the top of a Google search. This article points you to my top 7 resources for finding high-quality data online.
Google has created a separate search engine specifically for datasets. It is still in beta, so there may not be great results for every topic, but it should be your first stop when looking for data.
Kaggle is a data science competition website. Different groups post a dataset and a prompt, and users of the site then have a set amount of time to complete the project. The best part is that after a competition ends, the data stays on the website and is available to download for free. There are now over 12,000 datasets available on the site.
GitHub is the world standard for collaborative code repositories online. There is more than just code on GitHub, though: many projects on the platform include datasets you can use. It is a great place to search for data, and there is even a project that maintains a curated list of public data sources.
Most government agencies make a lot of data available for the public to download and use. You can find city, state, and federal datasets covering the environment, the economy, demographics, and much, much more.
The World Bank provides a ton of different information about countries all over the world.
FiveThirtyEight covers a wide range of news topics and always incorporates data into their articles. They now share many of the data sets they use. This is a great source for data about sports, culture, and politics.
data.world has a wide variety of datasets and lets you easily collaborate with others on a given data project. This site does require you to create a login to access datasets.
For any data set you find online there are a few questions you should ask.
Should I trust this data source?
Consider the reputation of the source of the data: is it a large institution or a single individual? If you are skeptical, check other data sources on the same topic and see if the numbers still seem reasonable. Most of the sources above I would rank as highly reputable. You should be a bit more cautious with community-contributed data on sites like data.world or GitHub, as it is likely not verified.
Could the data be inaccurate?
Investigate the data: come up with an estimate of what the maximum and minimum should be for each column, and then check whether any values fall outside that range. An easy way to see this is to sort each column in ascending and then descending order, which surfaces the minimum and maximum values. To do this in Excel or Google Sheets, select all of the data, click the filter icon, and then use the A to Z and Z to A sort options.
A lot of the time, data has simply been entered incorrectly: instead of $11,000.00 someone might have typed $1,100.00 or $11,00.00. The sorting approach described above can help spot the most obvious examples. Another common case is that people sometimes don't want to provide real data for things like phone numbers, so you might see a lot of 9999999999 or 0000000000 in those columns.
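The same checks can be scripted instead of done in a spreadsheet. Here is a minimal sketch in Python using only the standard library; the CSV contents, column names, and placeholder values are hypothetical stand-ins for whatever dataset you downloaded:

```python
import csv
from io import StringIO

# Hypothetical sample data standing in for a downloaded CSV file.
sample = """name,salary,phone
Alice,11000.00,5551234567
Bob,1100.00,9999999999
Carol,110000.00,5559876543
"""

rows = list(csv.DictReader(StringIO(sample)))

# Sort a numeric column to surface the extremes, like sorting A to Z
# and Z to A in a spreadsheet.
salaries = sorted(float(r["salary"]) for r in rows)
print("min salary:", salaries[0])   # is the smallest value a possible typo?
print("max salary:", salaries[-1])  # is the largest outside your expected range?

# Flag obvious placeholder phone numbers.
placeholders = {"9999999999", "0000000000"}
suspect = [r["name"] for r in rows if r["phone"] in placeholders]
print("rows with placeholder phones:", suspect)
```

The idea is the same as in the spreadsheet: eyeball the extremes against your prior estimate of the plausible range, and flag values that look like filler rather than real data.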
The title of a column can also be misleading. For instance, a field could be titled "% Employed" and contain either 0.80 or 80, both meaning 80%. This can usually be figured out with context clues (what seems reasonable, what the other values in the column look like, etc.).
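That context-clue reasoning can be written down as a simple heuristic. This is a sketch, assuming the ambiguous column holds either 0-1 proportions or 0-100 percentages; the function name and sample values are made up for illustration:

```python
def as_proportion(values):
    """Normalize a percent-like column to 0-1 proportions using a context
    clue: if any value exceeds 1, assume the whole column is 0-100
    percentages and divide by 100."""
    if max(values) > 1:
        return [v / 100 for v in values]
    return list(values)

print(as_proportion([80, 75, 92]))        # percentages -> proportions
print(as_proportion([0.80, 0.75, 0.92]))  # already proportions, unchanged
```

This heuristic fails if every true percentage in the column happens to be below 1%, which is exactly why the article recommends sanity-checking against what seems reasonable for the data.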
Could the data be incomplete?
Datasets often have missing values, so it is a best practice to check for nulls or blanks in any dataset you want to use. In Excel you can do this with the COUNTBLANK function: for example, COUNTBLANK(B1:B3) returns the number of empty cells in that range.
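The COUNTBLANK idea translates directly to code. A minimal sketch, again with a hypothetical CSV where one cell is blank:

```python
import csv
from io import StringIO

# Hypothetical CSV with a missing value in the population column.
sample = """city,population
Springfield,30000
Shelbyville,
Ogdenville,12000
"""

rows = list(csv.DictReader(StringIO(sample)))

# Count blank cells per column -- the equivalent of running COUNTBLANK
# over each column's range in a spreadsheet.
blanks = {}
for col in rows[0]:
    blanks[col] = sum(1 for r in rows if not (r[col] or "").strip())

print(blanks)
```

Running a check like this over every column before an analysis tells you which columns you can trust and which will need missing values handled first.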
Is the data skewed?
Try visualizing the different columns of your dataset. For numeric columns, use a histogram and see what type of distribution each column has (normal, left-skewed, right-skewed, uniform, bimodal, etc.). For non-numeric columns, use a frequency table: is it mostly one value? Checking these things builds your intuition about overall data quality and about which columns to use in an analysis.
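Both checks can be approximated without a charting tool. A minimal sketch with made-up columns: comparing the mean to the median is a rough stand-in for eyeballing a histogram (a mean far from the median suggests skew), and `Counter` gives the frequency table for a non-numeric column:

```python
from collections import Counter
import statistics

# Hypothetical columns standing in for a real dataset.
ages = [23, 25, 25, 26, 27, 29, 31, 34, 35, 80]  # numeric column with one outlier
states = ["CA", "CA", "CA", "NY", "TX", "CA"]    # non-numeric column

# Rough skew check: the 80 pulls the mean well above the median,
# hinting at a right-skewed distribution.
print("mean:", statistics.mean(ages), "median:", statistics.median(ages))

# Frequency table for a non-numeric column: is it mostly one value?
freq = Counter(states)
print(freq.most_common())
```

Here the frequency table would show the column is dominated by "CA", which is exactly the kind of imbalance worth knowing about before using that column in an analysis.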
Many data tools let you quickly and easily check for all of these quality issues. Excel and Google Sheets are the quickest and easiest to use with any .csv or Excel file. There are also more advanced tools, such as Alteryx, that can check multiple columns at once.