Finding freely available datasets on the web

How to find datasets for research?

There are many freely available datasets on the web that you can use for research or to practice analysis techniques. One of the best options may simply be a search engine: type in your research topic followed by "dataset" or "publicly available dataset". If you have trouble finding a dataset that way, here are some places to start.

Some things to be aware of: many federal or other large-scale datasets have complex structures, so check how a dataset is organized before you decide to use it. For example, it is quite common for a dataset to be split across several files. This is not a problem if all the variables that you want are in a single file, but if the variables that you are interested in are spread across multiple files, you will have to merge the files before you can do any analysis. This is doable, but can be challenging if you don't yet have experience with it. If you want to try it, the Stata manual entry for the merge command explains how: https://www.stata.com/manuals/dmerge.pdf
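To make the idea of merging concrete, here is a minimal sketch of the logic. It is written in Python rather than Stata so that it can be run on its own, and it is only an illustration, not a substitute for the Stata merge command. The two "files", the column names (student_id, math_score, family_income), and the helper functions are all made up for this example.

```python
import csv
import io

# Hypothetical contents of two files from the same dataset. The column
# names (student_id, math_score, family_income) are made up for illustration.
SCORES_CSV = """student_id,math_score
101,78
102,85
103,91
"""

BACKGROUND_CSV = """student_id,family_income
101,42000
103,67000
104,58000
"""


def read_rows(text):
    """Parse CSV text into a list of dicts, one dict per row."""
    return list(csv.DictReader(io.StringIO(text)))


def merge_one_to_one(left_rows, right_rows, key):
    """Combine rows that share the same key value (a one-to-one merge).

    Returns the merged rows plus the key values found in only one file,
    so that unmatched records are visible rather than silently dropped.
    """
    right_by_key = {row[key]: row for row in right_rows}
    left_keys = {row[key] for row in left_rows}
    right_keys = set(right_by_key)

    merged = []
    for row in left_rows:
        match = right_by_key.get(row[key])
        if match is not None:
            # Put the columns from both files into a single record.
            merged.append({**row, **match})

    only_left = sorted(left_keys - right_keys)
    only_right = sorted(right_keys - left_keys)
    return merged, only_left, only_right


merged, only_scores, only_background = merge_one_to_one(
    read_rows(SCORES_CSV), read_rows(BACKGROUND_CSV), key="student_id"
)
for row in merged:
    print(row)
print("only in scores file:", only_scores)
print("only in background file:", only_background)
```

The point to notice is that a merge depends on an identifier that appears in both files, and that some records may exist in only one file. Checking how many records failed to match is a good habit before you run any analysis on the merged data.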

Many datasets also come with weights. Weights change how much each observation "counts" in the model, so that the weighted sample better reflects the characteristics of the population. For example, in a dataset where the only variable is gender, if women were twice as likely to answer the survey as men, each man's response might be counted twice. With many different variables, generating a weight is more complex, but using an existing weight in your models is fairly straightforward if you want to try it: you can add a weight such as fweight or pweight to your linear regression command. Here is a page that briefly describes the different weight types and how to use them: https://www.reed.edu/psychology/stata/gs/tutorials/weights.html. There are also complex survey designs, which are much more complicated to model (e.g., if you see the ...). If you are interested, you can read more about complex survey design and data collection here: https://stats.oarc.ucla.edu/stata/seminars/applied-svy-stata13/#:~:text=The%20probability%20weight%2C%20called%20a,be%2010%2F3%20%3D%203.33. These designs usually use the Stata svy command, which you can read more about on that page.
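The gender example can be worked through with a few made-up numbers. This is a minimal sketch in Python (not Stata), and everything in it is an assumption for illustration: the six respondents, the outcome (hours of homework), and a population that is half women and half men.

```python
# Hypothetical survey of six respondents: (gender, hours_of_homework).
# Assumption: the population is half women and half men, but women were
# twice as likely to respond, so the sample has 4 women and only 2 men.
responses = [
    ("woman", 10), ("woman", 10), ("woman", 10), ("woman", 10),
    ("man", 4), ("man", 4),
]

# Each man's response counts twice, as in the example in the text.
weight_for = {"woman": 1, "man": 2}


def weighted_mean(values, weights):
    """Average in which each value counts as many times as its weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


hours = [h for _, h in responses]
weights = [weight_for[g] for g, _ in responses]

unweighted = sum(hours) / len(hours)
weighted = weighted_mean(hours, weights)

print("unweighted mean:", unweighted)  # 8.0
print("weighted mean:", weighted)      # 7.0
```

With these numbers the unweighted mean (8.0) is pulled toward the women's answers because women are over-represented in the sample, while the weighted mean (7.0) matches what you would expect from a population that is half women and half men.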

However, none of that is necessary for this class. For this class, I suggest that you try to find a dataset that is self-contained in a single file and doesn't require any complex weighting. (You can also play around with a weighted dataset while ignoring the weights. Your regressions will still run, but your results may be biased: they will describe the particular sample that responded rather than the population it was drawn from. How much that matters varies: sometimes it makes little difference, and sometimes it changes the regression results quite a bit.)
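To see how ignoring weights can change a regression, here is a small sketch in Python (not Stata) that fits a one-predictor regression slope twice, once ignoring the weights and once using them. The data, the weights, and the function name are made up for illustration, and the sketch computes only the slope estimate; it says nothing about standard errors.

```python
# Hypothetical data: (x, y, weight). Assumption: the first group was
# over-sampled (weight 1 each) and follows y = 2x; the second group was
# under-sampled (weight 4 each) and follows y = x.
data = [
    (1, 2, 1), (2, 4, 1), (3, 6, 1), (4, 8, 1),
    (1, 1, 4), (4, 4, 4),
]


def weighted_slope(points):
    """Slope of a weighted least-squares line through (x, y, weight) points."""
    total_w = sum(w for _, _, w in points)
    mean_x = sum(w * x for x, _, w in points) / total_w
    mean_y = sum(w * y for _, y, w in points) / total_w
    cov_xy = sum(w * (x - mean_x) * (y - mean_y) for x, y, w in points)
    var_x = sum(w * (x - mean_x) ** 2 for x, _, w in points)
    return cov_xy / var_x


# Ignoring the weights is the same as giving every point a weight of 1.
ignoring_weights = weighted_slope([(x, y, 1) for x, y, _ in data])
using_weights = weighted_slope(data)

print(f"slope ignoring weights: {ignoring_weights:.3f}")  # 1.526
print(f"slope using weights:    {using_weights:.3f}")     # 1.217
```

In this made-up case the unweighted slope is dominated by the over-sampled group, so it overstates the relationship in the population. If the two groups had followed the same line, the two slopes would have been identical, which is why ignoring weights sometimes makes little difference.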

Some places to find datasets, to get you started:

US Department of Education datasets: https://www2.ed.gov/about/data/list.html

Here is a helpful list from UC San Diego (but be careful: many of these datasets have complex structures): https://ucsd.libguides.com/data-statistics/education

This is a data repository for social science research, where AERA asks authors to deposit their datasets to meet open access requirements (these datasets are not vetted; they are simply deposited by researchers, who should also have included information about how the data were collected): https://www.openicpsr.org/openicpsr/