curriculum-development
Places to look for data
The section on datasets describes what to consider when selecting a dataset. It would be great to have a few suggested places to start looking for data. Here are the sites I use when looking for example data:
- UCI machine learning datasets
  - advantages: large, diverse datasets with attributes and applications already described (chemistry, biology, physics, engineering, social sciences, humanities)
  - disadvantages: licenses not generally mentioned, though citation guidelines are described for each dataset; sometimes metadata is insufficient (but there are often links to other datasets)
- Data Dryad: data repository primarily for biological datasets associated with published manuscripts
  - advantages: CC0 license, authentic data associated with published research
  - disadvantages: no consistency in formatting or metadata inclusion
- Tidy Tuesday: datasets used for weekly data visualization challenges
  - advantages: very diverse datasets, CC0 license, often with links to news articles; includes ideas for applications of data
  - disadvantages: data aren't always well-curated
I'd be happy to flesh this out in a PR if this is something folks would like to see in the main text. I can also add some text on how to approach extracting tasks (coding, etc.) from data.