Places to look for data

Open k8hertweck opened this issue 3 years ago • 0 comments

The section on datasets describes what to consider when selecting a dataset. It would be great to have a few suggested places to start looking for data. Here are the sites I use when looking for example data:

  • UCI machine learning datasets
    • advantages: large, diverse datasets with attributes and applications already described (chemistry, biology, physics, engineering, social sciences, humanities)
    • disadvantages: licenses not generally mentioned, though citation guidelines are described for each dataset; sometimes metadata is insufficient (but there are often links to other datasets)
  • Data Dryad: data repository primarily for biological datasets associated with published manuscripts
    • advantages: CC0 license, authentic data associated with published research
    • disadvantages: no consistency in formatting or metadata inclusion
  • Tidy Tuesday: datasets used for weekly data visualization challenges
    • advantages: very diverse datasets, CC0 license, often with links to news articles; includes ideas for applications of data
    • disadvantages: data aren't always well curated

I'd be happy to flesh this out in a PR if this is something folks would like to see in the main text. I can also add some text on how to approach extracting tasks (coding, etc.) from data.

k8hertweck, Mar 13 '21 22:03