stat545
bellybutton data: great example
Manually transferring from old STAT 545 website repo (https://github.com/STAT545-UBC/STAT545-UBC.github.io/issues/47).
Copied from what I posted in https://github.com/Reproducible-Science-Curriculum/rr-organization1/issues/41
Might make a good homework? For wrangling? Or data package creation?
Came to my attention via @zross and @Pakillo on Twitter
Data on the biodiversity of belly buttons. You would get to say "belly button" a lot. And analyze innies vs outties.
http://navels.yourwildlife.org/bbb-project/results-and-data/
Basically they did lots of things right. It's a near miss. So fixing the problems is doable, would be very educational, and would have a happy ending.
You could talk about
- renaming these files consistently
- depositing them somewhere more discoverable and persistent
- making data available in a non-proprietary format (it's xlsx only)
- within the xlsx (ok this heads into other areas, i.e. tidy data and spreadsheet hygiene)
  - there's gratuitous human-targeted annotation in the header row (screenshot below)
  - data stored in wide form, which is probably a good choice, but gives opportunity to discuss reshape after import
  - metadata in a second worksheet, which definitely makes sense, but gives opportunity to practice joins
  - human-targeted notes in a third worksheet which again makes sense, but gives opportunity to talk about what this would look like as, e.g. a git repository of a README plus 2 csv files and 1 or more R scripts
- data was collected in two waves, so there are two xlsx files; I've only looked at one, but I would bet $ that there are some interesting issues w/r/t extracting data from both spreadsheets and unifying into one dataset
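
A rough sketch of what the reshape, the join, and the two-wave unification could look like, written in plain Python so it runs without the xlsx files. Everything in it is made up for illustration: the column names (`taxon`, `sample`, `navel_type`), the sample IDs, the counts, and the assumption that samples are stored as columns. The real worksheets would first need to be exported to csv or read with a spreadsheet library, and their actual layout may differ.

```python
import csv
import io

# Toy stand-ins for the data worksheet of each wave (hypothetical columns and values).
WAVE1_WIDE = """taxon,S001,S002
Staphylococcus,120,30
Corynebacterium,45,0
"""
WAVE2_WIDE = """taxon,S101
Staphylococcus,12
Micrococcus,7
"""
# Toy stand-in for the metadata worksheet, one row per sample (hypothetical).
METADATA = """sample,navel_type
S001,innie
S002,outie
S101,innie
"""


def read_rows(text):
    """Parse CSV text into a list of dicts, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


def wide_to_long(rows, id_col, wave):
    """Reshape: one output row per (taxon, sample) pair instead of one column per sample."""
    long_rows = []
    for row in rows:
        for sample, count in row.items():
            if sample == id_col:
                continue
            long_rows.append(
                {"wave": wave, id_col: row[id_col], "sample": sample, "count": int(count)}
            )
    return long_rows


def left_join(rows, lookup_rows, key):
    """Left join: add matching lookup fields to each row; unmatched rows keep only their own fields."""
    lookup = {r[key]: r for r in lookup_rows}
    return [{**row, **lookup.get(row[key], {})} for row in rows]


# Stack both waves into one long table, then join the per-sample metadata.
long_rows = wide_to_long(read_rows(WAVE1_WIDE), "taxon", wave=1) + wide_to_long(
    read_rows(WAVE2_WIDE), "taxon", wave=2
)
tidy = left_join(long_rows, read_rows(METADATA), key="sample")

for row in tidy:
    print(row)
```

Keeping a `wave` column in the stacked table is one way to make any differences between the two spreadsheets visible after unifying them, rather than hiding them.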
Of course the link above is dead but I have this: https://github.com/jennybc/bellybutton
Just to finish the link circle: http://robdunnlab.com/projects/belly-button-biodiversity/
And don't worry, there is a petri dish portrait series: https://microbialart.tumblr.com/