python-ecology-lesson
Incompatible file (SQLite) among different download links
This is regarding the download links for data files in the setup page: http://www.datacarpentry.org/python-ecology-lesson/setup.html
The portal_mammals.sqlite file from figshare (the teaching database) is not the same as the one you get from the following download links, which are provided in the same section:
- http://www.datacarpentry.org/python-ecology-lesson/data/portal_mammals.sqlite
- https://minhaskamal.github.io/DownGit/#/home?url=https://github.com/datacarpentry/python-ecology-lesson/tree/gh-pages/data (This is the one associated with "Download these files to your computer either by clicking this link").
In particular, the SQLite file that comes from these two links does not have a "hindfoot_length" column in the "surveys" table. And instead of the "weight", "plot_id" and "species_id" columns, it has "wgt", "plot" and "species". This caused some confusion when we transitioned from the Python module to the SQL module in a workshop last week.
I recommend standardizing the data files across all modules on the figshare version and removing the two links from the page. From what I understand, the different versions of the SQLite file do not affect the "Accessing SQLite Databases Using Python and Pandas" section.
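A quick way to tell which version of the file a learner has is to list the columns of the "surveys" table. The following is only a sketch: the function names are made up for illustration, and the column names checked are the ones mentioned above, not a full schema.

```python
import sqlite3


def table_columns(conn, table="surveys"):
    """Return the column names of `table` for an open SQLite connection."""
    # PRAGMA table_info yields one row per column; index 1 holds the name.
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    return [row[1] for row in rows]


def has_figshare_columns(conn):
    """True if `surveys` has the column names reported for the figshare file."""
    expected = {"hindfoot_length", "weight", "plot_id", "species_id"}
    return expected <= set(table_columns(conn))


# Usage (assumes portal_mammals.sqlite is in the working directory):
# conn = sqlite3.connect("portal_mammals.sqlite")
# print(table_columns(conn))
# print(has_figshare_columns(conn))
# conn.close()
```

Note that `sqlite3.connect` creates an empty database when the path does not exist, in which case the column list comes back empty.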
@kokbent, I'm not sure what caused this discrepancy but in order for the lesson to be complete, it has to provide data files. The two "links" that you specify point to the same data -- just two different ways of getting it.
In the notes to the data on the figshare page, it says that the data is, actually, stored on GitHub at https://github.com/weecology/portal-teachingdb
So, here is what we can do:
- Add an additional DownGit (minhaskamal) link to that repo: https://minhaskamal.github.io/DownGit/#/home?url=https://github.com/weecology/portal-teachingdb
- Add a link to download that repository as a zip archive: https://github.com/weecology/portal-teachingdb/archive/master.zip
- Update our files in the data directory
- Update instructions where necessary
CC @wrightaprilm @trallard
Thanks for your patience on this; I've been traveling. I'm somewhat confused by the issue here. As @maxim-belkin notes, these two links point to the same data. A diff of the two shows them to be the same, and it looks like they have the same columns. Could you link to the workshop webpage? I'd like to reproduce the error myself.
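For anyone reproducing this, one way to check whether two downloaded copies are byte-identical is to compare checksums. This is a minimal sketch; the function names and the example file names in the comment are placeholders, not files from the lesson.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_file_contents(path_a, path_b):
    """True if both files have the same SHA-256 digest."""
    return sha256_of(path_a) == sha256_of(path_b)


# Usage (hypothetical local file names for two of the downloads):
# print(same_file_contents("lesson_copy.sqlite", "figshare_copy.sqlite"))
```

Identical digests only show that the files match byte for byte. Two SQLite files can differ in bytes and still hold the same tables, so a schema comparison is the more direct check for the column issue described here.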
Here's the "Data" section of the setup page: https://datacarpentry.org/python-ecology-lesson/setup.html
Data for this lesson is from the Portal Project Teaching Database - available on FigShare. We will use the eight files listed below for the data in this lesson. Download these files to your computer either by clicking this link, which will give you everything in a single compressed file. You’ll need to unzip this file after downloading it. Or download each file individually with the following links: surveys.csv, species.csv, speciesSubset.csv, surveys2001.csv, surveys2002.csv, plots.csv, bouldercreek_09_2013.txt, SQL Database
There are three ways of downloading the sql database according to this section:
- Through the figshare link: https://figshare.com/articles/Portal_Project_Teaching_Database/1314459
- Through http://www.datacarpentry.org/python-ecology-lesson/data/portal_mammals.sqlite
- Through https://minhaskamal.github.io/DownGit/#/home?url=https://github.com/datacarpentry/python-ecology-lesson/tree/gh-pages/data
The SQLite database from the figshare link is different from the one you get from the other two links. The figshare one is consistent with the SQL lesson; the other two are not.
Figshare link's DB is like this:
The DB from the other two links (which are the same) is like this:
Ah, I see now. The figshare is a raw version of the portal DB, which gets converted for use with a python script. The surveys.csv and sqlite file we use are the outputs of the conversion. So this is here for data citation, not necessarily to download and use. So I propose we change the wording from "Data for this lesson is from the Portal Project Teaching Database - available on FigShare." to "Data for this lesson come from a published data set by Ernest et al., and can be viewed in their raw form on Figshare."
From a workshop instructor's point of view, is it possible to change the SQLite database here to the Figshare version? Half of my learners stuck with the "python version" rather than the "figshare version" (the one that is used in the SQL lesson, see https://datacarpentry.org/sql-ecology-lesson/setup.html) because they thought the SQLite file they got from the python lesson would be the same as the one for the SQL lesson.
Oh, interesting, I didn't actually realize that the SQL lesson was doing something different (R uses ours). That seems like a fine strategy to me. @trallard and @maxim-belkin on board?
I think this issue should be ~escalated to~ brought to the attention of CAC
Fine by me. @fmichonneau - how do we ask for the CAC's help on this? To catch you up, there is an SQL file for the Python interacting with SQL lesson. It is different than the one used in the SQL lesson, and we're not sure how we want to resolve the issue (i.e., keep ours vs. adapt to the SQL lesson).
The CAC for the Ecology curriculum hasn't been established yet. I'm happy to provide guidance in the interim. I'm also CC'ing @tracykteal in case she wants to chime in.
I'd suggest the python lesson get modified to use the version of the SQLite file that is hosted on Figshare. In general, none of the Data Carpentry lessons should host their own data files and instead rely on data hosted on Figshare.
OK, I moved the setup to FigShare links on my fork. We use a couple files for join and the capstone that aren't on Figshare. Two files (surveys2001 and surveys2002) are provided as examples of output. I think we should:
- Run through the lessons quickly and make sure they work with the Figshare data before merging
- Move surveys2001 and 2002 to sample output, or just delete them
- Add the other two data files to a figshare project so we have an external source. Alternatively, I think it would be OK to delete speciesSubset, and have them take the first couple of rows of the species.csv file as their dataset for the join activity. It's really just provided so there is a small enough dataset to see the different joins clearly and non-overwhelmingly.
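The "first couple of rows" alternative could look roughly like the sketch below. It uses only the standard library; `first_rows` is a hypothetical helper, and in the lesson itself a pandas `head()` call on the species DataFrame would play the same role.

```python
import csv
from itertools import islice


def first_rows(csv_file, n=5):
    """Return the first `n` data rows of an open CSV file as a list of dicts."""
    reader = csv.DictReader(csv_file)  # the first line is used as the header
    return list(islice(reader, n))


# Usage (assumes species.csv is in the working directory):
# with open("species.csv", newline="") as handle:
#     species_sub = first_rows(handle, n=5)
```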
Alright, so we've changed over so the text goes to a download of the data off FigShare, as opposed to relying on getting the data off GitHub. The Boulder Creek data was contributed by you, @lwasser, I think? Could you point @fmichonneau to the original source of the data?
i don't remember contributing these data to this lesson or working on it (unless it was forever and ever ago??) ...BUT i recognize the data and the data are NWIS data and you get them here
https://waterdata.usgs.gov/nwis/inventory?search_station_nm=Boulder&search_station_nm_match_type=beginning&state_cd=co&format=station_list&group_key=NONE&list_of_search_criteria=state_cd%2Csearch_station_nm
you can get data for each site and download it. does that help?
It would have been about 3.5 years ago now ... so forever;) @fmichonneau, let me know if this is what you need!
My goodness how time flies. Well I can help get the data so just let me know if you can’t find it. You’ll have to find the right site and then you can subset to whatever time period you’d like. Also I think the raw data will have a big header at the top. There could be a little R API wrapper for these data. There is a Python one.
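If the raw download does start with a big header, it can be skipped before parsing. The sketch below assumes the header lines start with "#"; that marker is an assumption about the download format, and `skip_header` is a hypothetical helper, so check the actual file first.

```python
from itertools import dropwhile


def skip_header(lines, marker="#"):
    """Drop the leading run of lines that start with `marker`."""
    # Only the header block at the top is removed; later lines are kept as-is.
    return list(dropwhile(lambda line: line.startswith(marker), lines))


# Usage (assumes the downloaded file was saved under this name):
# with open("bouldercreek_09_2013.txt") as handle:
#     data_lines = skip_header(handle)
```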
ping @fmichonneau. This is back on my radar since now we have two sets of lessons using the same data (dc-py-es). It would be great if we could move away from GitHub and to Figshare, and apply that solution to both repos.
Fixed in #309