
Incompatible file (SQLite) among different download links

Open kokbent opened this issue 6 years ago • 15 comments

This is regarding the download links for data files in the setup page: http://www.datacarpentry.org/python-ecology-lesson/setup.html

The portal_mammals.sqlite file from figshare (the teaching database) is not the same as the one you get from the following download links, which are provided in the same section:

  1. http://www.datacarpentry.org/python-ecology-lesson/data/portal_mammals.sqlite
  2. https://minhaskamal.github.io/DownGit/#/home?url=https://github.com/datacarpentry/python-ecology-lesson/tree/gh-pages/data (This is the one associated with "Download these files to your computer either by clicking this link").

In particular, the SQLite file that comes with the two links does not have a "hindfoot_length" column in the "surveys" table. And instead of "weight", "plot_id" and "species_id" columns, it has "wgt", "plot" and "species". This caused some confusion as we transitioned from the Python module to the SQL module in a workshop last week.
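For anyone who wants to check which version of the file they have, here is a minimal sketch using Python's standard `sqlite3` module. The helper name `table_columns` is made up for illustration; it just lists the column names of one table so they can be compared against the names quoted above.

```python
import sqlite3


def table_columns(db_path, table):
    """Return the column names of `table` in the SQLite file at `db_path`.

    Note: sqlite3.connect() creates an empty file if `db_path` does not
    exist, in which case this returns an empty list.
    """
    con = sqlite3.connect(db_path)
    try:
        # PRAGMA table_info yields one row per column; index 1 is the name.
        rows = con.execute(f'PRAGMA table_info("{table}")').fetchall()
    finally:
        con.close()
    return [row[1] for row in rows]
```

Calling `table_columns("portal_mammals.sqlite", "surveys")` on each downloaded copy should show whether it has "weight" or "wgt", and whether "hindfoot_length" is present.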

I recommend standardizing the data files across all modules on the figshare version and removing the two links from the page. From what I understand, the different versions of the SQLite file do not affect the "Accessing SQLite Databases Using Python and Pandas" section.

kokbent avatar Jul 02 '18 17:07 kokbent

@kokbent, I'm not sure what caused this discrepancy but in order for the lesson to be complete, it has to provide data files. The two "links" that you specify point to the same data -- just two different ways of getting it. In the notes to the data on figshare page, it says that the data is, actually, stored on GitHub at https://github.com/weecology/portal-teachingdb

So, here is what we can do:

  1. Add an additional minhaskamal (DownGit) link to that repo: https://minhaskamal.github.io/DownGit/#/home?url=https://github.com/weecology/portal-teachingdb
  2. Add a link to download that repository as a zip archive: https://github.com/weecology/portal-teachingdb/archive/master.zip
  3. Update our files in data directory
    • Update instructions where necessary

CC @wrightaprilm @trallard

maxim-belkin avatar Jul 02 '18 18:07 maxim-belkin

Thanks for your patience on this; I've been traveling. I'm somewhat confused by the issue here. As @maxim-belkin notes, these two links point to the same data. A diff of the two shows them to be the same, and it looks like they have the same columns. Could you link to the workshop webpage? I'd like to reproduce the error myself.

wrightaprilm avatar Jul 12 '18 22:07 wrightaprilm

Here's the "Data" section of the setup page: https://datacarpentry.org/python-ecology-lesson/setup.html

Data for this lesson is from the Portal Project Teaching Database - available on FigShare. We will use the eight files listed below for the data in this lesson. Download these files to your computer either by clicking this link, which will give you everything in a single compressed file. You’ll need to unzip this file after downloading it. Or download each file individually with the following links: surveys.csv, species.csv, speciesSubset.csv, surveys2001.csv, surveys2002.csv, plots.csv, bouldercreek_09_2013.txt, SQL Database

There are three ways of downloading the sql database according to this section:

  1. Through the figshare link: https://figshare.com/articles/Portal_Project_Teaching_Database/1314459
  2. Through http://www.datacarpentry.org/python-ecology-lesson/data/portal_mammals.sqlite
  3. Through https://minhaskamal.github.io/DownGit/#/home?url=https://github.com/datacarpentry/python-ecology-lesson/tree/gh-pages/data

The SQLite database from the figshare link is different from the one served by the other two links. The figshare one is consistent with the SQL lesson; the other two are not.

The DB from the figshare link looks like this: (screenshot: sqlite1)

The DB from the two other links (which are the same) looks like this: (screenshot: sqlite2)
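The same comparison can be made without screenshots. Below is a rough sketch, again using only the standard `sqlite3` module, that reports which columns of a table appear in one file but not the other. The function names and the default table name "surveys" are my own choices for illustration, not anything taken from the lesson.

```python
import sqlite3
from contextlib import closing


def column_set(db_path, table="surveys"):
    """Return the set of column names of `table` in one SQLite file."""
    with closing(sqlite3.connect(db_path)) as con:
        # PRAGMA table_info yields one row per column; index 1 is the name.
        return {row[1] for row in con.execute(f'PRAGMA table_info("{table}")')}


def column_diff(path_a, path_b, table="surveys"):
    """Return (columns only in A, columns only in B), each sorted by name."""
    a = column_set(path_a, table)
    b = column_set(path_b, table)
    return sorted(a - b), sorted(b - a)
```

If the two downloads really differ as described in this issue, `column_diff(figshare_copy, lesson_copy)` should list names such as "hindfoot_length" and "weight" on one side and "wgt" on the other.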

kokbent avatar Jul 12 '18 23:07 kokbent

Ah, I see now. The figshare is a raw version of the portal DB, which gets converted for use with a python script. The surveys.csv and sqlite file we use are the outputs of the conversion. So this is here for data citation, not necessarily to download and use. So I propose we change the wording from "Data for this lesson is from the Portal Project Teaching Database - available on FigShare." to "Data for this lesson come from a published data set by Ernest et al., and can be viewed in their raw form on Figshare."

wrightaprilm avatar Jul 13 '18 13:07 wrightaprilm

From a workshop instructor point of view, is it possible to change the SQLite database here to that of Figshare's version? I had half of the learners sticking to the "python version" rather than the "figshare version" (the one that is used in the SQL lesson, see https://datacarpentry.org/sql-ecology-lesson/setup.html) because they thought the SQLite file they got from the python lesson would be the same as the SQL lesson's one.

kokbent avatar Jul 13 '18 18:07 kokbent

Oh, interesting, I didn't actually realize that the SQL lesson was doing something different (R uses ours). That seems like a fine strategy to me. @trallard and @maxim-belkin on board?

wrightaprilm avatar Jul 17 '18 21:07 wrightaprilm

I think this issue should be ~escalated to~ brought to the attention of CAC

maxim-belkin avatar Jul 17 '18 21:07 maxim-belkin

Fine by me. @fmichonneau - how do we ask for the CAC's help on this? To catch you up, there is an SQL file for the Python interacting with SQL lesson. It is different than the one used in the SQL lesson, and we're not sure how we want to resolve the issue (i.e., keep ours vs. adapt to the SQL lesson).

wrightaprilm avatar Jul 17 '18 21:07 wrightaprilm

The CAC for the Ecology curriculum hasn't been established yet. I'm happy to provide guidance in the interim. I'm also CC'ing @tracykteal in case she wants to chime in.

I'd suggest the python lesson get modified to use the version of the SQLite file that is hosted on Figshare. In general, none of the Data Carpentry lessons should host their own data files and instead rely on data hosted on Figshare.

fmichonneau avatar Jul 20 '18 19:07 fmichonneau

OK, I moved the setup to FigShare links on my fork. We use a couple files for join and the capstone that aren't on Figshare. Two files (surveys2001 and surveys2002) are provided as examples of output. I think we should:

  • Run through the lessons quickly and make sure they work with the Figshare data before merging
  • Move surveys2001 and 2002 to sample output, or just delete them
  • Add the other two datafiles to a figshare project so we have an external. Alternatively, I think it would be OK to delete speciesSubset, and have them take the first couple rows of the species.csv file as their dataset for the join activity. It's really just provided so there is a small enough dataset to see the different joins clearly and non-overwhelmingly.

wrightaprilm avatar Aug 03 '18 21:08 wrightaprilm

Alright, so we've changed over so the text points to a download of the data from FigShare, as opposed to relying on getting the data off GitHub. The Boulder Creek data was contributed by you, @lwasser, I think? Could you point @fmichonneau to the original source of the data?

wrightaprilm avatar Sep 17 '18 18:09 wrightaprilm

i don't remember contributing these data to this lesson or working on it (unless it was forever and ever ago??) ...BUT i recognize the data and the data are NWIS data and you get them here

https://waterdata.usgs.gov/nwis/inventory?search_station_nm=Boulder&search_station_nm_match_type=beginning&state_cd=co&format=station_list&group_key=NONE&list_of_search_criteria=state_cd%2Csearch_station_nm you can get data for each site and download it. does that help?

lwasser avatar Sep 17 '18 18:09 lwasser

It would have been about 3.5 years ago now ... so forever;) @fmichonneau, let me know if this is what you need!

wrightaprilm avatar Sep 17 '18 18:09 wrightaprilm

My goodness how time flies. Well I can help get the data so just let me know if you can’t find it. You’ll have to find the right site and then you can subset to whatever time period that you’d like. Also I think the raw data will have a big header at the top. There could be a little r API wrapper for these data. There is a python one.
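In case it helps with the header, here is a small sketch of reading such a file with only the Python standard library. It assumes the usual USGS tab-delimited (RDB) layout: comment lines starting with "#", then a row of column names, then a row of field-format codes such as "5s", then the data. That layout is an assumption on my part and the actual download may differ; `read_rdb` is a hypothetical helper name.

```python
import csv


def read_rdb(lines):
    """Parse tab-delimited text with a '#' comment header into a list of dicts.

    `lines` is any iterable of text lines, e.g. an open file object.
    Assumes the first non-comment row holds column names and the second
    holds field-format codes, which are skipped.
    """
    data_lines = (ln for ln in lines if not ln.startswith("#"))
    reader = csv.reader(data_lines, delimiter="\t")
    header = next(reader)
    next(reader)  # skip the field-format row (assumed to be present)
    # Blank lines come back from csv.reader as empty lists; drop them.
    return [dict(zip(header, row)) for row in reader if row]
```

With a downloaded file this would be used as `with open(path, newline="") as fh: rows = read_rdb(fh)`. All values come back as strings, so dates and measurements still need converting.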

lwasser avatar Sep 17 '18 19:09 lwasser

ping @fmichonneau. This is back on my radar since now we have two sets of lessons using the same data (dc-py-es). It would be great if we could move away from GitHub and to Figshare, and apply that solution to both repos.

wrightaprilm avatar Dec 04 '18 17:12 wrightaprilm

Fixed in #309

btovar avatar May 19 '23 11:05 btovar