
Guidance/support for packages that provide data access

Open ateucher opened this issue 7 years ago • 19 comments

For packages that provide access to data, I can think of three common patterns:

  1. Data packaged locally in the data/ folder of the package and accessed with data(data-object) or data-object (for example this blog post, and of course the data chapter in R Packages); a minimal sketch of this pattern follows the list.

  2. Data accessed by wrapping a web API (AFAIK best practices are well documented here)

  3. Data that is too large or too frequently updated to bundle in a package, but not available via an API - the best method I know of for dealing with this is the datastorr package.
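
For concreteness, a minimal sketch of pattern 1 (the package and dataset names are hypothetical placeholders):

# assumes a hypothetical package "mypkg" that ships a dataset "mydata" in its data/ folder
data("mydata", package = "mypkg")  # explicit load into the current environment
head(mydata)
mypkg::mydata                      # or, with LazyData: true, access it directly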

A couple of things that I think could use a bit more fleshing out are:

  • Guidance on data licenses (especially in pattern 1 above): how to host data on GitHub and distribute it in a package in accordance with the terms of the data's license (particularly if different data sets are under different licenses). Related to #32.
  • Caching strategies for 2 and 3 (how and when): datastorr does a lot of this for you, but CRAN has policies around writing to a user's disk - see discussion here.

There is a lot of support out there for these methods, but since so many rOpenSci packages provide access to data using one or more of these patterns, I wonder whether putting together a guide of best practices that rOpenSci can host and point package authors to would be helpful (even if it mostly just links out to other resources).

I know we're getting a lot of ideas for the unconf, so I'm happy to close this if it seems redundant or unnecessary.

ateucher avatar Apr 25 '17 17:04 ateucher

One wrinkle that I think is worth mentioning is the use of rappdirs as a coarse (more user-friendly, in my opinion) way of achieving functionality similar to datastorr.
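
For reference, a minimal sketch of that kind of rappdirs-based caching (the package name, URL, and file name are all hypothetical):

library(rappdirs)
# platform-appropriate, per-user cache directory for the (hypothetical) package
cache_dir <- user_cache_dir("mypkg")
dir.create(cache_dir, recursive = TRUE, showWarnings = FALSE)
dest <- file.path(cache_dir, "mydata.csv")
# download once, then reuse the cached copy on subsequent calls
if (!file.exists(dest)) {
  download.file("https://example.org/mydata.csv", dest, mode = "wb")
}
mydata <- read.csv(dest)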

jsta avatar Apr 25 '17 18:04 jsta

Re: licenses, I just wondered about typical licenses for data packages. As a very crude first step, I pulled the descriptions of CRAN packages that use any sort of Creative Commons license or the 'Unlimited' license for inspection (also filtered for NeedsCompilation: no):

https://github.com/jennybc/cran-data-pkg-licenses#readme

I only did a quick skim through, but it might be interesting to do this properly. It really should go the other way around: among data packages, what licenses are in use? But then you have the problem of determining what a data package is. I wish there were some sort of keyword or field that indicated that.
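
A rough sketch of that kind of filter using tools::CRAN_package_db() (the repo linked above may well have done it differently):

# keep packages with a Creative Commons or 'Unlimited' license that don't need compilation
db <- tools::CRAN_package_db()
cc <- db[grepl("^CC|Creative Commons|Unlimited", db$License) &
           db$NeedsCompilation == "no",
         c("Package", "License")]
head(cc)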

jennybc avatar Apr 26 '17 14:04 jennybc

@jsta I've also used rappdirs and agree it's really handy for caching. datastorr uses it as well. datastorr is specifically designed to deal with data stored in GitHub releases and so for other sources it wouldn't work so well. A more general approach is storr, which when used in conjunction with rappdirs is pretty slick.
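
Roughly, the storr + rappdirs combination looks like this (the package name, key, and fetch function are hypothetical):

library(storr)
# an rds-backed key-value store living in a per-user cache directory
st <- storr_rds(file.path(rappdirs::user_cache_dir("mypkg"), "storr"))
if (!st$exists("mydata")) {
  st$set("mydata", fetch_mydata())  # fetch_mydata() stands in for whatever downloads the data
}
mydata <- st$get("mydata")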

ateucher avatar Apr 26 '17 17:04 ateucher

@jennybc that is really interesting, thanks! Yes, it would be helpful to try to do it the other way round - definitely worth thinking about.

An interesting one in the list you generated is @gaborcsardi's igraphdata. It is CC BY-SA 4.0 + file LICENSE, where the LICENSE file lists the license for each individual data set. In addition, each dataset's man page lists the license of that data. I'm guessing that since each dataset has a distinct license, the overarching CC BY-SA 4.0 applies mainly to the documentation?

ateucher avatar Apr 26 '17 17:04 ateucher

@ateucher that's correct, the CC BY-SA 4.0 is for the "package". CRAN does not allow a package without a license for the "package as a whole". Which makes sense.

gaborcsardi avatar Apr 26 '17 17:04 gaborcsardi

Another option is to put the data into an appropriate scientific data archive, such as Zenodo, figshare, or KNB. In many ways this is similar to pattern 2 above (an API), since these and many other scientific data repositories already provide an API for accessing the data. This has some added benefits that may or may not be relevant depending on the use case:

  • this approach is platform independent; your data is accessible to other platforms
  • you receive a (version-specific) DOI for the data (which has several benefits, such as encouraging data citation, simplifying citation counting, a better archival guarantee, and indexing in DataCite for search and discovery)
  • this data may be more discoverable depending on the metadata provided and the search interface

Of course these features won't be necessary or appropriate for every case, but they are worth being aware of. rOpenSci has a variety of packages to facilitate uploading data to, and downloading data from, such repos.

Most scientific repositories also have clear policies regarding licensing. There's general agreement that in the US data is not subject to copyright provisions (e.g. you cannot copyright a fact because it is a fact and not a creative work, though you can copyright the layout/format etc of how that fact is presented).

It's not 100% clear to me what the implications are of applying an R package license to the "package as a whole"; I would have thought the license on the "package as a whole" had to be compatible with the licenses of all components shipped in the package. Assuming data is 'public domain', then of course it is always compatible. (e.g. compare to rmarkdown, which ships with Bootstrap CSS under an MIT license, among others, while the "package as a whole" is under GPL-3. Of course MIT is GPL-compatible, so there is nothing wrong with this, but it implies there is no copyleft provision on the Bootstrap code.)

cboettig avatar Apr 26 '17 18:04 cboettig

Will second everything @cboettig said. I don't see a strong case for R data packages these days, especially for data from scientific papers. It's just messy all around. I do see a good case for data packages when the purpose is for teaching (a bunch of datasets in a CRAN package would make it easy to install even for novices).

I'm not including cases where software is being shipped and also includes some data (or code to retrieve data from a persistent archive).

karthik avatar Apr 26 '17 18:04 karthik

I think @karthik is right in that traditional data packages aren't necessarily the right way to go most of the time. I think sorting out best practices for point 3 in my original list is a worthwhile task at some point, but maybe not for the unconf.

I'm happy to let this issue lie, but I think if #32 gets traction we should make sure that data licensing is a topic that is also covered.

ateucher avatar Apr 28 '17 16:04 ateucher

Is guidance for documenting usage included in the scope of this project? It would be a useful task to find data packages that don't have vignettes, or don't have examples for the data functions, and contact the package maintainers to ask them to write the documentation.

Providing a template vignette, or links to existing good vignettes for data packages, would increase the chance that these maintainers would take heed.
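
As a starting point, a minimal sketch of what documenting a dataset with roxygen2 can look like (the name, fields, and source are made up):

#' Hypothetical example dataset
#'
#' A few sentences describing what the data are and how they were collected.
#'
#' @format A data frame with 100 rows and 2 variables:
#' \describe{
#'   \item{site}{character; site identifier}
#'   \item{count}{integer; number of observations at the site}
#' }
#' @source \url{https://example.org/where-the-data-came-from}
"mydata"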

richierocks avatar May 11 '17 19:05 richierocks

@richierocks I would say that's definitely in scope - good idea

ateucher avatar May 12 '17 22:05 ateucher

As a follow-up to my earlier comment, I have created a template repo that showcases my current lightweight rappdirs strategy: https://github.com/jsta/externalrdata

jsta avatar May 16 '17 12:05 jsta

Loosely related to this: As of today, I am the maintainer for the gpk package, which contains 100 datasets for statistical education.

I thought it would be a good project to polish it up to turn it into a really useful resource for people teaching stats.

richierocks avatar May 24 '17 18:05 richierocks

A quick summary, though with the huge array of amazing projects proposed, I'm not sure this one is coherent enough to fly at the unconf.

I think that, of the 3 original use cases I listed above, vanilla data packages are pretty well-trodden ground, and perhaps most relevant to packaging datasets for teaching, as @karthik mentioned. Best practices for documenting data, however, may be worth pursuing. @richierocks' gpk package looks like a good template and proving ground for best practices for documenting datasets.

The other piece which I think has a bit of room for work is building off @richfitz's storr and datastorr, and @jsta's externalrdata to compile best practices for caching data downloaded from external sources.

As I mentioned above, I think the subject of data licensing (for both included and externally sourced data) best belongs in #32.

I'm tempted to not propose this as a project at the unconf, but have enjoyed the discussion and will leave the issue open for now in case anyone wants to add anything or disagree with me 😄

ateucher avatar May 24 '17 21:05 ateucher

@ateucher My only comment would be that I think in the end code licensing and data licensing are very different things, but it's worth talking about and helping people understand why that is.

elinw avatar May 25 '17 13:05 elinw

The masterplan: https://docs.google.com/document/d/1LLVym79zX9fG5VGe4yVWgeEH65-r3JngAkEP4b6D7jc/edit?usp=sharing

richierocks avatar May 25 '17 18:05 richierocks

An example of using the gh package: https://github.com/RL10N/RL10N/blob/master/data/scrape_has_po.R

richierocks avatar May 25 '17 18:05 richierocks

We are working in this repo: https://github.com/ropenscilabs/data-packages

ateucher avatar May 25 '17 19:05 ateucher

library(gh)
# search Rd files in GitHub's read-only CRAN mirror (user:cran) for "docType{data",
# i.e. \docType{data}, as a rough way of finding packages with documented datasets
res <- gh("GET /search/code", q = "user:cran extension:Rd docType{data")

ateucher avatar May 25 '17 19:05 ateucher