
Feature Query/Request: Support for Additional_repositories

Open billdenney opened this issue 5 years ago • 6 comments

Does pkgr support the "Additional_repositories" feature of DESCRIPTION files? I would expect support to be complex, since the additional repositories should only be enabled for dependencies of the package that sets the "Additional_repositories" field (and conflicts between those repositories and the repositories used for other packages would then have to be managed separately).

I am planning to use it for an upcoming package to support SDTM, where I will need to host the data files for the controlled SDTM vocabularies somewhere else: an initial look suggests that they will be ~20-40 MB and require updating approximately quarterly.

See the following for considerations:

  • The post that pointed me to the feature: https://stat.ethz.ch/pipermail/r-package-devel/2019q4/004784.html
  • The article about it: https://journal.r-project.org/archive/2017/RJ-2017-026/RJ-2017-026.pdf
  • An example package using it: https://cran.r-project.org/web/packages/grattan/index.html
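
For reference, the relevant part of a DESCRIPTION file would look roughly like this (the suggested package name and the drat URL below are placeholders, not the real ones):

    Package: Rsdtm
    Version: 0.0.1
    Suggests: sdtm.terminology
    Additional_repositories: https://billdenney.github.io/drat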

billdenney avatar Dec 15 '19 18:12 billdenney

Thanks for the request, Bill. The answer is both yes and no.

Yes, in that you could (today) add any other CRAN-like repo below the repos your config already points at, and pkgr will then find those dependencies just fine. No, in that it doesn't scan DESCRIPTION files and look for additional repositories.
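
For example, a pkgr.yml along these lines would already pull dependencies from a CRAN-like drat repo (the repo name and URL are placeholders, and this schema sketch should be checked against the pkgr docs):

    Version: 1
    Packages:
      - Rsdtm
    Repos:
      - CRAN: https://cran.rstudio.com
      - sdtm_drat: https://billdenney.github.io/drat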

Here are some off-the-cuff thoughts:

  • we should definitely support additional_repositories in some capacity
  • I do not like that it's fairly hidden from users; it would not be clear from looking at the yml file what the external dependencies are.

So, what are the potential solutions?

  1. provide a helper that would auto-add any additional_repositories detected

a command such as:

pkgr sync --additional-repositories

I am already thinking about some other synchronization scenarios. For example, @kylebaron brought up the reasonable scenario where, in a pinch, you do a quick install.packages() but ultimately want to make sure those packages get into the pkgr.yml for future use - so we'd have a command that adds to the yml all packages not detected to have been installed by pkgr.

  2. Just do this 'magically' but transparently, so it would auto-name them something like pkgname_repo: url; that way, when planning/installing, it becomes clear that other repos are getting tapped. This could be a default behavior, but one that can also be turned off with a new flag we added, Strict: true (currently this flag requires the library path to already exist in order to install packages, instead of creating it for the user).
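
Purely as an illustration of option 2 (the auto-generated repo name and URL here are hypothetical), the effective configuration during plan/install might surface as:

    Repos:
      - CRAN: https://cran.rstudio.com
      - Rsdtm_repo: https://billdenney.github.io/drat

Setting Strict: true in pkgr.yml would then opt out of the auto-add behavior, per the proposal above.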

This isn't an intensive ask, and we could get it in early next year with the next dev cycle - I'll slot it for 1.0.

dpastoor avatar Dec 15 '19 19:12 dpastoor

Of those two options, I'd prefer the magic option 2. The rationale for 2 is that if I'm asking for a package, and the package knows where its additional packages should come from, then go there. And those additional repositories should be the lowest priority, which assumes that there would not be a conflict between package names in those repositories and the main repositories.

billdenney avatar Dec 16 '19 01:12 billdenney

@billdenney you could try Bioconductor. They don't have the package size requirements and fit well into the CRAN system; I believe that CRAN and Bioconductor are both installed in CRAN checks, which means you can "suggest" a package on Bioconductor (I think).

They have a non-tidy style and prefer camelCase and other such things (though I'm unsure if that is a requirement). I think that SDTM being on a CRAN-friendly repository would be a win.

mattfidler avatar Dec 16 '19 18:12 mattfidler

@mattfidler yes cran plays nice with bioconductor. I actually don't even think

I can't find it off the top of my head, but the RStudio team is actively discussing this on some of the repos I follow, and I saw it pass through the issue tracker recently. In their context it's the same thing: machine learning datasets they wanted to pass around, and they were mulling some sort of caching option.

This also begs the question: is packaging even the right solution?

If you make it Additional_repositories + Suggests, people are almost certainly going to need to 'intervene' in order to pull the packages down, even if that just means running install.packages(..., dependencies = TRUE).
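
For instance, the manual intervention might look something like this in R (the suggested package name and repository URL are placeholders):

    # Pull a suggested package plus its dependencies by hand, pointing
    # install.packages() at the additional repository as well as the defaults;
    # the package name and URL here are hypothetical.
    install.packages(
      "sdtm.terminology",
      repos = c(getOption("repos"), "https://billdenney.github.io/drat"),
      dependencies = TRUE
    )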

What about just hosting them as static assets that can get pulled down by running something like dl_dataset(...)?

This would also give the benefit of simple(r) access outside the package context, cross-functional use (once we all move to Julia :-p), and the ability to cherry-pick specific files - e.g., if you want to run only one example, you don't need to download all the examples.
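
As a minimal sketch of that idea (the base URL, file layout, and .rds storage format are all assumptions):

    # Hypothetical dl_dataset(): fetch one file on demand and cache it locally
    # so repeated calls don't re-download. Uses tools::R_user_dir(), which
    # requires R >= 4.0.
    dl_dataset <- function(name,
                           base_url = "https://example.org/sdtm-data",
                           cache_dir = tools::R_user_dir("Rsdtm", "cache")) {
      dir.create(cache_dir, recursive = TRUE, showWarnings = FALSE)
      dest <- file.path(cache_dir, name)
      if (!file.exists(dest)) {
        utils::download.file(sprintf("%s/%s", base_url, name),
                             destfile = dest, mode = "wb")
      }
      readRDS(dest)
    }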

Don't get me wrong, using a package would give some wins as well, but you're really just looking for an easy way to shuttle data around; you don't really need to leverage the rest of the goodies an R package gives, and this isn't an existing codebase outside CRAN that you want to slurp in.

dpastoor avatar Dec 16 '19 18:12 dpastoor

@mattfidler, good thought! I just checked, and they seem to suggest <= 5 MB for package size as well (https://www.bioconductor.org/developers/package-guidelines/#correctness). I'd guess that they are more flexible about it, though.

I have an initial skeleton here: https://github.com/billdenney/Rsdtm/

My thought is that I will release a data-generator package on CRAN, store the generated data packages on GitHub in a drat repo (or similar), and release the SDTM package on CRAN. That way, everything required is on CRAN, but the simpler way to use it will require a non-CRAN repository.
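
Assuming the drat route (the tarball name and local repo path are placeholders), publishing a built data package would look roughly like:

    # Insert a built data package into a local checkout of the drat repo;
    # committing and pushing to GitHub Pages then serves it as a CRAN-like repo.
    drat::insertPackage(
      "sdtm.terminology_2019.12.1.tar.gz",
      repodir = "~/git/drat"
    )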

billdenney avatar Dec 16 '19 18:12 billdenney

@dpastoor, We were typing at the same time. :)

I did consider the "download a dataset" with caching, but in this particular case, I don't think it's a great fit. If you feel otherwise, please let me know!

The data I'm working with here is already publicly available (https://evs.nci.nih.gov/ftp1/CDISC/SDTM/Archive/), and the work to be done will both compress it and ready it for use in the specific R package. Were someone to make another consumer of it (e.g. Julia), they would need to perform many of the same manipulations to convert those data into a form for their language. So, I don't think that the hosted data set will be of general interest outside of package users.

There are some parts that could be of general interest, such as simpler mappings to convert from older to newer data names, but overall I think those will be minimal.

billdenney avatar Dec 16 '19 18:12 billdenney