How to handle large data sets
This topic is touched on in several threads, so I thought I'd start a dedicated one. A while back I had a large data set to deal with, and ever since I've been watching R-pkg-devel, among other lists, for this topic. Going through my saved e-mails, I found this discussion, which seems quite relevant; in particular, the last two messages mention R.cache and drat. drat has several vignettes. These look like promising ways to package the data and access it as/when needed. The issue of where to put the data remains, of course.
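For context, the drat route would mean publishing the data as a separate package in a personal CRAN-like repository. A minimal sketch, assuming a hypothetical data package `hyperSpec.chondro` and a GitHub Pages drat repository (all names here are placeholders, not existing infrastructure):

```r
# Maintainer side: insert the built data package into a local drat
# repository checkout (paths and package name are hypothetical):
# drat::insertPackage("hyperSpec.chondro_0.1.tar.gz",
#                     repodir = "~/git/drat")
# ...then commit and push the repo so it is served via GitHub Pages.

# User side: install the data package from that repository:
# install.packages("hyperSpec.chondro",
#                  repos = "https://r-hyperspec.github.io/drat")
```

The appeal of this approach is that the large data never inflates the main package, while installation stays a one-liner for users.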
This discussion might also be useful: https://www.r-bloggers.com/persistent-config-and-data-for-r-packages/
This will be relevant for chondro.
Record of GSoC weekly video call on 2020-05-18:
- `chondro` will be replaced by a synthetic data set (#114, #125)
- `flu`, `laser`, and `paracetamol` stay as they are: they are sufficiently small not to hinder anything. They may be replaced by "static" versions rather than requiring them to be generated via `make` (#132)
- `barbiturates` may be replaced by a synthetic data set in the future, but not now
I will try to move chondro into a separate package with the help of drat.
I haven't delved deep enough to understand the essence of the problem, but I will state my point of view and ask for explanations.
While I was translating some of the vignettes, it was not clear to me why datasets like chondro are created again and again every time the package is built. (As a reader of a vignette, I couldn't fully reproduce them, as it was not clear from the vignette where I should find the data; but that is another story.) It is also unclear to me why the original spectroscopic files, and the instructions on how to create the datasets used to illustrate the capabilities of hyperSpec, are not in a `data-raw/` folder. From my point of view, the whole procedure is too complicated and could be simplified (though most probably I am missing something important here). In my opinion, datasets like chondro should be created only once and converted into regular package data, e.g., by using `usethis::use_data(chondro)`. You may read more about this at https://r-pkgs.org/data.html and in the documentation of:
?usethis::use_data_raw
So, could you summarize why this process of building the example datasets again and again is needed? Is it for unit testing?
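To make the build-once suggestion above concrete, here is a minimal sketch of the `usethis` workflow (the file contents and the `read.txt.Renishaw()` call are illustrative assumptions, not hyperSpec's actual build script):

```r
# One-time setup: creates data-raw/chondro.R and adds data-raw/ to
# .Rbuildignore (run interactively, inside the package directory):
# usethis::use_data_raw("chondro")

# data-raw/chondro.R (hypothetical contents):
# chondro <- hyperSpec::read.txt.Renishaw("chondro.txt", data = "xyspc")
# ... any cleaning / preprocessing ...
# usethis::use_data(chondro, overwrite = TRUE, compress = "xz")

# The script is run manually, once per data update. The package then
# ships data/chondro.rda, and users simply call:
# data(chondro, package = "hyperSpec")
```

With this layout, the raw files and the generation script live in `data-raw/`, which is excluded from the built package, so package size is unaffected.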
`data(package = "hyperSpec")`

returns the following:

Some datasets (e.g., chondro) are not present in the list. Why is that?
@GegznaV you are basically correct: the storage and (re)generation of the data is complicated and a bit opaque. This summer we have a student, @eoduniyi, working on streamlining the whole package, thanks to Google Summer of Code. Data handling is getting a close look, but it will take a while to address the wide range of issues.
The reason for "no `data-raw/`" is basically history:

- The "externally built" vignettes in hyperSpec were around before `.Rbuildignore` existed. So back then, the only possibility was to have `Sweave` documents somewhere separate and then copy what should go into the package to the appropriate place of the package directory structure. The `data-raw/` convention is AFAIK quite recent (advanced R book?).
- And yes, the regeneration, in particular of `fileio`, is/was basically a poor man's unit test, since the underlying files could not be shipped with hyperSpec: that would have meant a package size >> 100 MB (there are/were even some code chunks in there labeled as unit tests).
- Early on, the internals of hyperSpec objects changed every once in a while. Regenerating the objects from their raw data ensured that things kept working.
- `chondro` is special in that it would be too large to ship with hyperSpec. I therefore decided to ship a basically PCA-compressed version. The parts of that data set are internal (in `sysdata.rda`), and the `chondro` object is created on the fly when required (via `delayedAssign()`). This apparently has the side effect that it looks to R like a normal variable rather than a data set, including the need to `@export` it.
The same will probably be the case with @bryanhanson's synthetic data set.
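To illustrate the mechanism described above with a toy example (the variable names and the rank-2 compression here are invented for illustration; the real compressed parts live in hyperSpec's `sysdata.rda`):

```r
# Toy "PCA-compressed" parts standing in for the internal data:
set.seed(1)
X   <- matrix(rnorm(20), nrow = 5)           # original data (5 x 4)
pca <- prcomp(X, center = TRUE, rank. = 2)   # keep 2 components
.scores   <- pca$x                           # 5 x 2
.loadings <- pca$rotation                    # 4 x 2
.center   <- pca$center

# delayedAssign() creates a promise: the full object is reconstructed
# from the compressed parts only when it is first accessed.
delayedAssign("chondro_demo",
              sweep(.scores %*% t(.loadings), 2, .center, "+"))

# `chondro_demo` now behaves like a normal variable; the reconstruction
# happens lazily, on first use:
dim(chondro_demo)  # 5 4
```

Because the object is a promise in the namespace rather than a file under `data/`, `data(package = "hyperSpec")` does not list it, which explains the observation above that chondro is missing from the dataset list.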
I found this package and an interesting discussion of options while looking for something else. We should look it over before going down any path.
Another post that might suggest some options: https://blog.r-hub.io/2020/05/29/distribute-data/
A recent change on R devel might be of some use, but might also mess us up: https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17777