How to handle large data sets
This topic is touched on in several threads, so I thought I'd start a dedicated one. A while back I had a large data set to deal with, and ever since I've been watching R-pkg-devel, among other lists, for this topic. Going through my saved e-mails, I found this discussion, which seems quite relevant; in particular, the last two messages mention R.cache and drat. drat has several vignettes. These look like promising ways to package the data and access it as/when needed. The issue of where to put the data remains, of course.
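For context, the drat route would mean publishing the data as a separate package in a personal CRAN-like repository. A minimal sketch, assuming a hypothetical data package `hyperSpec.chondro` and a GitHub Pages drat repository (all names here are placeholders, not existing infrastructure):

```r
# Maintainer side: insert the built data package into a local drat
# repository checkout (paths and package name are hypothetical):
# drat::insertPackage("hyperSpec.chondro_0.1.tar.gz",
#                     repodir = "~/git/drat")
# ...then commit and push the repo so it is served via GitHub Pages.

# User side: install the data package from that repository:
# install.packages("hyperSpec.chondro",
#                  repos = "https://r-hyperspec.github.io/drat")
```

The appeal of this approach is that the large data never inflates the main package, while installation stays a one-liner for users.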
This discussion might also be useful: https://www.r-bloggers.com/persistent-config-and-data-for-r-packages/
This will be relevant for chondro.
Record of GSoC weekly video call on 2020-05-18:
- `chondro` will be replaced by a synthetic data set (#114, #125)
- `flu`, `laser`, and `paracetamol` stay as they are: they are sufficiently small not to hinder anything. They may be replaced by "static" versions rather than requiring them to be generated via `make` (#132)
- `barbiturates` may be replaced by a synthetic data set in the future, but not now
I will try to move chondro into a separate package with the help of drat.
I haven't delved deep enough to understand the essence of the problem, but I will state my point of view and ask for explanations.
While I was translating some of the vignettes, it was not clear to me why datasets like chondro are created again and again every time the package is built. (As a reader of a vignette, I couldn't fully reproduce them, as it was not clear from the vignette where I should find the data; but that is another story.) It is also unclear to me why the original spectroscopic files, and the instructions on how to create the datasets used to illustrate the capabilities of hyperSpec, are not in a `data-raw/` folder. From my point of view, the whole procedure is too complicated and could be simplified (though most probably I am missing something important here). In my opinion, datasets like chondro should be created only once and converted into regular package data, e.g., by using `usethis::use_data(chondro)`. You may read more about this at https://r-pkgs.org/data.html and in the documentation of:
?usethis::use_data_raw
So, could you summarize why this process of building the example datasets again and again is needed? Is it for unit testing?
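To make the build-once suggestion above concrete, here is a minimal sketch of the `usethis` workflow (the file contents and the `read.txt.Renishaw()` call are illustrative assumptions, not hyperSpec's actual build script):

```r
# One-time setup: creates data-raw/chondro.R and adds data-raw/ to
# .Rbuildignore (run interactively, inside the package directory):
# usethis::use_data_raw("chondro")

# data-raw/chondro.R (hypothetical contents):
# chondro <- hyperSpec::read.txt.Renishaw("chondro.txt", data = "xyspc")
# ... any cleaning / preprocessing ...
# usethis::use_data(chondro, overwrite = TRUE, compress = "xz")

# The script is run manually, once per data update. The package then
# ships data/chondro.rda, and users simply call:
# data(chondro, package = "hyperSpec")
```

With this layout, the raw files and the generation script live in `data-raw/`, which is excluded from the built package, so package size is unaffected.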
`data(package = "hyperSpec")`

returns the following:

Some datasets (e.g., chondro) are not present in the list. Why is that?
@GegznaV you are basically correct: the storage and (re)generation of the data is complicated and a bit opaque. This summer we have a student, @eoduniyi, working on streamlining the whole package, thanks to Google Summer of Code. Data handling is getting a close look, but it will take a while to address the wide range of issues.
The reason for "no `data-raw/`" is basically history:

- The "externally built" vignettes in hyperSpec were around before `.Rbuildignore` existed. So back then, the only possibility was to have `Sweave` documents somewhere separate and then copy what should go into the package to the appropriate place of the package directory structure. The `data-raw/` convention is AFAIK quite recent (advanced R book?).
- And yes, the regeneration, in particular of `fileio`, is/was basically a poor man's unit test, since the underlying files could not be shipped with hyperSpec: that would have meant a package size >> 100 MB (there are/were even some code chunks in there labeled as unit tests).
- Early on, the internals of hyperSpec objects changed every once in a while. Regenerating the objects from their raw data ensured that things kept working.
- `chondro` is special in that it would be too large to ship with hyperSpec. I therefore decided to ship a basically PCA-compressed version. The parts of that data set are internal (in `sysdata.rda`), and the `chondro` object is created on the fly when required (via `delayedAssign()`). This apparently has the side effect that it looks to R like a normal variable rather than a data set, including the need to `@export` it.
The same will probably be the case with @bryanhanson's synthetic data set.
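To illustrate the mechanism described above with a toy example (the variable names and the rank-2 compression here are invented for illustration; the real compressed parts live in hyperSpec's `sysdata.rda`):

```r
# Toy "PCA-compressed" parts standing in for the internal data:
set.seed(1)
X   <- matrix(rnorm(20), nrow = 5)           # original data (5 x 4)
pca <- prcomp(X, center = TRUE, rank. = 2)   # keep 2 components
.scores   <- pca$x                           # 5 x 2
.loadings <- pca$rotation                    # 4 x 2
.center   <- pca$center

# delayedAssign() creates a promise: the full object is reconstructed
# from the compressed parts only when it is first accessed.
delayedAssign("chondro_demo",
              sweep(.scores %*% t(.loadings), 2, .center, "+"))

# `chondro_demo` now behaves like a normal variable; the reconstruction
# happens lazily, on first use:
dim(chondro_demo)  # 5 4
```

Because the object is a promise in the namespace rather than a file under `data/`, `data(package = "hyperSpec")` does not list it, which explains the observation above that chondro is missing from the dataset list.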
I found this package and an interesting discussion of options while looking for something else. We should look it over before going down any path.
Another post that might suggest some options: https://blog.r-hub.io/2020/05/29/distribute-data/
A recent change on R devel might be of some use, but might also mess us up: https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17777