Future of RDatasets.jl

Open frankier opened this issue 1 year ago • 15 comments

This repository has quite a few issues asking for more data. There is a large amount of data on CRAN. It's not clear whether the current approach in this package is appropriate for reaching the "long tail" of datasets on CRAN.

As mentioned in this issue comment (https://github.com/JuliaStats/RDatasets.jl/issues/47#issuecomment-445564722), by one measure this package is already complete: it has some data which can be used for testing out Julia stats packages. By another measure it cannot be complete until it contains every dataset on CRAN.

Myself, I am rather interested in having more datasets from the fields of Educational Data Mining and Psychometrics -- hence the recent spate of pull requests. One possibility for making sure everyone gets what they need from this package going forward would be to split out all code into RDatasetsBase and create RDatasets with just some "core" datasets. Then, specific domains can be taken care of by their respective Julia orgs, e.g. Ecology by EcoJulia and Psychometrics by a new Julia org, each of which would have its own package: REcoDatasets, RPsychometricsDatasets and so on.

frankier avatar Aug 30 '22 12:08 frankier

I don't really know Julia, but in many languages the general approach is to make more repositories for the new data, then link them from the original, so that a downstream consumer can easily choose to get just one thing or all the things.

StoneCypher avatar Aug 30 '22 15:08 StoneCypher

There are surely different approaches. It would 100% be possible to replace or supplement this package with one that lazily downloads data from CRAN. The advantage of packaging the data as is currently done is that no network connections are made at runtime, so there's more reliability. There's also no need to have the hassle of cache directories and the like. We also don't have to worry about CRAN going down, accidentally hammering CRAN, or CRAN mirror selection. On the other hand, size would become a concern once there are lots of very large datasets.

My suggestion here was more along the lines of a minimal move from the current approach to something which can also support other Julia orgs where there is a lot of relevant data on CRAN.

frankier avatar Aug 31 '22 06:08 frankier

I have made a minimal initial version of the lazy downloader at https://github.com/frankier/RDataGet.jl

I haven't yet registered the package. Is there interest in transferring this to JuliaStats (keeping me as maintainer)? There is also the possibility of combining this lazy approach with the multiple dataset repo approach, so that different domain orgs can keep repos of datasets without needing to hit CRAN, while everything on CRAN remains available for demos/examples.

frankier avatar Sep 04 '22 09:09 frankier

My preference as a very light user would be that the top [n] datasets covering the most common datasets used (e.g. penguins, iris, etc) are included. And that the package would lazy load the rest that are available.

alecloudenback avatar Sep 04 '22 15:09 alecloudenback

Interesting. I imagine we could support both "standard" datasets like now and also download additional datasets from CRAN. Have you considered using Pkg artifacts or DataDeps.jl for that? They sound like the right tool for this task.

nalimilan avatar Sep 04 '22 20:09 nalimilan

> My preference as a very light user would be that the top [n] datasets covering the most common datasets used (e.g. penguins, iris, etc) are included. And that the package would lazy load the rest that are available.

Yes, I think that this would be possible and the nicest default behavior. There are some edge cases -- e.g. CRAN packages have different versions whereas bundled data has only a single version -- but I think it would be possible to have a reasonable default: use whatever bundled version there is, otherwise always get the newest, while making it possible to get specific versions of any dataset on CRAN if needed.
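The default described above can be sketched roughly as follows. This is a language-neutral illustration in Python, not code from RDatasets.jl or RDataGet.jl: `BUNDLED`, `load_dataset`, and `fake_fetch` are hypothetical names, and the version string is made up. It assumes that asking for a specific version always bypasses the bundled copy.

```python
from typing import Callable, Optional

# Hypothetical stand-in for the datasets shipped inside the package.
BUNDLED = {("datasets", "iris"): "bundled copy of iris"}


def load_dataset(
    package: str,
    name: str,
    fetch: Callable[[str, str, Optional[str]], str],
    version: Optional[str] = None,
) -> str:
    """Prefer the bundled copy; otherwise fetch lazily.

    `fetch(package, name, version)` is an assumed downloader;
    version=None means "newest available".
    """
    key = (package, name)
    if version is None and key in BUNDLED:
        return BUNDLED[key]  # no network access needed
    return fetch(package, name, version)  # lazy path, optionally pinned


def fake_fetch(package: str, name: str, version: Optional[str]) -> str:
    # Stand-in for a real CRAN download so the sketch runs offline.
    return f"{package}/{name}@{version or 'latest'}"


print(load_dataset("datasets", "iris", fake_fetch))
print(load_dataset("palmerpenguins", "penguins", fake_fetch))
print(load_dataset("datasets", "iris", fake_fetch, version="4.1.0"))
```

Passing the downloader in as a parameter keeps the lookup policy separate from how (and from which mirror) data is actually retrieved.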

> Interesting. I imagine we could support both "standard" datasets like now and also download additional datasets from CRAN. Have you considered using Pkg artifacts or DataDeps.jl for that? They sound like the right tool for this task.

Okay great to hear you are interested!

I did take a look at both, but as I understand it, both are really about referring to a static/fixed set of resources. On the other hand, there is the potential for allowing users to specify caching periods beyond a single Julia session, in which case we need some place to store the dataset. Some ducking reveals https://github.com/JuliaPackaging/Scratch.jl which provides per-package data directories and https://github.com/chengchingwen/OhMyArtifacts.jl which allows dynamic artifacts to be stored in them.

Would you be likely to be able to review a pull request based on adding this lazy downloading functionality to RDatasets.jl? Do you think this should build on Scratch.jl or OhMyArtifacts.jl?

frankier avatar Sep 05 '22 06:09 frankier

OhMyArtifacts.jl seems interesting. Scratch.jl is intended for data that is modified locally, which isn't the case here. Feel free to make a PR and I (or others) will try to review it.

nalimilan avatar Oct 12 '22 13:10 nalimilan

I came to this repo wondering if it had been "artifactized" yet and then found this discussion. Artifacts seem like a better fit in principle: these data sets are immutable, can be content-addressed and shared with any packages that want to use them. Serving them as artifacts will also allow our package server system to cache and distribute them globally and ensure reproducibility in case the upstream data sets are modified over time. It's not uncommon for people hosting files to move them, delete them or silently modify them. Artifacts can also be marked as lazy, which will cause them to be downloaded on-demand rather than eagerly.
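To make "immutable and content-addressed" concrete, here is a minimal sketch in Python. It is a simplification and not how Julia's Pkg stores artifacts: Julia identifies an artifact directory by its `git-tree-sha1`, whereas this sketch hashes a single blob with SHA-256. `put` and `get` are hypothetical helper names.

```python
import hashlib
import tempfile
from pathlib import Path


def put(store: Path, data: bytes) -> str:
    """Store `data` under its own hash and return that hash."""
    digest = hashlib.sha256(data).hexdigest()
    path = store / digest
    if not path.exists():  # identical content is stored only once
        path.write_bytes(data)
    return digest


def get(store: Path, digest: str) -> bytes:
    """Read content back and verify it still matches its address."""
    data = (store / digest).read_bytes()
    if hashlib.sha256(data).hexdigest() != digest:
        raise ValueError(f"content for {digest} was modified")
    return data


with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp)
    h1 = put(store, b"sepal_length,species\n5.1,setosa\n")
    h2 = put(store, b"sepal_length,species\n5.1,setosa\n")
    h3 = put(store, b"sepal_length,species\n5.2,setosa\n")
    same_content_same_address = h1 == h2      # deduplicated
    changed_content_new_address = h1 != h3    # upstream edits are detectable
    round_trip_ok = get(store, h1).startswith(b"sepal_length")
```

Because the address is derived from the content, a silently modified upstream file can never be mistaken for the original: it simply has a different address.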

The only issue I can think of with artifacts is that they will not get "garbage collected" unless all the packages that refer to them get garbage collected by the package manager. That means that once you use a lazy dataset artifact referred to by RDatasets it will stay on your system forever unless you manually go in and delete it from the ~/.julia/artifacts directory, which is kind of hard to expect people to do since they have names like 992cbece4d077b391b853cc49316453621f18a07. Scratch spaces don't actually currently improve on this much: they are only cleaned up when all packages that use them are, which is exactly what happens with artifacts. There is, however, a clear_scratchspaces!() function that will delete all scratch spaces, whereas there's no API for cleaning up still-referenced lazy artifacts.

I think the best option might be to just use a mix of eager and lazy artifacts (eager for datasets you want to download by default and lazy for ones you want to provide on demand) and provide an API from RDatasets for cleaning up artifacts.
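One way such a cleanup API could work is for the package to keep a small index of what it fetched on demand, so users never have to recognise hash-named directories themselves. The sketch below is an assumption-laden illustration in Python, not an existing RDatasets.jl or Pkg API: `record_lazy`, `clear_lazy`, the `lazy_index.json` file, and the short fake hashes are all hypothetical.

```python
import json
import tempfile
from pathlib import Path


def record_lazy(store: Path, name: str, digest: str) -> None:
    """Remember that `digest` was fetched on demand under a readable name."""
    index_path = store / "lazy_index.json"
    index = json.loads(index_path.read_text()) if index_path.exists() else {}
    index[name] = digest
    index_path.write_text(json.dumps(index))


def clear_lazy(store: Path) -> list:
    """Delete every lazily fetched entry; return the names removed."""
    index_path = store / "lazy_index.json"
    if not index_path.exists():
        return []
    index = json.loads(index_path.read_text())
    for digest in index.values():
        (store / digest).unlink(missing_ok=True)
    index_path.unlink()
    return sorted(index)


with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp)
    (store / "aaa111").write_text("eager dataset")  # shipped by default
    (store / "bbb222").write_text("lazy dataset")   # fetched on demand
    record_lazy(store, "penguins", "bbb222")
    removed = clear_lazy(store)
    eager_kept = (store / "aaa111").exists()
    lazy_gone = not (store / "bbb222").exists()
```

Only entries listed in the index are removed, so eagerly installed datasets are left alone.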

StefanKarpinski avatar Oct 25 '22 15:10 StefanKarpinski

cc @staticfloat since he might find this discussion interesting

StefanKarpinski avatar Oct 25 '22 15:10 StefanKarpinski

My understanding is also that we cannot realistically use standard artifacts for all datasets that live on CRAN given their number and the fact that they can be updated at any time: that would require updating Artifacts.toml and tagging new releases all the time, right? The distinction between default (eager) and additional (lazy) datasets, possibly handled using different mechanisms, seems more appropriate.

nalimilan avatar Oct 25 '22 15:10 nalimilan

It could be automated but I guess that's pretty annoying. It's unfortunate that this makes RDatasets inherently unreproducible since you can't know what version of a data set was used.

StefanKarpinski avatar Oct 25 '22 15:10 StefanKarpinski

> It's unfortunate that this makes RDatasets inherently unreproducible since you can't know what version of a data set was used.

It's not exactly unreproducible. In the current version of RDatasets.jl, the versions of the datasets are determined by the version of RDatasets, since they are bundled. In practice, I don't think they have changed between versions.

More generally, I see there being a few potential ways artifacts can be used in the context of a package like RDatasets.jl/RDataGet.jl:

  1. Mutable Artifacts.toml in the style of OhMyArtifacts.jl. In this case the artifacts belong to RDatasets.jl. This is useful in that a single copy of each dataset is downloaded+prepared across all packages using it. It's also perhaps the most useful behaviour for REPL usage. It does not provide any guarantee that the dataset you get from CRAN hasn't changed since the RDatasets.jl version was published, but then again why would it? It does guarantee the dataset stays the same from when you first download it. This is probably fine for most usages.
  2. Static Artifacts.toml owned by RDatasets.jl. This would allow for datasets to be unbundled from the package/downloaded lazily, while still providing the RDatasets.jl version => dataset version guarantee.
  3. Static Artifacts.toml owned by a user package. This would allow for a user package version => dataset version guarantee which is probably what we want.

In the last two cases, I believe Artifacts as currently conceived would allow for an artifact without a download section. Users of the library could then ensure it is downloaded using RDatasets.jl before usage.

For me an ideal scenario would be a mix between 1, 2, and 3. CRAN datasets are done using 1, while a manually prepared repository of non-CRAN datasets is dealt with using 2. For either of these, there would be some function intended to be used in the REPL which can import the artifact into a user's Artifacts.toml. This could be a reasonable model for any future "data repository" packages.

One wrinkle is that OhMyArtifacts.jl does not appear to use the same content addressable storage as Artifacts. If it did, we would have the nice property that datasets could be imported into Artifacts.toml without having to be refetched, which would be nice since this step should behave essentially like a git tag: giving a name to a hash in the content addressable store.
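The "git tag" property can be illustrated with a small sketch: if both mechanisms shared one content-addressed store, importing a dataset into a user's manifest would only record a name for a hash that is already present. This is a Python illustration under that assumption, not the behaviour of OhMyArtifacts.jl or Pkg; `store`, `fetch_into_store`, `bind`, and the plain dict standing in for a user's Artifacts.toml are all hypothetical.

```python
import hashlib

store = {}       # digest -> bytes: the shared content-addressed store
fetch_count = 0  # how many times we "hit the network"


def fetch_into_store(data: bytes) -> str:
    """Simulate a download: store the bytes and return their digest."""
    global fetch_count
    fetch_count += 1
    digest = hashlib.sha256(data).hexdigest()
    store[digest] = data
    return digest


def bind(manifest: dict, name: str, digest: str) -> None:
    """Give `digest` a name in `manifest`; the content must already exist."""
    if digest not in store:
        raise KeyError(f"{digest} is not in the store; fetch it first")
    manifest[name] = digest


digest = fetch_into_store(b"dataset bytes")  # done once, e.g. from the REPL
user_manifest = {}                           # stands in for a user's Artifacts.toml
bind(user_manifest, "my_pinned_dataset", digest)
```

Binding never touches the network: `fetch_count` stays at 1 however many manifests name the same hash.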

I have started experimenting in an RDataGet.jl branch with adding 1) to RDataGet.jl https://github.com/frankier/RDataGet.jl/commit/e4fe7fc6969210adcfb34722e97af71861c0c40c

I would like to maybe resolve the OhMyArtifacts.jl/Artifacts difference first. I will make an issue there which may point the way forward. Also comments on whether this seems like a good way forward would be useful.

frankier avatar Nov 02 '22 10:11 frankier

> My understanding is also that we cannot realistically use standard artifacts for all datasets that live on CRAN given their number and the fact that they can be updated at any time: that would require updating Artifacts.toml and tagging new releases all the time, right?

GitHub Actions has entered the chat

StoneCypher avatar Nov 02 '22 21:11 StoneCypher

Sorry for barging in, but I'm quite curious about the idea of serving datasets with the package server system. Wouldn't that be too much for the package server to cache? As an end user, what I care most about in a dataset version is that the format stays the same. I won't be worried if some samples are added or deleted, or if there are other kinds of small changes. OTOH, I could generate multiple Wikipedia dump datasets by giving different timestamps, each of which gives a different content hash, but does it make sense to cache them all with the package server?

chengchingwen avatar Nov 03 '22 11:11 chengchingwen

We already serve a huge amount of traffic through the package server system so I'm not worried about serving some medium sized datasets. We limit artifact size to 2GB iirc. While the format may be all that matters to you, others want their code to be reproducible in the sense of getting the same results. Artifacts ensure that because they are immutable and content-addressed.

StefanKarpinski avatar Nov 03 '22 21:11 StefanKarpinski