Cache data locally after download

Open vincentarelbundock opened this issue 11 years ago • 18 comments

Allow users to pass a flag that will cache the data locally after it is downloaded. Goals:

1. Reproducibility: even if the online version disappears, the user will still have a local copy.
2. Quicker download on second request.

Should be accompanied by a standalone cleaning script if applicable.

vincentarelbundock avatar Mar 05 '14 05:03 vincentarelbundock

This is a great idea. Are you thinking an R.cache dependency or something else?

leeper avatar Mar 05 '14 06:03 leeper

+1

christophergandrud avatar Mar 05 '14 08:03 christophergandrud

No need for a dependency. Just let the user specify a local path to save stuff to.

vincentarelbundock avatar Mar 05 '14 12:03 vincentarelbundock

Note that CRAN has reservations about packages writing local files: "Packages should not write in the users’ home filespace, nor anywhere else on the file system apart from the R session’s temporary directory (or during installation in the location pointed to by TMPDIR: and such usage should be cleaned up). Installing into the system’s R installation (e.g., scripts to its bin directory) is not allowed." http://cran.r-project.org/web/packages/policies.html

But perhaps this is not a problem if the package does not do this by default, but only after the user's active choice. Of course, there should be an opportunity for packages that are designed for data access, but we need to be careful about how to implement that. My suggestion is to cache to the local tmpdir.
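The tmpdir suggestion could look roughly like this; the function and argument names here are hypothetical, not part of psData. By default nothing is written outside the R session's temporary directory (which satisfies the CRAN policy quoted above), and any other location must be chosen explicitly by the user.

```r
# Hypothetical caching helper: download a file once and reuse the local copy.
# With dest = NULL it only ever writes to the session's tempdir() (CRAN-safe);
# writing anywhere else requires the user to pass a path explicitly.
cached_download <- function(url, dest = NULL) {
  if (is.null(dest)) {
    dest <- file.path(tempdir(), basename(url))
  }
  if (!file.exists(dest)) {
    download.file(url, destfile = dest, quiet = TRUE)
  }
  dest
}
```

A second call with the same URL then returns the cached path without re-downloading.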

antagomir avatar Mar 05 '14 12:03 antagomir

Just added a file argument to get_data. It allows the user to specify an RData file to save the data frame into. I'll push it up in a few minutes.

(I don't think this is a problem for CRAN, as I have a number of packages that do this; the default is always set to not write the file.)

I do like the cache idea though. This allows the user to not have to think about their file paths, which can differ when they run the function on different computers, for example.

christophergandrud avatar Mar 05 '14 12:03 christophergandrud

Cool. I think there should also be a way to replicate the file processing (e.g. create unique country names, drop duplicate country-years, etc.). So we'd need to save an R script as well.

vincentarelbundock avatar Mar 05 '14 13:03 vincentarelbundock

@vincentarelbundock currently it saves the downloaded and processed data frame as the last step before return(data) at the end of the function. Does this make sense or should we try something else?

christophergandrud avatar Mar 05 '14 13:03 christophergandrud

That's cool, but in my view it's not enough. If you're given only the raw and the processed data, you still might have to reverse engineer to find out what the transformations were. Was that log(variable) or log(variable+1)? Which duplicate countries did we drop, exactly?

I think what makes the most sense would be related to my issue here: https://github.com/rOpenGov/psData/issues/10

You would have 2 files:

  • database_political_institutions.yaml (download url, bibtex cite, etc.)
  • database_political_institutions.R (cleaning script with all transformations)

And a standardized function:

  • get_data(): Parse .yaml file, download data if not already cached, and run R script.
  • If flagged for caching, then copy yaml, raw data, processed data and R script to specified path.
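To make the layout concrete, the metadata file could be as small as this. Every field name and value below is illustrative only, not an agreed format:

```yaml
# database_political_institutions.yaml (all fields illustrative)
name: Database of Political Institutions
url: http://example.org/dpi/raw_data.dta
citation: "@misc{dpi, author = {...}, title = {...}}"
cleaning_script: database_political_institutions.R
```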

vincentarelbundock avatar Mar 05 '14 13:03 vincentarelbundock

I agree that in principle it would be useful to have both raw data and the full script available in a reproducible format. If this gets too messy, however, it would be useful to at least store the package version and function name. This is already rather useful as such, and virtually sufficient for full reproducibility.

antagomir avatar Mar 05 '14 13:03 antagomir

I think your current setup is going to be much messier in the long run. "Silos" for each dataset are the only way I can think of to cleanly separate the cleaning scripts. That's a really simple way to scale the project...

vincentarelbundock avatar Mar 05 '14 13:03 vincentarelbundock

The function calls + package version is effectively a recipe. So for example, having an R script that went:

## Using psData 1.0
library(psData)

# Download data
Main <- get_dpi(vars = 'checks', sha1 = 'foo', file = 'Checks.rda')

# Set as psData
MainP <- as.panel(Main, 'Country', 'Year')

# Transform
# (transformation functions not yet created)

# Drop observations, etc. (not yet created)

It would definitely be helpful to add the package version and call to the metadata in as.panel (we might be able to do it a step earlier at get_dpi). Maybe it's possible to also output a log of the whole process. But I think fundamentally, this is reproducible silos and recipes. Am I missing something?
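Stamping the package version and call onto the returned object could be sketched as below; the helper and attribute names are hypothetical, and inside get_dpi one would call it with match.call():

```r
# Hypothetical sketch: record provenance as attributes on the data frame so
# downstream users can see which psData version and which call produced it.
stamp_provenance <- function(data, call, pkg = "psData") {
  ver <- tryCatch(as.character(utils::packageVersion(pkg)),
                  error = function(e) "unknown")
  attr(data, "psData_version") <- ver
  attr(data, "psData_call") <- deparse(call)
  data
}
```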

Note also, something that hasn't been started yet: @briatte did a good job of creating a way of running basic data sets through get_data to get a basic data frame. But we also need to have a system for data that requires more manipulation to get it into the panel-series data frame format. These were the 'variable builders' in the original version of psData. These would also effectively be recipes, if I understand you correctly.

Thoughts?

christophergandrud avatar Mar 05 '14 14:03 christophergandrud

Here's a repo I quickly put together as a proof of concept to show what I mean more concretely.

https://github.com/vincentarelbundock/dpi

I think you may be right. I guess I'm just suggesting that a slightly different repo structure with explicit silos would be cleaner, more scalable, and much easier to contribute to.

In my experience, generic functions called things like "get_data" tend to try to do too much, and end up being unwieldy. Better to provide some basic functionality, and cleanly delineate which transformations were applied to each dataset.

Even if there is some code duplication, it is better for transparency to have a file called "database_political_institutions.R" that is more or less self-contained and can show the user exactly what is done in the pre-processing stage.

vincentarelbundock avatar Mar 05 '14 14:03 vincentarelbundock

I agree. Unfortunately I am personally too busy building the overall rOpenGov infrastructure right now to contribute much code here. But based on our experiences, I also think that Vincent's suggestion is the optimal way to go, if time allows for others to advance this.

antagomir avatar Mar 05 '14 14:03 antagomir

I think there is some confusion among the various people working on the package.

In the original version, each data set was siloed into its own function. Then it was recommended that these be spun off into separate packages that would be called by get_data. That would, as much as possible, provide a common syntax and output type, but apart from a few shared features (such as panel ID standardisation), each getter/variable-builder silo would follow its own recipe.

However, in subsequent collaborator updates this was reversed, so that the individual getter functions essentially became particular calls of get_data.

I think this is the change that @vincentarelbundock is arguing against. I largely agree. However, I see the logic in having a core set of functionalities in get_data that basically deals with 'easy' data sets, i.e. those that can be downloaded from one URL and are already in panel-series format. This covers a lot of different data sets and will drastically cut down on code duplication.

I do think get_data needs to be changed so that it calls other recipe functions to handle non-vanilla data sets. The other option would be to have a different function called build_data that would pull in recipes for non-standard data.

I would also recommend against baking in things like transforming and recoding the data in the get function beyond simply getting the data into a standardised panel-series format. These other decisions should be left to individual users.

Finally, it would be great to work on a way to dynamically generate metadata like BibTeX entries and notes on the psData version. It will be easier over the long term than updating these manually.
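For the package-level part, base R can already generate a BibTeX entry dynamically, so that entry at least never needs manual updating; something similar could be built per dataset from the metadata files, though that part would be new code. A minimal sketch of the existing base R route (pkg_bibtex is a hypothetical name):

```r
# Generate a BibTeX entry for an installed package from its citation metadata,
# using only base/utils functions. No manual updating required.
pkg_bibtex <- function(pkg) {
  utils::toBibtex(utils::citation(pkg))
}
```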

christophergandrud avatar Mar 05 '14 14:03 christophergandrud

Thanks @christophergandrud , that's useful background info. I don't want to insist too much because I don't have a good sense of the 10,000ft view, but I agree that a common get_data user-facing function would be great.

One simple approach would be to have get_data be a simple wrapper that just calls the recipe cleaning script and returns the data. For "vanilla" datasets, the cleaning script will just be a single call to read.spss(url) or some such. That way, developers don't have to make distinctions between clean and non-clean data. Each dataset has an associated loading/cleaning script. Some of those are just more complicated than others, but everything is very transparent and easy to see.

Plus there is a single point of entry for all datasets: a get_data(...) function where the three dots allow users to (potentially) pass arbitrary parameters to the cleaning scripts. Minimal overhead, standardized, high flexibility.
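A thin dispatching wrapper along these lines would keep the single point of entry while leaving the real work to each recipe. All names here are hypothetical, and recipes are modelled as functions in a list purely for illustration:

```r
# Hypothetical registry of per-dataset cleaning recipes. A "vanilla" recipe is
# a one-liner; complex datasets get longer functions, but the user always goes
# through get_data().
recipes <- list(
  example_csv = function(url, ...) utils::read.csv(url, ...)
)

get_data <- function(dataset, ...) {
  recipe <- recipes[[dataset]]
  if (is.null(recipe)) stop("No recipe for dataset: ", dataset)
  recipe(...)  # the dots pass arbitrary arguments through to the recipe
}
```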

The info to include in the accompanying yaml files can be agreed on, and enforced at the Pull Request level.

vincentarelbundock avatar Mar 05 '14 14:03 vincentarelbundock

I agree about minimizing data transformations. My log example was a bad one. Think more about dropping YUG after break-up.

vincentarelbundock avatar Mar 05 '14 14:03 vincentarelbundock

@vincentarelbundock I really like your suggestion for the user facing get_data. That should be the way to go.

We should have a separate discussion on how to deal with the 'problem countries' like Yugoslavia and East/West Germany. These are important things to consider because they affect merging, which is going to be a core psData functionality.

Right now there is a really crude method of notifying the user about/dropping duplicates. But this is kind of unsatisfactory.

christophergandrud avatar Mar 05 '14 14:03 christophergandrud

Yes, get_data as a wrapper for the full versions should be the way to go. We recently had a related discussion on the rOpenGov mailing list. It would be good to harmonize these conventions across different packages, and the model discussed here is the best candidate to me so far. We still need to think about how to implement caching.

antagomir avatar Mar 05 '14 23:03 antagomir