
Cache filtered tables (suggested functionality)

pompm opened this issue on May 16, 2019 · 6 comments

Hi, for offline work, or to minimize hitting the web database while debugging scripts, I think the ability to cache filtered tables would be useful. (The filters are also cached as a separate file and simply tested for equality.) I attach a suggestion. Marek

```r
get_eurostat_flt <- function(file, filters, cache_dir = NULL, ...) {
  if (is.null(cache_dir)) {
    cache_dir <- getOption("eurostat_cache_dir", NULL)
    if (is.null(cache_dir)) {
      cache_dir <- file.path(tempdir(), "eurostat")
      if (!file.exists(cache_dir)) dir.create(cache_dir)
    }
  }
  if (!file.exists(cache_dir)) stop("Cache dir ", cache_dir, " does not exist.")
  cache_file <- file.path(cache_dir, paste0(file, ".rds"))

  # New functionality: cache the filters alongside the data and compare.
  cache_filters <- file.path(cache_dir, paste0(file, "_flt", ".rds"))
  if (file.exists(cache_filters)) {
    flt <- readRDS(cache_filters)
  } else {
    flt <- NA
  }
  if (identical(flt, filters) && file.exists(cache_file)) {
    tmp <- readRDS(cache_file)
    message("Reading cache file ", cache_file)
  } else {
    # Name the argument so it is not matched positionally after `...`
    tmp <- get_eurostat(..., filters = filters)
    saveRDS(tmp, cache_file)
    message("Table is cached at ", cache_file)
    saveRDS(filters, cache_filters)
    message("Filters are cached at ", cache_filters)
  }
  return(tmp)
}
```

Usage:

```r
get_eurostat_flt(
  file = "cz_sk",
  id = "educ_thflds",
  time_format = "num",
  filters = list(
    indic_ed = "TC02_12",
    geo = c("CZ", "SK"),
    time = 2009:2010
  )
)
```

pompm avatar May 16 '19 04:05 pompm

This might be useful. One problem is that (to my understanding) packages should not write to the user's disk, per CRAN guidelines. If caching can be implemented in working memory, that could help.

antagomir avatar May 16 '19 06:05 antagomir

The current caching already writes to the user's disk.

To me it is a bit unclear what this new functionality would add. I think get_eurostat() already caches, with or without filters.

I also somewhat regret that we added caching inside the get_eurostat() function, as it writes to the user's disk by default.

jhuovari avatar May 16 '19 06:05 jhuovari

Oh, I had forgotten this. Nice that we are going through it again. We could consider moving caching outside the function.

antagomir avatar May 16 '19 07:05 antagomir

To @jhuovari: I think the get_eurostat() function does not cache tables when filters are used; it contains: `if (is.null(filters) || filters != "none") { cache <- FALSE }`

pompm avatar May 16 '19 11:05 pompm

OK, I forgot that.

The reason was, I think, that the cache check would have had to change for filtered data. Currently the arguments are appended to the file name, and that name is checked to see whether the data is already in the cache. However, the filters argument could be too long for a file name.
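
For illustration (this file-name scheme is assumed, not the package's literal code), pasting the query arguments into the cache file name looks like this, and the name grows without bound as the filters grow:

```r
# Assumed scheme, for illustration only: query arguments pasted into
# the cache file name. The name grows with the length of the filters.
args <- list(id = "educ_thflds", time_format = "num",
             geo = c("CZ", "SK"), time = 2009:2010)
fname <- paste0(paste(unlist(args), collapse = "_"), ".rds")
fname
#> [1] "educ_thflds_num_CZ_SK_2009_2010.rds"
```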

If you could open a pull request, I think your solution would be fine. Right, @antagomir?

jhuovari avatar May 16 '19 11:05 jhuovari

Yes. Looking fwd!

antagomir avatar Jan 26 '20 09:01 antagomir

I'm returning to this vintage issue from 2019, since I have been working on it now in 2023:

I have been thinking about how to rework caching so that it satisfies the following conditions:

  • Caching would be available for all downloads, not only whole datasets (bulk downloads, or complete datasets downloaded from the JSON API without the filters argument), as mentioned in this issue
  • The package would check whether a superset of a filtered dataset (essentially a bulk download or a complete dataset from the JSON API) is already available, making it unnecessary to download the filtered dataset from the API, as mentioned in issue #257
  • Fix, or rather remove, the bits of code that caused confusion about when datasets are filtered and when they are not, as mentioned in issue #258

While I loved the elegant solution of embedding parts of the query in the cache data file name, as mentioned above by @jhuovari, it made caching filtered datasets unfeasible, since filters can be of any length. Modern filesystems are not that finicky about long file names, but even the 260-character maximum on Windows is quite long in my opinion. By hashing the query into an MD5 string we can circumvent this limitation and keep mostly the same code for checking whether a file has already been cached.
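
A minimal sketch of the hashing idea, using the digest package; the helper name and exact key construction here are illustrative assumptions, not the package's actual implementation:

```r
library(digest)  # digest() hashes arbitrary R objects

# Hypothetical helper, illustrative only: derive a fixed-length cache file
# name from the whole query, however long the filters are.
cache_file_for <- function(id, filters, cache_dir = tempdir()) {
  # Serializing the query and hashing it yields a constant 32-character
  # MD5 string, so the file name no longer depends on the filter length.
  hash <- digest(list(id = id, filters = filters), algo = "md5")
  file.path(cache_dir, paste0(hash, ".rds"))
}

cache_file_for("educ_thflds",
               filters = list(indic_ed = "TC02_12",
                              geo = c("CZ", "SK"),
                              time = 2009:2010))
```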

By saving queries to a separate .json file in the temp folder, we can check whether a superset of a filtered dataset has already been saved to disk; if so, it can be filtered locally instead.
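
A rough sketch of that bookkeeping, assuming jsonlite for the metadata file; the file layout and function names are hypothetical, not the package's actual scheme:

```r
library(jsonlite)  # write_json() / read_json() for the query metadata

# Hypothetical bookkeeping, illustrative only: store each query's metadata
# in a .json file next to the cached data.
record_query <- function(id, filters = NULL, cache_dir = tempdir()) {
  entry <- list(id = id, filtered = !is.null(filters), filters = filters)
  write_json(entry, file.path(cache_dir, paste0(id, "_query.json")),
             auto_unbox = TRUE, null = "null")
}

# An unfiltered (bulk) download of the same dataset is a superset of any
# filtered query, so a matching filtered request can be served from disk.
has_cached_superset <- function(id, cache_dir = tempdir()) {
  path <- file.path(cache_dir, paste0(id, "_query.json"))
  file.exists(path) && isFALSE(read_json(path)$filtered)
}

record_query("educ_thflds")         # bulk download, no filters
has_cached_superset("educ_thflds")  # TRUE
```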

By making caching possible for all downloads instead of just some, we remove the confusion caused by the cache = TRUE argument being silently overridden whenever the filters argument was something other than NULL or "none".
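
For illustration, with the rework a filtered call like the following should hit the API once and be served from the local cache on repeated runs (get_eurostat() and its cache/filters arguments are the package's real interface; the dataset id and filters are just examples):

```r
library(eurostat)

# First call downloads and caches; a repeated call with an identical
# query should be read back from the local cache.
dat <- get_eurostat("educ_thflds",
                    filters = list(geo = c("CZ", "SK"), time = 2009:2010),
                    cache = TRUE)
dat <- get_eurostat("educ_thflds",
                    filters = list(geo = c("CZ", "SK"), time = 2009:2010),
                    cache = TRUE)
```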

See draft PR #267 for code changes and testing. Any feedback much appreciated.

pitkant avatar Aug 11 '23 07:08 pitkant

Closed with the CRAN release of package version 4.0.0

pitkant avatar Dec 20 '23 08:12 pitkant