
Extract only certain files from zip

Open ggrothendieck opened this issue 1 year ago • 3 comments

I currently use the function below to extract only the csv files from a zip file. Is there a more direct way of doing this? It would be nice if which= could be a pattern (regular expression or glob).

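library(rio)

# Import only the csv files contained in the zip archive x
import_csvs_from_zip <- function(x, ...) {
  # .list_archive() is accessed with ::: because rio does not export it
  filenames <- rio:::.list_archive(x)
  csv_names <- grep("\\.csv$", filenames, value = TRUE)
  import_list(x, which = csv_names, ...)
}

import_csvs_from_zip("myzip.zip", rbind = TRUE)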

ggrothendieck, Dec 16 '24 14:12

@ggrothendieck I agree that it's a nice idea. However, it would significantly change how import_list() behaves.

https://github.com/gesistsa/rio/blob/f1094bf636f95f5dfcb05408d2ca0960fa792c8e/R/import_list.R#L4

If this has to be done through which, it is a breaking change that we can only implement in the next major version. Another approach is to add a separate parameter, e.g. a pattern argument similar to the one in list.files().

chainsawriot, Dec 20 '24 10:12

Some options to maintain backwards compatibility would be:

  • have a different parameter, e.g. regex=, which works like which but takes a regular expression
  • have another argument, such as fixed = TRUE, which affects how which is interpreted
  • if the which argument has class "AsIs" (or some other agreed-upon class), interpret it as a regular expression, e.g. which = I(".*\\.csv$"). The tidyverse does something similar but with its own class and wrapper, stringr::regex(...); for example, see the delim= argument in ?separate_wider_delim
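
As a rough illustration of the "AsIs" option, here is a minimal sketch in base R. The helper name resolve_which() is hypothetical and not part of rio, and it assumes the archive's member names have already been listed as a character vector.

```r
# Hypothetical helper (not part of rio): treat `which` as a regular
# expression only when it is wrapped in I(); otherwise return it unchanged
# so that existing name and index usage keeps working.
resolve_which <- function(which, filenames) {
  if (inherits(which, "AsIs") && is.character(which)) {
    # unclass() drops the "AsIs" marker before the pattern is used
    return(grep(unclass(which), filenames, value = TRUE))
  }
  which
}

filenames <- c("a.csv", "b.txt", "data/c.csv")
csv_only  <- resolve_which(I("\\.csv$"), filenames)  # regex: keeps the csv names
by_name   <- resolve_which("b.txt", filenames)       # plain name: passed through
by_index  <- resolve_which(2, filenames)             # numeric index: passed through
```

Because anything not wrapped in I() is returned untouched, a call such as which = "b.txt" or which = 2 would behave exactly as it does today under this sketch.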

ggrothendieck, Dec 20 '24 14:12

Another possibility would be to allow which= to be a logical-valued function that is applied to each name in the zip. Only the names for which the function returns TRUE are read. This would also be backwards compatible (if which= is a function it acts as described, and if not it acts as it does now), and it is powerful because the function can use any approach. For example, the user could specify any of these to read only csv files:

which = \(x) endsWith(x, ".csv")
which = \(x) grepl("\\.csv$", x)
which = \(x) grepl(glob2rx("*.csv"), x)
which = \(x) substring(x, nchar(x) - 3) == ".csv"
which = \(x) tools::file_ext(x) == "csv"

If FUN is any of these functions, or any other predicate, and x is a character vector of all the names in the zip, then the code below could be used internally to determine which names to read.

Filter(FUN, x)
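
To make the dispatch concrete, here is a minimal sketch of how a function-valued which could coexist with the current behaviour. The helper select_members() is hypothetical and not part of rio; it only works on a character vector of names and does not read any files.

```r
# Hypothetical helper (not part of rio): a function-valued `which` is used
# as a predicate over the member names; any other value is returned as is.
select_members <- function(which, filenames) {
  if (is.function(which)) {
    return(Filter(which, filenames))
  }
  which
}

filenames <- c("a.csv", "b.txt", "data/c.csv")
by_suffix <- select_members(\(x) endsWith(x, ".csv"), filenames)
by_ext    <- select_members(\(x) tools::file_ext(x) == "csv", filenames)
unchanged <- select_members(1, filenames)  # non-function: passed through
```

Both predicates select the same two csv names here. The \(x) shorthand needs R 4.1 or later; function(x) works in older versions.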

ggrothendieck, Dec 22 '24 12:12