Extract only certain files from zip
I am currently doing this to only extract csv files from a zip file and wondered if there is a more direct way of doing this? Would have been nice if which= could be a pattern (regular expression or glob).
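library(rio)

import_csvs_from_zip <- function(x, ...) {
  # .list_archive() is an unexported rio internal, hence the triple colon
  filenames <- rio:::.list_archive(x)
  # keep only the names ending in .csv
  csv_names <- grep("\\.csv$", filenames, value = TRUE)
  import_list(x, which = csv_names, ...)
}

import_csvs_from_zip("myzip.zip", rbind = TRUE)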
@ggrothendieck I agree that that's a nice idea. It would significantly change how import_list() behaves.
https://github.com/gesistsa/rio/blob/f1094bf636f95f5dfcb05408d2ca0960fa792c8e/R/import_list.R#L4
We can only implement this kind of breaking change in the next major version, if we must do it through which. Another approach is to add a separate parameter, e.g. pattern, similar to the one in list.files().
Some options to maintain backwards compatibility would be:
- have a different parameter, e.g. regex=, which is like which but uses regex
- have another argument such as fixed = TRUE which affects how which is interpreted
- If the which argument has class "AsIs" (or some other decided upon class) then it would be interpreted as a regular expression, e.g. which = I(".*\\.csv$"). tidyverse does something like that but has its own class and wrapper, stringr::regex(...). For example, see the delim= argument in ?separate_wider_delim
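To make the third option concrete, here is a rough sketch of how the dispatch might look. resolve_which() is a hypothetical helper invented for illustration, not part of rio, and it works on a plain character vector of archive member names rather than on a real zip file.

```r
# Hypothetical helper (not part of rio): decide which archive members to read.
# `filenames` is a character vector of names found in the archive.
resolve_which <- function(filenames, which) {
  if (inherits(which, "AsIs")) {
    # which = I("...") is treated as a regular expression.
    # This check must come first, because an AsIs string is also character.
    return(grep(unclass(which), filenames, value = TRUE))
  }
  if (is.character(which)) {
    # Plain strings keep their current meaning: exact member names
    return(filenames[filenames %in% which])
  }
  # Anything else is treated as a positional index
  filenames[which]
}

files <- c("a.csv", "b.txt", "dir/c.csv")
resolve_which(files, I("\\.csv$"))  # regular expression
resolve_which(files, "b.txt")       # exact name, as today
resolve_which(files, 2)             # position, as today
```

Because plain character and numeric values fall through to the existing interpretation, only callers who opt in with I() would see the new behaviour.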
Another possibility would be to allow which= to specify a logical valued function that is applied to each name in the zip. Only those names for which the function returns TRUE are read. This would also be backwards compatible (if which= is a function it acts as described and if not it acts as it does now) and is powerful since it allows for many approaches within the function. For example the user could specify any of these to only read csv files:
which = \(x) endsWith(x, ".csv")
which = \(x) grepl("\\.csv$", x)
which = \(x) grepl(glob2rx("*.csv"), x)
which = \(x) substring(x, nchar(x) - 3) == ".csv"
which = \(x) tools::file_ext(x) == "csv"
If FUN is any of these, or another such function, and x is a character vector of all names from the zip, then the code below could be used internally to determine which names to read.
Filter(FUN, x)
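Putting that together, a minimal sketch of the backwards-compatible dispatch could look like the following. select_names() is a hypothetical name used only for illustration, not an existing rio function, and the non-function branches are an assumption about how the current behaviour might be preserved.

```r
# Hypothetical helper (not part of rio): a function-valued `which` acts as a
# predicate over member names; any other value is handled as before.
select_names <- function(filenames, which) {
  if (is.function(which)) {
    # Keep only the names for which the predicate returns TRUE
    return(Filter(which, filenames))
  }
  if (is.character(which)) {
    # Assumed existing behaviour: exact member names
    return(filenames[filenames %in% which])
  }
  # Assumed existing behaviour: positional index
  filenames[which]
}

files <- c("data/a.csv", "notes.txt", "b.csv")
select_names(files, function(x) endsWith(x, ".csv"))
select_names(files, function(x) tools::file_ext(x) == "csv")
select_names(files, "notes.txt")
```

The sketch uses function(x) rather than \(x) so that it also runs on R versions older than 4.1.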