Possible to read from a multi-file archive?
I'm trying to read the contents of an archive (~3GB) with many little files in it (~100k) without decompressing the archive. I don't have the ability to reconfigure what the archive looks like. Here's a reprex of what the archives look like. The directory structure is the same and the files all have the same columns...
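library(archive)
library(readr)
# Write three identical CSV files into a data/ directory
dir.create("data", showWarnings = FALSE)
write_csv(iris, "data/iris_a.csv")
write_csv(iris, "data/iris_b.csv")
write_csv(iris, "data/iris_c.csv")
# Bundle the directory and its files into a gzip-compressed tar archive
archive_write_files("data.tar.gz",
                    c("data/",
                      "data/iris_a.csv",
                      "data/iris_b.csv",
                      "data/iris_c.csv"))
# List the entries in the archive
archive("data.tar.gz")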
I'd like to do something like...
read_csv(c("data/iris_a.csv", "data/iris_b.csv", "data/iris_c.csv"), id = "file")
... but without first unpacking data.tar.gz.
If I do...
> read_csv(archive_read("data.tar.gz"), id = "file")
Rows: 0 Columns: 1
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 0 × 1
# … with 1 variable: file <chr>
# ℹ Use `colnames()` to see all variable names
That returns an empty tibble because archive_read opens the first entry by default, which is the directory itself. I see that I can skip the first entry in the archive and instead do...
> read_csv(archive_read("data.tar.gz", 2), id = "file")
Rows: 150 Columns: 6
── Column specification ──────────────────────────────────────────────────
Delimiter: ","
chr (1): Species
dbl (4): Sepal.Length, Sepal.Width, Petal.Length, Petal.Width
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 150 × 6
   file                         Sepal.Length Sepal.Width Petal…¹ Petal…² Species
   <chr>                               <dbl>       <dbl>   <dbl>   <dbl> <chr>
 1 archive_read(data.tar.gz)[2]          5.1         3.5     1.4     0.2 setosa
 2 archive_read(data.tar.gz)[2]          4.9         3       1.4     0.2 setosa
 3 archive_read(data.tar.gz)[2]          4.7         3.2     1.3     0.2 setosa
 4 archive_read(data.tar.gz)[2]          4.6         3.1     1.5     0.2 setosa
 5 archive_read(data.tar.gz)[2]          5           3.6     1.4     0.2 setosa
 6 archive_read(data.tar.gz)[2]          5.4         3.9     1.7     0.4 setosa
 7 archive_read(data.tar.gz)[2]          4.6         3.4     1.4     0.3 setosa
 8 archive_read(data.tar.gz)[2]          5           3.4     1.5     0.2 setosa
 9 archive_read(data.tar.gz)[2]          4.4         2.9     1.4     0.2 setosa
10 archive_read(data.tar.gz)[2]          4.9         3.1     1.5     0.1 setosa
# … with 140 more rows, and abbreviated variable names ¹Petal.Length,
# ²Petal.Width
# ℹ Use `print(n = ...)` to see more rows
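For what it's worth, the `file` argument of archive_read can, as far as I can tell from its documentation, be given as an entry name as well as a position, which avoids counting entries by hand. A minimal sketch, assuming a two-file archive rebuilt in a temporary directory (the `old_wd` and `iris_b` names are just for illustration):

```r
library(archive)
library(readr)

# Rebuild a small archive like the reprex, inside a temporary directory.
old_wd <- setwd(tempdir())
dir.create("data", showWarnings = FALSE)
write_csv(iris, "data/iris_a.csv")
write_csv(iris, "data/iris_b.csv")
archive_write_files("data.tar.gz", c("data/iris_a.csv", "data/iris_b.csv"))

# Open one entry by its path inside the archive rather than by position.
iris_b <- read_csv(
  archive_read("data.tar.gz", file = "data/iris_b.csv"),
  show_col_types = FALSE
)

# Restore the original working directory.
setwd(old_wd)
```

This still reads a single entry per call, so it does not solve the multi-file case on its own.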
Building off of this, I could list the contents of the archive and then step through each of the files with map_dfr...
library(purrr)
files <- archive("data.tar.gz")[["path"]][-1]
names(files) <- files
map_dfr(files, ~read_csv(archive_read("data.tar.gz", .x)), .id = "file")
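A slightly more defensive variant of the same idea, as a sketch: instead of dropping the first entry with `[-1]`, it keeps only the paths that end in `.csv`, so it should not matter where directory entries appear in the listing. It assumes every data file has a `.csv` extension, and it rebuilds the reprex archive in a temporary directory so that it runs on its own; `entries`, `csvs`, and `all_iris` are illustrative names.

```r
library(archive)
library(readr)
library(purrr)

# Rebuild the reprex archive inside a temporary directory.
old_wd <- setwd(tempdir())
dir.create("data", showWarnings = FALSE)
csvs <- file.path("data", c("iris_a.csv", "iris_b.csv", "iris_c.csv"))
walk(csvs, ~ write_csv(iris, .x))
archive_write_files("data.tar.gz", c("data/", csvs))

# List the entries once, then keep only the paths ending in .csv.
# unique() guards against an entry being listed more than once.
entries <- archive("data.tar.gz")
files <- unique(grep("\\.csv$", entries$path, value = TRUE))
names(files) <- files

# Read each selected entry by name and stack the results,
# recording the entry path in a "file" column.
all_iris <- map_dfr(
  files,
  ~ read_csv(archive_read("data.tar.gz", file = .x), show_col_types = FALSE),
  .id = "file"
)

# Restore the original working directory.
setwd(old_wd)
```

This only makes the file selection more robust. It still calls archive_read once per file, so it does not remove the per-file overhead.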
Is there an easier way to read everything in from the archive without the map_dfr step, and without incurring any other overhead from using both archive and archive_read?