Tidy object from src_fst
My goal is to treat several "fst" files "as if" they were a single entity, and to act on all the files in a single step.
This would require being able to loop over the individual tables within a src_fst() output, but I have found this difficult.
This may well be more of an R question (how to dig into the output structure of src_fst()) than an fstplyr question, but is there a way to reach into the output of fstplyr::src_fst() and grab the list of files?
```r
library(tidyverse)
library(fst)
library(fstplyr)
write_fst(mtcars, paste0("mtcars", 1990, ".fst"))
write_fst(mtcars, paste0("mtcars", 1991, ".fst"))
write_fst(mtcars, paste0("mtcars", 1992, ".fst"))
junk <- fstplyr::src_fst(here::here())
str(junk)
```
The structure of junk:
```
List of 2
 $ path: chr "/home/rob/Projects/1186684/anomaly"
 $ meta:List of 3
  ..$ mtcars1990:List of 7
  .. ..$ path           : chr "/home/rob/Projects/1186684/anomaly/mtcars1990.fst"
  .. ..$ nrOfRows       : num 32
  .. ..$ keys           : NULL
  .. ..$ columnNames    : chr [1:11] "mpg" "cyl" "disp" "hp" ...
  .. ..$ columnBaseTypes: int [1:11] 5 5 5 5 5 5 5 5 5 5 ...
  .. ..$ keyColIndex    : NULL
  .. ..$ columnTypes    : int [1:11] 10 10 10 10 10 10 10 10 10 10 ...
  .. ..- attr(*, "class")= chr "fstmetadata"
  ..$ mtcars1991:List of 7
  .. ..$ path           : chr "/home/rob/Projects/1186684/anomaly/mtcars1991.fst"
  .. ..$ nrOfRows       : num 32
  .. ..$ keys           : NULL
  .. ..$ columnNames    : chr [1:11] "mpg" "cyl" "disp" "hp" ...
  .. ..$ columnBaseTypes: int [1:11] 5 5 5 5 5 5 5 5 5 5 ...
  .. ..$ keyColIndex    : NULL
  .. ..$ columnTypes    : int [1:11] 10 10 10 10 10 10 10 10 10 10 ...
  .. ..- attr(*, "class")= chr "fstmetadata"
  ..$ mtcars1992:List of 7
  .. ..$ path           : chr "/home/rob/Projects/1186684/anomaly/mtcars1992.fst"
  .. ..$ nrOfRows       : num 32
  .. ..$ keys           : NULL
  .. ..$ columnNames    : chr [1:11] "mpg" "cyl" "disp" "hp" ...
  .. ..$ columnBaseTypes: int [1:11] 5 5 5 5 5 5 5 5 5 5 ...
  .. ..$ keyColIndex    : NULL
  .. ..$ columnTypes    : int [1:11] 10 10 10 10 10 10 10 10 10 10 ...
  .. ..- attr(*, "class")= chr "fstmetadata"
 - attr(*, "class")= chr [1:2] "src_fst" "src"
```
Ideally I'd like to have
listoffiles <- some_magic_function(src_fst(path))
Alternatively, is there a way to tidy the src_fst output structure to accomplish this end?
Thanks.
I tried this, which was ultimately unsuccessful and also unnecessarily re-consults the filesystem:
```r
listoffiles <- fs::dir_info(here::here()) %>%
  select(path) %>%
  filter(str_detect(path, "fst")) %>%
  mutate(filename = str_sub(basename(path), 1, -5))
```
hoping that I could pass listoffiles to map and fstplyr, e.g.
map_df(listoffiles, fstplyr::tbl)
but this failed (my purrr::map syntax was wrong) and also required re-consulting the filesystem. I was hoping to pluck everything out of src_fst() instead.
The big idea is to take multiple fst objects and to treat them as one dataset for analysis purposes.
Hi @aetiologicCanada,
Thanks for sharing your code. For your case, you could get the file names using:
sapply(junk$meta, function(x) {x$path})
#> mtcars1990
#> "C:\\Users\\mklk\\AppData\\Local\\Temp\\RtmpmmtaR0\\reprex2c7048527a3b\\mtcars1990.fst"
#> mtcars1991
#> "C:\\Users\\mklk\\AppData\\Local\\Temp\\RtmpmmtaR0\\reprex2c7048527a3b\\mtcars1991.fst"
#> mtcars1992
#> "C:\\Users\\mklk\\AppData\\Local\\Temp\\RtmpmmtaR0\\reprex2c7048527a3b\\mtcars1992.fst"
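Building on those paths, here is a minimal sketch of one way to pull every file into a single data frame. It assumes the `src_fst()` output has the `meta` / `path` layout shown in the `str()` output earlier in the thread; the scratch directory `demo_dir` and the `source` column name are made up for illustration, and reading everything into memory is only sensible when the combined data fits in RAM.

```r
library(fst)
library(fstplyr)

# Recreate the three example files in a scratch directory (hypothetical location)
demo_dir <- file.path(tempdir(), "fst_demo")
dir.create(demo_dir, showWarnings = FALSE)
for (year in 1990:1992) {
  write_fst(mtcars, file.path(demo_dir, paste0("mtcars", year, ".fst")))
}

junk <- src_fst(demo_dir)

# Named character vector of file paths, one entry per table in the source
paths <- vapply(junk$meta, function(x) x$path, character(1))

# Read each file and stack the results; the list names (table names)
# end up in the illustrative "source" column
all_cars <- dplyr::bind_rows(lapply(paths, read_fst), .id = "source")
```

`vapply()` is used instead of `sapply()` only so that the result is guaranteed to be a character vector; the extraction itself is the same.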
Of course, operations on batches of tibbles are fundamentally different from operations on single tables. Many operations have a map-reduce implementation that leads to identical results, but for others that's not so obvious (think median() :-)).
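To make the map-reduce point concrete, here is a small base R sketch with toy numbers (not data from the thread). The two vectors in `chunks` stand in for two separate files: the mean can be rebuilt exactly from per-chunk sums and counts, while the median of the per-chunk medians does not match the overall median.

```r
# Two toy chunks standing in for two separate fst files
chunks <- list(a = c(1, 1, 1), b = c(2, 3, 4, 5, 6, 7, 8))

# Map: per-chunk partial results; Reduce: add them up element-wise
partial <- lapply(chunks, function(x) c(sum = sum(x), n = length(x)))
totals <- Reduce(`+`, partial)
chunked_mean <- totals[["sum"]] / totals[["n"]]

# Naive attempt at the same trick for the median
median_of_medians <- median(vapply(chunks, median, numeric(1)))

chunked_mean               # 3.8, same as mean(unlist(chunks))
median_of_medians          # 3
median(unlist(chunks))     # 3.5
```

So a per-file summary followed by a combine step works for sums, counts and means, but a median needs either all the values together or a different algorithm.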