Tidy object from src_fst

Open ghost opened this issue 7 years ago • 2 comments

My goal is to treat several "fst" files "as if" they were a single entity, and to act on all the files in a single step.

This would require being able to loop over the individual tables within a src_fst() output, but I have found this difficult.

The problem may well be more of an R question (how to dig into the output structure from src_fst()) than a fstplyr question, but is there a way to reach into the output of fstplyr::src_fst() and grab the list of files?

library(tidyverse)
library(fst)
library(fstplyr)

write_fst(mtcars, paste0("mtcars", 1990, ".fst"))
write_fst(mtcars, paste0("mtcars", 1991, ".fst"))
write_fst(mtcars, paste0("mtcars", 1992, ".fst"))

junk <- fstplyr::src_fst(here::here())

str(junk)

The structure of junk:

List of 2
 $ path: chr "/home/rob/Projects/1186684/anomaly"
 $ meta:List of 3
  ..$ mtcars1990:List of 7
  .. ..$ path           : chr "/home/rob/Projects/1186684/anomaly/mtcars1990.fst"
  .. ..$ nrOfRows       : num 32
  .. ..$ keys           : NULL
  .. ..$ columnNames    : chr [1:11] "mpg" "cyl" "disp" "hp" ...
  .. ..$ columnBaseTypes: int [1:11] 5 5 5 5 5 5 5 5 5 5 ...
  .. ..$ keyColIndex    : NULL
  .. ..$ columnTypes    : int [1:11] 10 10 10 10 10 10 10 10 10 10 ...
  .. ..- attr(*, "class")= chr "fstmetadata"
  ..$ mtcars1991:List of 7
  .. ..$ path           : chr "/home/rob/Projects/1186684/anomaly/mtcars1991.fst"
  .. ..$ nrOfRows       : num 32
  .. ..$ keys           : NULL
  .. ..$ columnNames    : chr [1:11] "mpg" "cyl" "disp" "hp" ...
  .. ..$ columnBaseTypes: int [1:11] 5 5 5 5 5 5 5 5 5 5 ...
  .. ..$ keyColIndex    : NULL
  .. ..$ columnTypes    : int [1:11] 10 10 10 10 10 10 10 10 10 10 ...
  .. ..- attr(*, "class")= chr "fstmetadata"
  ..$ mtcars1992:List of 7
  .. ..$ path           : chr "/home/rob/Projects/1186684/anomaly/mtcars1992.fst"
  .. ..$ nrOfRows       : num 32
  .. ..$ keys           : NULL
  .. ..$ columnNames    : chr [1:11] "mpg" "cyl" "disp" "hp" ...
  .. ..$ columnBaseTypes: int [1:11] 5 5 5 5 5 5 5 5 5 5 ...
  .. ..$ keyColIndex    : NULL
  .. ..$ columnTypes    : int [1:11] 10 10 10 10 10 10 10 10 10 10 ...
  .. ..- attr(*, "class")= chr "fstmetadata"
 - attr(*, "class")= chr [1:2] "src_fst" "src"

Ideally I'd like to have

listoffiles <- some_magic_function(src_fst(path))

Alternatively, is there a way to tidy the src_fst output structure to accomplish this end?

Thanks.

ghost, Feb 19 '19 21:02

I tried this, which was ultimately unsuccessful and also re-consults the filesystem unnecessarily:

listoffiles <- fs::dir_info(here::here()) %>%
  select(path) %>%
  filter(str_detect(path, "\\.fst$")) %>%             # keep only files ending in .fst
  mutate(filename = str_sub(basename(path), 1, -5))   # drop the ".fst" extension

hoping that I could pass listoffiles to map and fstplyr, e.g.

map_df(listoffiles, fstplyr::tbl)

but this both failed (my purrr::map syntax was wrong) and required re-consulting the filesystem. I was hoping to pluck everything out of src_fst() instead.

The big idea is to take multiple fst objects and to treat them as one dataset for analysis purposes.

aetiologicCanada, Feb 19 '19 21:02

Hi @aetiologicCanada,

thanks for sharing your code. For your case, you can get the file paths using:

sapply(junk$meta, function(x) {x$path})

#>                                                                              mtcars1990 
#> "C:\\Users\\mklk\\AppData\\Local\\Temp\\RtmpmmtaR0\\reprex2c7048527a3b\\mtcars1990.fst" 
#>                                                                              mtcars1991 
#> "C:\\Users\\mklk\\AppData\\Local\\Temp\\RtmpmmtaR0\\reprex2c7048527a3b\\mtcars1991.fst" 
#>                                                                              mtcars1992 
#> "C:\\Users\\mklk\\AppData\\Local\\Temp\\RtmpmmtaR0\\reprex2c7048527a3b\\mtcars1992.fst"
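
To connect this back to the original goal, here is a minimal sketch of how those paths could be used to read the files and stack them into one data frame. The helper name `fst_paths()` is hypothetical (it is not part of fstplyr), and `src` below is a hand-built stand-in that mimics only the `$meta[[i]]$path` fields visible in the `str()` output, so the example runs without fstplyr. It assumes the files share the same columns and fit in memory together.

```r
library(fst)

# Hypothetical helper (not part of fstplyr): pull the file paths out of a
# src_fst-like object whose $meta element holds one entry per file.
fst_paths <- function(src) {
  vapply(src$meta, function(m) m$path, character(1))
}

# Write three small example files into a temporary directory.
dir <- tempfile("fst_demo")
dir.create(dir)
years <- 1990:1992
files <- file.path(dir, paste0("mtcars", years, ".fst"))
for (f in files) write_fst(mtcars, f)

# Stand-in for the src_fst() result, mimicking only the fields used here.
src <- list(
  path = dir,
  meta = setNames(lapply(files, function(p) list(path = p)),
                  paste0("mtcars", years))
)

paths <- fst_paths(src)  # named character vector, one path per table

# Read every file and stack the rows, tagging each row with its source table.
tables <- Map(function(p, nm) cbind(source = nm, read_fst(p)),
              paths, names(paths))
combined <- do.call(rbind, unname(tables))
```

With a real `src_fst()` object, `fst_paths(junk)` should behave like the `sapply()` call above, with `vapply()` additionally checking that every entry is a single string. Reading everything eagerly gives up the on-disk laziness, so this is only reasonable when the combined data is small enough.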

Of course, operations on batches of tibbles are fundamentally different from operations on single tables. Many operations have a map-reduce implementation that leads to identical results, but for others that's not so obvious (think median() :-)).
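
A small base-R illustration of that difference, using two toy vectors as stand-ins for the same column stored in two separate files (the values are made up for the example): the mean can be rebuilt exactly from per-chunk sums and counts, while the median of per-chunk medians is not the overall median.

```r
# Toy stand-ins for one numeric column stored across two separate files.
chunks <- list(a = c(1, 100), b = c(2, 3, 4, 5, 6))
all_values <- unlist(chunks, use.names = FALSE)

# Map: compute partial results per chunk. Reduce: add them up across chunks.
partials <- lapply(chunks, function(x) c(sum = sum(x), n = length(x)))
totals <- Reduce(`+`, partials)
chunked_mean <- totals[["sum"]] / totals[["n"]]
isTRUE(all.equal(chunked_mean, mean(all_values)))  # TRUE

# The same trick does not carry over to the median.
chunk_medians <- vapply(chunks, median, numeric(1))
median(chunk_medians)  # 27.25
median(all_values)     # 4
```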

MarcusKlik, Feb 21 '19 09:02