fst
Should `metadata_fst` and `fst` be combined into a unified workflow?
fst is a good function for getting general information. I think metadata_fst could be renamed "summary_fst" for the list returned by fst. It feels a little odd that both fst and metadata_fst receive the path. Just an opinion, thanks. I think fst should be used more widely by data scientists in the future.
One concrete need: how can I select columns by name, e.g. all columns whose names end with a number, or that match any regular expression? Thanks.
I've found a way, for your reference; maybe it could help with the design of the function.
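library(fst)
library(tidyverse)
# open a reference to the fst file
ft <- fst("path_name")
# keep only the columns whose names match the regular expression
# (str_subset returns the matching names; str_detect would give a logical vector)
dt <- ft[, str_subset(names(ft), "pattern1|pattern2|...")] %>%
  as_tibble()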
Hi @hope-data-science, thanks for your request!
If you want to select columns using dplyr semantics, perhaps @krlmlr's implementation of the dplyr syntax for fst is a good match for you:
# dplyr implementation for fst
devtools::install_github("krlmlr/fstplyr")
library(fst)
library(fstplyr)
library(dplyr)
# prepare database
path <- tempfile()
dir.create(path)
write_fst(iris, file.path(path, "iris.fst"))
write_fst(mtcars, file.path(path, "mtcars.fst"))
# select columns according to substrings
src_fst(path) %>%
tbl("iris") %>%
select(contains("Sepal"))
#> # A tibble: 150 x 2
#> Sepal.Length Sepal.Width
#> <dbl> <dbl>
#> 1 5.1 3.5
#> 2 4.9 3
#> 3 4.7 3.2
#> 4 4.6 3.1
#> 5 5 3.6
#> 6 5.4 3.9
#> 7 4.6 3.4
#> 8 5 3.4
#> 9 4.4 2.9
#> 10 4.9 3.1
#> # … with 140 more rows
Is that what you were thinking?
Thank you for the feedback. Actually, I am not quite satisfied with this solution; I have some other ideas in mind. Below are the verbs I think should be in the design:
- parse_fst: same as fst right now, but easier to understand; returns an object of the fst class
- slice_fst: select rows by row number
- select_fst: select columns by name, regular expression or column number
- summary_fst: directly gives basic information about the data.frame, like metadata_fst provides: number of rows, number of columns, all the column names and the first few lines (perhaps also the last few, as data.table does, and as in the printed format of the fst class)
- filter_fst: this could be implemented by first extracting the variables used in the condition to get the index of the matching rows, then using slice_fst to get the subset of the data.
The ultimate goal of these APIs is to let the user get to know the data efficiently and extract the target information as soon as possible.
I am not very familiar with metaprogramming in R, but if you are interested in these ideas, let me know if there is anything I can help with. Again, thanks for your contribution.
Hi, thanks for sharing your ideas!
In general, I think that the R community is blessed with very powerful APIs for working with tables. In my opinion, the data.table and dplyr interfaces provide all the flexibility and performance a data scientist might need. New interfaces like rquery are also emerging, so there is ample choice.
In time, the goal for fst is to provide these well-known interfaces, in separate packages or integrated into fst, enabling off-line operations on large datasets. That will feel familiar to the user and probably help a lot with adoption of these packages.
Your summary request is a good idea, I think. The offline table created with fst() should have a summary override, so for an offline table ft:
# create fst file
path <- tempfile()
fst::write_fst(iris, path)
# offline fst table
ft <- fst::fst(path)
the user should be able to call summary():
# display summary
ft %>%
summary()
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
#> 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
#> Median :5.800 Median :3.000 Median :4.350 Median :1.300
#> Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
#> 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
#> Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
#> Species
#> setosa :50
#> versicolor:50
#> virginica :50
#>
#>
#>
You can do that now using ft[, ] %>% summary(), but the idea is that the results are calculated without any memory overhead, so without actually loading the entire table into RAM.
Would that be a useful addition to fst?
Actually, the current state is quite good, good enough that although I wanted to write a tidyfst for it, I stopped after using fst further. (The API name could be changed to a verb though, because turning a character string into an fst_table is somewhat counter-intuitive; my own preference is parse_fst. It is not a big deal after all.) Simply using the fst function already gives a lot of information, and just typing the fst_table shows the number of rows and columns and the data type and name of each column. That is actually all I want, and the names work very well. summary might not be that useful a function, and it is truly expensive (we have to read all the data to get this summary).
How can users get to the target information directly from the limited information provided by the fst_table? The data.frame way is good; a "tidy" API might be nice but is not essential. I prefer parsing an fst file and then typing ft to get all the general information. From that information I then pick the target data by column selection and row selection. And filter_fst is actually a good design in my opinion, because it is efficient (it uses the conditional expression to get the row index and then the target information). I do not actually need it myself, though; I usually select the whole columns I need and filter directly using dplyr or data.table.
A small improvement might be: when I select one column by index, the column name is lost, e.g. ft[,1] returns only the vector of the first column rather than a data.frame with the column name.
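As a possible workaround for the single-column case, here is a minimal sketch that reads one column with read_fst() and its columns argument, which returns a data.frame and so keeps the column name. The temporary file and the iris data are placeholders chosen for illustration only.

```r
library(fst)

# placeholder data set written to a temporary fst file
path <- tempfile(fileext = ".fst")
write_fst(iris, path)

# look up the name of the first column from the fst_table reference
first_col <- names(fst(path))[1]

# read just that column; the result is a data.frame, not a bare vector
one_col <- read_fst(path, columns = first_col)

is.data.frame(one_col)
names(one_col)
```

This goes through the file path rather than the fst_table object, so it sidesteps the subsetting operator entirely.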
Some prototype of my functions (that are not all working, but could reflect some thinking):
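library(fst)
library(stringr)
# open a reference to an fst file
parse_fst <- fst::fst
# select rows by row number
slice_fst <- function(ft, row_numbers) {
  ft[row_numbers, ]
}
# select columns; dispatch on the selector `cols`, not on the fst_table
select_fst <- function(ft, cols) {
  UseMethod("select_fst", cols)
}
# numeric selector: column positions
select_fst.numeric <- function(ft, cols) {
  ft[, cols]
}
# character selector: a single regular expression matched against the column names
select_fst.character <- function(ft, cols) {
  ft[, str_subset(names(ft), cols)]
}
filter_fst <- function(ft, ...) {
  # not implemented yet; the plan is to:
  # select columns in the condition
  # get the row numbers for which the condition is TRUE
  # use the row numbers to get the target information
}
# printing the fst_table already shows dimensions, column names and types
summary_fst <- function(ft) ft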
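The three steps outlined in the filter_fst comments could be sketched roughly as follows. This is only an illustration under assumptions: `filter_fst_sketch`, `condition_cols` and `predicate` are hypothetical names, the function takes a file path instead of an fst_table, and it reads the whole row range between the first and last match in one go.

```r
library(fst)

# hypothetical helper, not part of fst
filter_fst_sketch <- function(path, condition_cols, predicate) {
  # 1. read only the columns needed to evaluate the condition
  cond_data <- read_fst(path, columns = condition_cols)

  # 2. row numbers for which the condition is TRUE
  rows <- which(predicate(cond_data))
  if (length(rows) == 0L) {
    # no match: return a zero-row data.frame with the file's columns
    return(read_fst(path, from = 1, to = 1)[0, , drop = FALSE])
  }

  # 3. read the covering row range once, then keep the matching rows
  from <- min(rows)
  block <- read_fst(path, from = from, to = max(rows))
  block[rows - from + 1L, , drop = FALSE]
}

# usage with a small example file
path <- tempfile(fileext = ".fst")
write_fst(cars, path)
res <- filter_fst_sketch(path, "speed", function(d) d$speed == 20)
```

When the matching rows are scattered across the file, step 3 reads more rows than it returns, so the gain comes mainly from step 1 touching only the condition columns.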
Hi @hope-data-science,
thanks for sharing, I understand that you would like to create a new package tidyfst along the lines of your tidydt package, right?
That's nice, please let me know if you need any help with that once you've created the package!
About the summary() method: the idea is to provide the summary information without loading the full table. Columns can be parsed one at a time, and in a later upgrade, the statistics can be provided by processing the column chunks directly in fstlib, requiring even less memory (and gaining parallel performance enhancements).
As you describe in your comments, there are various ways for the user to gain information on columns, types and content of a table stored in a fst file. The idea for fst is to provide an API that feels like working with in-memory files, so stay as close as possible to the existing APIs.
But if you would like to provide a slightly altered API in your tidyfst package, one that is better tailored to your requirements, I'm all for it and will help you where I can, of course!
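Until such a summary() method exists, the column-at-a-time idea can be approximated from user code. The sketch below is an assumption about how one might do it, not the planned implementation: it reads each column separately with read_fst() and summarises it, so only one column is held in memory at a time. The iris file is a placeholder.

```r
library(fst)

# placeholder data set written to a temporary fst file
path <- tempfile(fileext = ".fst")
write_fst(iris, path)

# column names come from the fst_table reference, without reading the data
col_names <- names(fst(path))

# read and summarise one column at a time
col_summaries <- lapply(col_names, function(col) {
  summary(read_fst(path, columns = col)[[col]])
})
names(col_summaries) <- col_names

col_summaries$Sepal.Length
```

The result is a named list of per-column summaries rather than the single table that summary() prints for a data.frame.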
Currently there are only two issues I am facing: 1. Selecting a single column returns a single vector rather than a data.frame; 2. fst is used to deal with big data, so perhaps data.table should be the default return type, which is much safer. I've put some functions into tidydt instead of making a new package named tidyfst; I hope it can be released soon.
This is what I have so far (https://hope-data-science.github.io/tidydt/reference/fst.html). I think it is good enough for me, and it does not overlap much with the current excellent fst.
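For the second point, read_fst() already has an as.data.table argument, so a data.table can be requested explicitly even though it is not the default. A minimal sketch, assuming the data.table package is installed (the guard skips the call otherwise) and using a placeholder file:

```r
library(fst)

# placeholder data set written to a temporary fst file
path <- tempfile(fileext = ".fst")
write_fst(iris, path)

# as.data.table = TRUE needs the data.table package to be installed
if (requireNamespace("data.table", quietly = TRUE)) {
  dt <- read_fst(path, columns = "Species", as.data.table = TRUE)
  print(class(dt))
}
```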
Hi @hope-data-science, nice work on the package and please let me know if you have any further questions!
I knew this day would come, but did not realize it would come so quickly. There is an issue: I cannot read the whole chunk of data into memory to do further calculation. I know that disk.frame could use distributed computation, but I still think that some aggregation could be done within fst itself.
My real problem is: I have a daily dataset and want to get a yearly dataset by aggregation, but I cannot read the whole dataset into memory. One solution might be to split the data by group and process each group separately, but that still requires identifying the groups within the fst file. Is there a way to do it?
Thanks.
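One way to approach the daily-to-yearly aggregation without holding the whole file in memory is to read fixed-size row chunks with the from and to arguments of read_fst(), aggregate each chunk, and then combine the partial results. The sketch below is an illustration under assumptions: the `year` and `value` columns, the example data and the chunk size are all made up, and it relies on the aggregate being a sum, which can be combined across chunks by summing again.

```r
library(fst)

# made-up example: one value per day over three years
path <- tempfile(fileext = ".fst")
days <- seq(as.Date("2018-01-01"), as.Date("2020-12-31"), by = "day")
daily <- data.frame(year = as.integer(format(days, "%Y")),
                    value = seq_along(days))
write_fst(daily, path)

# number of rows, taken from the fst_table reference
n_rows <- nrow(fst(path))
chunk_size <- 250
starts <- seq(1, n_rows, by = chunk_size)

# aggregate each chunk separately; only one chunk is in memory at a time
partials <- lapply(starts, function(from) {
  to <- min(from + chunk_size - 1, n_rows)
  chunk <- read_fst(path, columns = c("year", "value"), from = from, to = to)
  aggregate(value ~ year, data = chunk, FUN = sum)
})

# combine the partial sums into the yearly totals
yearly <- aggregate(value ~ year, data = do.call(rbind, partials), FUN = sum)
```

For a mean, the same pattern would carry a sum and a count per group through the chunks and divide only at the end.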
Hi @hope-data-science, at the moment, you can do a row-selection of data using:
library(fst)
# get a reference to an on-disk dataset
write_fst(cars, "cars.fst")
ft <- fst("cars.fst")
# on-disk subsetting
ft[ft$speed == 20, ]
#> speed dist
#> 1 20 32
#> 2 20 48
#> 3 20 52
#> 4 20 56
#> 5 20 64
This will be efficient for this particular dataset, because column speed is ordered. But for unordered selections, the current implementation reads the complete range [row_min, row_max] between the first selected row and the last, so you might read more data than actually required.
(that will be remedied in the future however...)
Nice! I think I can write the filter_fst function right away. It should be useful even if, for now, it might not be fast enough.
The work is almost done, though not published on CRAN yet. See https://hope-data-science.github.io/tidydt/reference/fst.html.
Hi @hope-data-science, nice work, thanks for the heads up!
It is on CRAN already, FYI: https://hope-data-science.github.io/tidyfst/articles/example5_fst.html
Is there a way for me to get the path from a fst_table?
Hi @hope-data-science, you can get the original filename from the fst_table object using:
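library(magrittr)  # provides %>%
# note: ".fst" is used here as the file name prefix, not as an extension
tmp_file <- tempfile(".fst")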
# example dataset
data.frame(X = 1:1000, Y = sample(1:10, 1000, replace = TRUE)) %>%
fst::write_fst(tmp_file)
# get reference to file
fst_table <- fst::fst(tmp_file)
# low-level access to file path
.subset2(fst_table, "meta")$path
#> [1] "C:\\Users\\Mark\\AppData\\Local\\Temp\\RtmpMHO0fZ\\.fst36947d04290"
You can't get the list object through the normal channels (with [[ or $ operators) because of the operator overrides defined on the fst_table object :-)
This is great! I could try to build more facilities on top of it if possible.