Should `metadata_fst` and `fst` come to a united workflow?

Open hope-data-science opened this issue 5 years ago • 20 comments

fst is a good function for getting general information. I think metadata_fst could be renamed to "summary_fst" for the list returned by fst. It feels a little odd that both fst and metadata_fst receive the path. Just my opinion, thanks. I think fst should be used more widely by data scientists in the future.

hope-data-science avatar Jan 13 '20 10:01 hope-data-science

One concrete need: how can I select the columns I want by column name, e.g. all columns whose names end with a number, or that match any regular expression? Thanks.

hope-data-science avatar Jan 13 '20 10:01 hope-data-science

I've found a way, for your reference. Maybe it could help with the design of the function.

library(fst)
library(tidyverse)

fst("path_name") -> ft
ft[,names(ft) %>% str_detect("pattern1|pattern2|...")] %>%  # no spaces around `|`, they would be matched literally; or use `str_subset`
  as_tibble() -> dt
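For comparison, the same selection can be sketched with fst and base R only. This is a minimal sketch, not a proposed API: the demo file written from iris and the pattern are assumptions for illustration. It uses `metadata_fst()` to get the column names without reading any column data, and the `columns` argument of `read_fst()` to load only the matches.

```r
library(fst)

# demo file (an assumption for illustration)
path <- tempfile(fileext = ".fst")
write_fst(iris, path)

# column names come from the file metadata; no column data is read here
col_names <- metadata_fst(path)$columnNames

# hypothetical pattern: names starting with "Sepal" or ending in "Width"
wanted <- grep("^Sepal|Width$", col_names, value = TRUE)

# load only the matching columns; the result is a data.frame
dt <- read_fst(path, columns = wanted)
```

Because only the selected columns are read from disk, this stays cheap even when the file has many columns.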

hope-data-science avatar Jan 13 '20 11:01 hope-data-science

Hi @hope-data-science, thanks for your request!

If you want to select columns using dplyr semantics, perhaps @krlmlr's implementation of the dplyr syntax for fst is a good match for you:

# dplyr implementation for fst
devtools::install_github("krlmlr/fstplyr")

library(fst)
library(fstplyr)
library(dplyr)

# prepare database
path <- tempfile()
dir.create(path)
write_fst(iris, file.path(path, "iris.fst"))
write_fst(mtcars, file.path(path, "mtcars.fst"))

# select columns according to substrings
src_fst(path) %>%
  tbl("iris") %>%
  select(contains("Sepal"))
#> # A tibble: 150 x 2
#>    Sepal.Length Sepal.Width
#>           <dbl>       <dbl>
#>  1          5.1         3.5
#>  2          4.9         3  
#>  3          4.7         3.2
#>  4          4.6         3.1
#>  5          5           3.6
#>  6          5.4         3.9
#>  7          4.6         3.4
#>  8          5           3.4
#>  9          4.4         2.9
#> 10          4.9         3.1
#> # … with 140 more rows

Is that what you were thinking of?

MarcusKlik avatar Jan 14 '20 14:01 MarcusKlik

Thank you for the feedback. Actually, I am not quite satisfied with this solution, and I have some other ideas in mind. Below are the verbs I think should be in the design:

  1. parse_fst: same as fst right now, but easier to understand; returns a fst class
  2. slice_fst: selects rows by row number
  3. select_fst: selects columns by name, regular expression or column number
  4. summary_fst: directly gives basic information about the data.frame, just as metadata_fst does, with the number of rows and columns, all the column names and the first few lines (perhaps with the last few as well, as data.table does, just like the printed format of the fst class)
  5. filter_fst: this could be realized by first extracting the variables used in the condition to get the row indices, then using slice_fst to get the subset of data.

The ultimate goal of these APIs is to let the user get to know the data efficiently and extract the target information as soon as possible.

I am not very familiar with metaprogramming in R, but if you are interested in these ideas, let me know if there's anything I can help with. Again, thanks for your contribution.

hope-data-science avatar Jan 14 '20 15:01 hope-data-science

Hi, thanks for sharing your ideas!

In general, I think that the R community is blessed with very powerful APIs for working with tables. In my opinion, the data.table and dplyr interfaces provide all the flexibility and performance a data scientist might need. And new interfaces like rquery are also emerging, providing ample choice, I think.

In time, the goal for fst is to provide these well known interfaces in separate packages or integrated into fst, enabling off-line operations on large datasets. That will feel familiar to the user and probably help a lot with adoption of these packages.

Your summary request is a good idea, I think. The offline table created with fst() should have a summary override, so for an offline table ft:

# create fst file
path <- tempfile()
fst::write_fst(iris, path)

# offline fst table
ft <- fst::fst(path)

the user should be able to call summary():

# display summary
ft %>%
  summary()
#>   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
#>  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
#>  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
#>  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
#>  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
#>  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
#>  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
#>        Species  
#>  setosa    :50  
#>  versicolor:50  
#>  virginica :50  
#>                 
#>                 
#> 

You can do that now using ft[, ] %>% summary(), but the idea is that the results are calculated without any memory overhead, so without actually loading the entire table into RAM.

Would that be a useful addition to fst?

MarcusKlik avatar Jan 15 '20 09:01 MarcusKlik

Actually, the current state is quite good, good enough that although I wanted to write a tidyfst for it, I stopped after using fst further. (The API name could be changed to a verb though, because turning a character string into a fst_table is somewhat counterintuitive; my own preference is parse_fst. It is not a big deal after all.) Simply calling the fst function already gives lots of information, and just typing the fst_table shows the number of rows and columns and the data type and name of each column. That is actually all I want, and the names work very well. summary might not be a useful function, and it is truly expensive (we have to read all the data to get this summary).

How can users get the target information directly from the limited information provided by fst_table? The data.frame way is good; a "tidy" API might be good but is not essential. I do prefer parsing a fst and then typing ft to get all the general information. Then, from this information, I select the target data I want by column selection and row selection. And filter_fst is actually a good design in my opinion, because it is efficient (use the conditional expression to get the row indices, then get the target information). I do not actually need it myself though; I usually select all the columns I need and filter directly using dplyr or data.table.

A small improvement might be this: when I select one column using the index, the column name is never returned, e.g. ft[,1] returns only the vector of the first column, not a data.frame with the column name.
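Until that changes, one possible workaround is to go through `read_fst()`, which returns a data.frame even for a single column. A minimal sketch, assuming a demo file written from iris:

```r
library(fst)

# demo file (an assumption for illustration)
path <- tempfile(fileext = ".fst")
write_fst(iris, path)
ft <- fst(path)

# select by position, but pass the name to read_fst() so the result
# is a one-column data.frame that keeps its column name
first_col <- read_fst(path, columns = names(ft)[1])
```

Here `names(ft)[1]` translates the column position into a name, so the call still behaves like selecting by index.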

hope-data-science avatar Jan 17 '20 05:01 hope-data-science

Some prototypes of my functions (not all of them work, but they reflect some of my thinking):


library(fst)
library(stringr)
library(magrittr)



parse_fst = fst::fst

slice_fst = function(ft,numeric_vector){
  ft[numeric_vector,]
}

select_fst = function(ft,...){
  UseMethod("select_fst")
}

  
select_fst.numeric = function(ft,...){
  ft[,...]
}

select_fst.character = function(ft,...){
  substitute(list(...)) %>% deparse() -> col_vars
  ft[[col_vars]]
}


filter_fst = function(ft,...){
  # select columns in the condition
  # get the row number that TRUE for the condition
  # use the row number to get the target information
}

summary_fst = function(ft) ft

hope-data-science avatar Jan 17 '20 05:01 hope-data-science

Hi @hope-data-science,

thanks for sharing, I understand that you would like to create a new package tidyfst along the lines of your tidydt package, right?

That's nice, please let me know if you need any help with that once you've created the package!

About the summary() method; the idea is to provide the summary information without loading the full table. Columns can be parsed one at a time, and in a later upgrade, the statistics can be provided by processing the column chunks directly in fstlib, requiring even less memory (and gaining parallel performance enhancements).
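The column-at-a-time idea can already be approximated from R. The following is only a sketch of that idea, not the planned fstlib implementation: `summary_by_column()` is a hypothetical helper, and the demo file is an assumption. Each column is read, summarised, and released before the next one is loaded.

```r
library(fst)

# demo file (an assumption for illustration)
path <- tempfile(fileext = ".fst")
write_fst(iris, path)

# hypothetical helper: summarise one column at a time, so that only a
# single column is held in memory at any moment
summary_by_column <- function(path) {
  col_names <- metadata_fst(path)$columnNames
  out <- lapply(col_names, function(col) {
    summary(read_fst(path, columns = col)[[col]])
  })
  names(out) <- col_names
  out
}

res <- summary_by_column(path)
res$Sepal.Length
```

The result is a named list with one summary per column rather than the formatted table that `summary()` prints for a data.frame, and peak memory use is bounded by the largest single column.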

As you describe in your comments, there are various ways for the user to gain information on the columns, types and content of a table stored in a fst file. The idea for fst is to provide an API that feels like working with in-memory files, so it stays as close as possible to the existing APIs.

But if you would like to provide a slightly altered API in your tidyfst package, one that is better tailored to your requirements, I'm all for it and will help you where I can, of course!

MarcusKlik avatar Jan 17 '20 09:01 MarcusKlik

Currently there are only two issues I am facing: 1. Selecting a single column returns a single-class vector rather than a data.frame; 2. fst is used to deal with big data, so perhaps data.table should be the default return type. This is much safer. I've put some functions into tidydt instead of making a new package named tidyfst, and I hope it can be released soon.

hope-data-science avatar Jan 17 '20 14:01 hope-data-science

This is what I have so far (https://hope-data-science.github.io/tidydt/reference/fst.html). I think it is good enough for me, and it does not overlap much with the current excellent fst.

hope-data-science avatar Jan 17 '20 21:01 hope-data-science

Hi @hope-data-science, nice work on the package and please let me know if you have any further questions!

MarcusKlik avatar Jan 19 '20 22:01 MarcusKlik

I knew this day would come, but I did not realize it would come so quickly. Here is the issue: I can never read the whole chunk of data into memory to do further calculation. I know that disk.frame could use distributed computation, but I still think that some aggregation could be done within fst alone.

My real problem is: I have a daily dataset and want to get a yearly dataset by aggregation, but I can never read the whole chunk into memory. One solution might be to split the data by group and process the groups separately, but that still requires recognizing the groups within the fst file. Is there a way to do it?

Thanks.
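One way to approach this kind of aggregation is to read the file in row chunks with the `from` and `to` arguments of `read_fst()`, aggregate each chunk, and then combine the small partial results. This is a minimal sketch under stated assumptions: the demo data, the column names `date` and `value`, and the chunk size are all hypothetical.

```r
library(fst)

# demo data (an assumption): two years of daily values
daily <- data.frame(
  date  = seq(as.Date("2018-01-01"), as.Date("2019-12-31"), by = "day"),
  value = 1
)
path <- tempfile(fileext = ".fst")
write_fst(daily, path)

n_rows     <- metadata_fst(path)$nrOfRows
chunk_size <- 100  # hypothetical; choose what fits in memory
starts     <- seq(1, n_rows, by = chunk_size)

# aggregate each chunk separately, keeping only the small partial results
partials <- lapply(starts, function(from) {
  to    <- min(from + chunk_size - 1, n_rows)
  chunk <- read_fst(path, from = from, to = to)
  chunk$year <- format(chunk$date, "%Y")
  aggregate(value ~ year, data = chunk, FUN = sum)
})

# combine the partial sums into yearly totals
yearly <- aggregate(value ~ year, data = do.call(rbind, partials), FUN = sum)
```

Combining partial results like this works directly for sums, counts, minima and maxima. A mean would need the partial sums and counts carried separately and divided at the end.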

hope-data-science avatar Jan 20 '20 12:01 hope-data-science

Hi @hope-data-science, at the moment, you can do a row-selection of data using:

library(fst)

# get a reference to an on-disk dataset
write_fst(cars, "cars.fst")
ft <- fst("cars.fst")

# on-disk subsetting
ft[ft$speed == 20, ]
#>   speed dist
#> 1    20   32
#> 2    20   48
#> 3    20   52
#> 4    20   56
#> 5    20   64

This will be efficient for this particular dataset, because column speed is ordered. But for unordered selections, the current implementation reads the complete range [row_min, row_max] between the first selected row and the last, so you might read more data than actually required.

(that will be remedied in the future however...)
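The row selection above can be wrapped into a small helper along the lines of the filter_fst idea discussed earlier in this thread. This is only a sketch: the signature of `filter_fst()` here is hypothetical, and the case where no row matches is not handled.

```r
library(fst)

# demo file (an assumption for illustration)
path <- tempfile(fileext = ".fst")
write_fst(cars, path)
ft <- fst(path)

# hypothetical helper: `keep` is a function that takes the condition
# column and returns a logical vector
filter_fst <- function(ft, column, keep) {
  # read only the column used in the condition
  selected <- keep(ft[[column]])
  # use the logical vector to fetch the matching rows from disk
  ft[selected, ]
}

res <- filter_fst(ft, "speed", function(x) x == 20)
```

Only the condition column is read in full. The remaining columns are read for the row range that covers the selection, as described above.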

MarcusKlik avatar Jan 23 '20 09:01 MarcusKlik

Nice! I think I can write the filter_fst function right away. This should be useful even if it is currently not fast enough.

hope-data-science avatar Jan 23 '20 10:01 hope-data-science

The work is almost done, though not published on CRAN yet. See https://hope-data-science.github.io/tidydt/reference/fst.html.

hope-data-science avatar Feb 01 '20 23:02 hope-data-science

Hi @hope-data-science, nice work, thanks for the heads up!

MarcusKlik avatar Feb 05 '20 09:02 MarcusKlik

Is there a way for me to get the path from a fst_table?

hope-data-science avatar Mar 19 '20 06:03 hope-data-science

Hi @hope-data-science, you can get the original filename from the fst_table object using:

library(magrittr)  # provides the %>% pipe used below

tmp_file <- tempfile(".fst")

# example dataset
data.frame(X = 1:1000, Y = sample(1:10, 1000, replace = TRUE)) %>%
  fst::write_fst(tmp_file)

# get reference to file
fst_table <- fst::fst(tmp_file)

# low-level access to file path
.subset2(fst_table, "meta")$path
#> [1] "C:\\Users\\Mark\\AppData\\Local\\Temp\\RtmpMHO0fZ\\.fst36947d04290"

You can't get the list object through the normal channels (with [[ or $ operators) because of the operator overrides defined on the fst_table object :-)

MarcusKlik avatar Apr 03 '20 12:04 MarcusKlik

This is great! I could try to build more facilities on it if possible.

hope-data-science avatar Apr 03 '20 12:04 hope-data-science