fsttable

Feature request: conditional select doesn't work

Open ssh352 opened this issue 6 years ago • 3 comments

  library(fsttable)
  nr_of_rows <- 5
  x <- data.table::data.table(X = 1:nr_of_rows, Y = LETTERS[1 + (1:nr_of_rows) %% 26])
  fst::write_fst(x, "1.fst")
  ft <- fst_table("1.fst")
  ft
  ft[X==1] # this doesn't work

Really like this package, thanks!

ssh352, Sep 25 '19 16:09

Hi @ssh352, thanks for reporting the issue!

Yes, expressions in the i argument are not implemented yet. With fsttable, the challenge is to minimize RAM usage when evaluating an expression such as

ft[i, j, by = "z"].

A first strategy might be:

  1. Scan arguments i, j and by for column names
  2. Load the columns needed by selector i
  3. Calculate the selector i (as integer or logical vector)
  4. Discard all loaded columns not needed by j or by
  5. Load (extra) columns needed for the by argument
  6. Apply selector (i) after reading each column
  7. Calculate grouping mask from by argument
  8. Discard loaded columns not needed by j
  9. Load (extra) columns needed by j
  10. Apply selector (i) after reading each column
  11. Process j taking into account the grouping mask (from by)

If necessary, columns that are used in more than one place can be discarded and reloaded to save RAM.
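As a rough illustration of this first strategy, here is a minimal sketch built on the fst and data.table packages. The helper staged_query and its by_cols parameter are hypothetical and not part of fsttable. It assumes i and j are passed as quoted expressions, that i references at least one column, and it collapses steps 4 to 10 into a single read for brevity.

```r
library(fst)
library(data.table)

# Hypothetical helper (not part of fsttable): answers dt[i, j, by] from an fst
# file while holding only the columns each stage needs in memory.
# Assumes `i` and `j` are quoted expressions and `i` uses at least one column.
staged_query <- function(path, i, j, by_cols) {
  all_cols <- fst::metadata_fst(path)$columnNames

  # Steps 1-3: load only the columns used by the selector, then evaluate it.
  # The temporary data frame is not kept, so those columns can be freed.
  i_cols <- intersect(all.vars(i), all_cols)
  sel <- eval(i, fst::read_fst(path, columns = i_cols), parent.frame())

  # Steps 4-10 (simplified): load the columns needed by `by` and `j`,
  # keeping only the rows picked by the selector
  jb_cols <- union(by_cols, intersect(all.vars(j), all_cols))
  dt <- fst::read_fst(path, columns = jb_cols, as.data.table = TRUE)[sel]

  # Step 11: evaluate j per group
  dt[, eval(j), by = by_cols]
}

# Example data written to a temporary file (made up for this sketch)
path <- tempfile(fileext = ".fst")
fst::write_fst(data.table(X = 1:6, Y = rep(c("A", "B"), 3)), path)

res <- staged_query(path, quote(X > 2), quote(list(total = sum(X))), by_cols = "Y")
print(res)
```

Note that this still reads complete columns from disk; it only limits which columns are in memory at the same time.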

In a second stage, the combination of the grouping mask and the selector can be supplied to the underlying fst library to do efficient reading. That way, we don't need to load complete columns from disk, and the grouping is done automatically, which saves memory.

In a third stage, j can be applied to smaller (vertical) chunks of the dataset by loading only a few groups (by argument) at a time.
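Until something like this exists in fsttable, a similar effect can be approximated by hand with the from and to arguments of fst::read_fst. The sketch below is only an illustration under a strong assumption: the file is already sorted by the grouping column, so every group occupies a contiguous row range. The column names g and v and the example data are made up, and it loads one group per chunk rather than a few.

```r
library(fst)
library(data.table)

# Example file, sorted by the grouping column g (assumption of this sketch)
path <- tempfile(fileext = ".fst")
fst::write_fst(data.table(g = rep(c("a", "b", "c"), each = 4), v = 1:12), path)

# Load only the grouping column to find the row range of each group
g <- fst::read_fst(path, columns = "g")$g
runs <- rle(g)
ends <- cumsum(runs$lengths)
starts <- ends - runs$lengths + 1L

# Read and process one group at a time, so only one chunk is in memory
results <- lapply(seq_along(starts), function(k) {
  chunk <- fst::read_fst(path, from = starts[k], to = ends[k],
                         as.data.table = TRUE)
  chunk[, list(total = sum(v)), by = g]
})

res <- rbindlist(results)
print(res)
```

If the file is not sorted by the grouping column, the groups are scattered over the rows and this row-range approach does not apply.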

In a later stage, this processing of chunked datasets should be possible during background loading of the next chunk, to further increase performance.

This was just to order my thoughts, thanks again for submitting the issue!

MarcusKlik, Sep 26 '19 13:09

@MarcusKlik Hi Marcus, are there any updates regarding the group-by functionality? I have a very big fst file, which I loaded with fsttable, and it was very fast. Now the challenge involves some merging and by operations with normal data.tables. I was thinking of extracting the relevant columns. The grouping and merge functionalities would be very useful.

akrun1, Aug 14 '20 02:08

Hi @akrun1, thanks for your request. Unfortunately, the group-by functionality is not implemented yet. If reading the whole fst file gets you into memory problems, your suggestion to read only the relevant columns before each grouping is probably your best option!
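For example, something along these lines might work. This is only a sketch: the file, the column names id, value and unused, and the lookup table are made up for illustration. It uses the columns argument of fst::read_fst to load just what the grouping and merge need.

```r
library(fst)
library(data.table)

# Made-up example file; `unused` stands in for columns you don't need
path <- tempfile(fileext = ".fst")
fst::write_fst(
  data.table(id = c(1L, 2L, 1L, 2L),
             value = c(10, 20, 30, 40),
             unused = letters[1:4]),
  path
)

# Read only the columns needed for the grouping, as a normal data.table
big <- fst::read_fst(path, columns = c("id", "value"), as.data.table = TRUE)

# Regular data.table grouping and merging from here on
totals <- big[, list(total = sum(value)), by = id]
lookup <- data.table(id = 1:2, label = c("first", "second"))
merged <- merge(totals, lookup, by = "id")
print(merged)
```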

thanks

MarcusKlik, Aug 18 '20 18:08