Feature request: conditional select doesn't work
library(fsttable)
nr_of_rows <- 5
x <- data.table::data.table(X = 1:nr_of_rows, Y = LETTERS[1 + (1:nr_of_rows) %% 26])
fst::write_fst(x, "1.fst")
ft <- fst_table("1.fst")
ft
ft[X==1] # this doesn't work
Really like this package, thanks!
Hi @ssh352, thanks for reporting the issue!
Yes, expressions in the i argument are not implemented yet. With fsttable, the challenge is to minimize RAM usage for an expression such as ft[i, j, by = "z"].
A first strategy might be:
- Scan arguments i, j and by for column names
- Load columns needed by selector i and read them
- Calculate the selector i (as integer or logical vector)
- Discard all loaded columns not needed by j or by
- Load (extra) columns needed for the by argument
- Apply selector (i) after reading each column
- Calculate grouping mask from by argument
- Discard loaded columns not needed by j
- Load (extra) columns needed by j
- Apply selector (i) after reading each column
- Process j taking into account the grouping mask (from by)
If needed, columns needed in more than one place can be discarded and reloaded to save RAM.
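The steps above can be sketched with plain fst and data.table calls. This is only an illustration, not how fsttable implements (or will implement) it: the example data, the column names X, Y and Z, and the selector X > 2 are all made up, and only `fst::read_fst()` with its `columns` argument is used to load one column at a time.

```r
library(data.table)

# Hypothetical example file with a grouping column Z (names are assumptions)
path <- tempfile(fileext = ".fst")
x <- data.table(X = 1:6, Y = LETTERS[1:6], Z = rep(c("a", "b"), 3))
fst::write_fst(x, path)

# Load only the column needed by the selector i and compute the selector
sel_col <- fst::read_fst(path, columns = "X", as.data.table = TRUE)
selector <- which(sel_col$X > 2)  # integer selector
rm(sel_col)                       # discard columns that are no longer needed

# Load the column needed by `by`, applying the selector right after reading
by_col <- fst::read_fst(path, columns = "Z", as.data.table = TRUE)[selector]

# Load the column needed by j (reloaded here, since it was discarded above)
j_col <- fst::read_fst(path, columns = "X", as.data.table = TRUE)[selector]

# Process j per group
dt <- data.table(Z = by_col$Z, X = j_col$X)
result <- dt[, .(sum_X = sum(X)), by = Z]
```

Column Y is never read from disk, which is where the memory saving comes from in this sketch.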
In a second stage, the combination of the grouping mask and the selector can be supplied to the underlying fst library for efficient reading. That way, we don't need to load complete columns from disk, and the grouping is done automatically, saving memory.
In a third stage, j can be applied to smaller (vertical) chunks of the dataset by loading only a few groups (by argument) at a time.
In a later stage, this processing of chunked datasets should be possible during background loading of the next chunk, to further increase performance.
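As a rough illustration of chunked processing, the sketch below reads row ranges with the `from` and `to` arguments of `fst::read_fst()` and combines partial results. Note the assumptions: it chunks by row range rather than by group as described above, it only works for aggregates that can be combined from partial results (such as a sum), and the data, column names and `chunk_size` are hypothetical.

```r
library(data.table)

# Hypothetical example file (column names are assumptions)
path <- tempfile(fileext = ".fst")
x <- data.table(X = 1:10, Z = rep(c("a", "b"), 5))
fst::write_fst(x, path)

n_rows <- fst::metadata_fst(path)$nrOfRows
chunk_size <- 4L  # tiny on purpose, for illustration only
starts <- seq(1L, n_rows, by = chunk_size)

# Read one row range at a time and aggregate within each chunk
partials <- lapply(starts, function(from) {
  to <- min(from + chunk_size - 1L, n_rows)
  chunk <- fst::read_fst(path, columns = c("X", "Z"),
                         from = from, to = to, as.data.table = TRUE)
  chunk[, .(sum_X = sum(X)), by = Z]
})

# Combine the per-chunk results into the final grouped result
chunked_result <- rbindlist(partials)[, .(sum_X = sum(sum_X)), by = Z]
```

Only one chunk plus the small partial results are in memory at any time.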
This was just to order my thoughts, thanks again for submitting the issue!
@MarcusKlik Hi Marcus, are there any updates regarding the group-by functionality? I have a very big fst file, which I loaded with fsttable, and it was very fast. Now the challenge involves some merging and by operations with normal data.tables. I was thinking of extracting the relevant columns. The grouping and merge functionalities would be very useful.
Hi @akrun1, thanks for your request. Unfortunately, the group-by functionality is not implemented yet. If reading the fst file gets you into memory problems, your suggestion to read selections of columns before each grouping is probably your best option!
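A minimal sketch of that workaround, assuming a hypothetical file with columns id, value and unused, and a made-up lookup table to merge against. It reads only the columns needed with `fst::read_fst()` and then groups and merges with regular data.table operations:

```r
library(data.table)

# Hypothetical example file (all names are assumptions)
path <- tempfile(fileext = ".fst")
fst::write_fst(
  data.table(id = c(1L, 2L, 1L, 3L),
             value = c(10, 20, 30, 40),
             unused = letters[1:4]),
  path
)

# Read only the columns needed for the grouping
dt <- fst::read_fst(path, columns = c("id", "value"), as.data.table = TRUE)
totals <- dt[, .(total = sum(value)), by = id]

# Merge the grouped result with a normal in-memory data.table
lookup <- data.table(id = 1:3, label = c("one", "two", "three"))
merged <- merge(totals, lookup, by = "id")
```

The unused column never leaves the disk, so peak memory depends on the selected columns rather than on the full file.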
thanks