Feature Request: Port qs Concepts
The qs package seems to have developed some nice concepts.
Specifically, would it be feasible to port byte shuffling and ALTREP character vectors to fst?
It'd be interesting to see qs added to the fst benchmarking too.
Hi @phillc73, thanks for your questions!
Yes, the qs package uses some interesting concepts and these are definitely very relevant to fst as well.
Concerning byte shuffling: in fst I have implemented a custom byte shuffling algorithm that is used for integer and double vectors. The concept is the same as in the Blosc library: compression becomes much easier when similar bytes are close together. For integers, for example, the bytes b1, b2, b3 and b4 in each 16 kB block of the fst format are rearranged in such a way that all the b1's are adjacent. After that, compression is better and faster. The same holds for the 8 bytes of double values. Blosc can use the SSE2 instruction set for shuffling (SIMD extensions), but fst uses cache-optimized algorithms which are also in the multiple-GB/s speed range and optimized for 16 kB blocks (and SSE2 is hard to enable correctly on all platforms).
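To make the idea concrete, here is a minimal sketch of the shuffle concept in plain base R (not fst's optimized C++ implementation): the bytes of a block of integers are regrouped by byte position before compression.

```r
# Minimal illustration of byte shuffling (not fst's actual implementation):
# group the bytes of a block of integers by byte position so that similar
# bytes end up adjacent, which helps block compressors.
shuffle_ints <- function(x) {
  raw_bytes <- writeBin(x, raw(), size = 4L)   # serialize integers to raw bytes
  m <- matrix(raw_bytes, nrow = 4L)            # one column per integer
  as.vector(t(m))                              # regroup: all b1's, then all b2's, ...
}

unshuffle_ints <- function(bytes) {
  n <- length(bytes) %/% 4L
  m <- t(matrix(bytes, nrow = n))              # invert the regrouping
  readBin(as.vector(m), what = "integer", n = n, size = 4L)
}

x <- as.integer(1e6) + 0:9
identical(unshuffle_ints(shuffle_ints(x)), x)  # TRUE: lossless round-trip
```

In fst itself this regrouping happens per 16 kB block, right before the block is handed to the LZ4 or ZSTD compressor.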
Concerning the ALTREP framework, I'm a great fan and think it can be used for many interesting developments in the future. Some possible enhancements for fst:
- Writing datasets that use ALTREP vectors as columns: fst can detect the ALTREP vectors and use blockwise reads of the source data. That would allow conversion of off-memory datasets to the fst format (convert big csv files to fst without much RAM required).
- Reading a fst file and returning the column data as ALTREP vectors: that would allow data to be read only when needed. However, because fst already has random row- and column-access, I don't really see any benefits at the moment in returning a full ALTREP table, except perhaps for using the (still few) ALTREP-aware R methods (such as sum or max).
- Returning character vectors as ALTREP vectors: creating and populating character vectors is very slow in R. That's because of the single-threaded R API that needs to check each newly created string against the existing string pool. Creating an ALTREP character vector is much faster because the actual API calls are not made until later, when the elements are retrieved. So reading character vectors as ALTREP vectors will be much faster, but subsequent operations will be much slower because the strings still need to be materialized (see the short illustration after this list).
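As a small illustration of that deferred-materialization behaviour, using only the ALTREP classes that ship with base R (assuming R >= 3.5.0), not fst:

```r
# Deferred string conversion is one of the ALTREP classes built into R itself
# (R >= 3.5.0). It shows the same pattern described above: creating the
# character vector is cheap, the cost is paid when elements are accessed.
x <- 1:1e7
s <- as.character(x)     # returns a "deferred string" ALTREP vector:
                         # no individual strings are created yet

.Internal(inspect(s))    # prints a deferred string wrapper, not 1e7 strings

head(s)                  # touching elements materializes them through the
                         # (single-threaded) R string API, paying the cost
                         # that was skipped at creation time
```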
This last point is also important when comparing read speeds of different IO solutions. A specific read of an ALTREP vector might be faster but that speed gain might be lost again in subsequent operations. For example, I can see that the qs package uses a std::vector of strings to store the data internally (see here). That vector is materialized when a subsequent operation needs a character vector, so the slow API calls are deferred until that moment.
So the short version (:-)) is that byte shuffling is already implemented, and that enabling ALTREP awareness in fst and returning character vectors as ALTREPs are certainly interesting future enhancements.
thanks!
As a side-note: I'm working on a new package lazyvec that will allow the user to test and create custom ALTREP vectors without writing any C/C++ code. It could be that a better model for combining ALTREP vectors with the fst package is to use lazyvec as a dependency of another package that implements an interface to the on-disk data (such as fsttable).
So fsttable emulates a data.table by using ALTREP vectors for column data. This is done by using the lazyvec, data.table and fst packages as dependencies. The emulated data.table can be used as usual but when data.table needs column data, it is materialized on-the-fly (but not before). This will greatly reduce the memory footprint for large datasets.
(note that these packages are not ready for publication yet, just concepts at this point)
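To make the on-demand idea a bit more concrete, here is a rough sketch using only fst's current API (so not fsttable or lazyvec themselves); the file and column names are made up:

```r
# Sketch of on-demand column materialization with the current fst API
# (fsttable would hide this behind normal data.table syntax via ALTREP).
library(fst)

df <- data.frame(id = 1:1e6, value = rnorm(1e6))
write_fst(df, "big.fst")

metadata_fst("big.fst")    # column names, types and row count,
                           # without reading any column data

# materialize a single column only at the moment it is needed
value_col <- read_fst("big.fst", columns = "value")$value
mean(value_col)
```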
Thanks for the detailed responses. I certainly wasn't aware that byte shuffling was already integrated in fst and I'm looking forward to using lazyvec.
I've recently chunked a rather large (for me!) single fst file of around 1 million rows using disk.frame. I loved the fact that fst already supports retrieval by row number, but with disk.frame I chunked my time series data by year and can now address a specific period quite easily. I'm sure the methodology varies greatly, but what are the fundamental differences in goals between disk.frame and fsttable?
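For reference, the kind of row-range read I mean looks roughly like this (file path and column names are just illustrative):

```r
library(fst)

# read only a slice of rows and a couple of columns from the single file;
# the rest of the file is never loaded into memory
slice <- read_fst("timeseries.fst",
                  columns = c("timestamp", "price"),
                  from = 250001, to = 500000)
```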
Hi @phillc73, I tried to answer your question on the differences between fsttable and disk.frame here, hope that helps!