Mark Klik
When a sorted data set is stored as a `fst` binary file, sorting metadata is stored alongside the data. Using this metadata, a binary search can be performed on the...
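The idea of locating rows by binary search in sorted data can be sketched in C++ as follows. This is a minimal in-memory illustration under stated assumptions, not `fst` code: `find_row_range` is a hypothetical name, the key column is assumed to be a single ascending integer vector, and a real implementation would read key values from the file rather than from a vector.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper (not part of the fst API): returns the half-open row
// range [first, last) of rows whose key equals `key`. Assumes `sorted_keys`
// is sorted in ascending order, which is what the sorting metadata would
// guarantee.
std::pair<std::size_t, std::size_t> find_row_range(
    const std::vector<int>& sorted_keys, int key)
{
  // Debug-only check of the precondition; compiled out with NDEBUG.
  assert(std::is_sorted(sorted_keys.begin(), sorted_keys.end()));

  // std::equal_range performs two binary searches (lower and upper bound),
  // so the cost is O(log n) comparisons instead of a full scan.
  const auto range =
      std::equal_range(sorted_keys.begin(), sorted_keys.end(), key);

  return {static_cast<std::size_t>(range.first - sorted_keys.begin()),
          static_cast<std::size_t>(range.second - sorted_keys.begin())};
}
```

If the key is absent, both ends of the returned range are equal, so the caller can tell that no rows need to be read.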
That would involve creating a `fst` file-connection object (similar to the base R `file` function). With that object, data can be streamed row-by-row until the file is depleted (or the connection is...
`R` 3.5.0 brings some features from the `ALTREP` framework. One of those features is that the actual vector data can be stored in an alternative structure or location. Such a...
With this feature you can populate, say, rows 1001:2000 of a 1e6-row `data.table` with 1000 rows read from `fst.read`. All of this is done in memory. This feature is...
Provide fast compression with random access to the matrix. Check whether there is a use case for such a feature.
Boost allows for the creation of memory shared between processes (see [here](http://www.boost.org/doc/libs/1_55_0/doc/html/interprocess/sharedmemorybetweenprocesses.html)). For `fst` that could mean that a single in-memory `fst` table can be shared between different processes. First...
Using code from the `microbenchmark` package directly from C++. See for example [this code](https://github.com/cran/microbenchmark/blob/master/src/nanotimer.c) for cross-platform timers.
Timing measurements with microsecond accuracy are needed to analyse the performance of OpenMP parallel constructs in the core code of `fst`. We need to determine the speedup due to parallel...
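A minimal sketch of such a timing helper is shown below. It uses the standard `std::chrono::steady_clock` rather than the platform-specific timers from `microbenchmark`, so it is a stand-in for illustration only; the name `mean_nanoseconds` is hypothetical, and the actual clock resolution depends on the platform.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper: runs `work` `repetitions` times and returns the mean
// elapsed wall-clock time per run in nanoseconds. Averaging over several
// runs reduces the influence of timer overhead on short measurements.
template <typename F>
double mean_nanoseconds(F&& work, int repetitions)
{
  assert(repetitions > 0);

  // steady_clock is monotonic, so system clock adjustments cannot
  // produce negative or distorted intervals.
  using clock = std::chrono::steady_clock;

  const auto start = clock::now();
  for (int i = 0; i < repetitions; ++i) {
    work();
  }
  const auto stop = clock::now();

  const auto total_ns =
      std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start)
          .count();
  return static_cast<double>(total_ns) / repetitions;
}
```

Timing the same routine with different thread counts and dividing the single-thread result by the multi-thread result gives an estimate of the speedup.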
When reading a `fst` file using multiple cores, the slowest operation is the creation of character vectors (and, to some extent, factors). That's because R uses a global string...
Processing character columns is by far the slowest of all data types. For character columns (that are not completely random) we can solve this problem by first converting the vector...
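One way to picture this kind of conversion is dictionary encoding, where each distinct string is stored once and every row holds only an integer code. The sketch below assumes that a factor-like representation is what is meant here. `EncodedColumn` and `encode_strings` are hypothetical names, and unlike an R factor the codes are 0-based and the levels are kept in order of first appearance rather than sorted.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical factor-like representation of a character column.
struct EncodedColumn {
  std::vector<std::string> levels;  // unique strings, in order of first appearance
  std::vector<std::int32_t> codes;  // 0-based index into `levels`, one per row
};

// Hypothetical helper: dictionary-encodes a string column so that repeated
// values are stored only once. This pays off when the column is not
// completely random, i.e. when values repeat.
EncodedColumn encode_strings(const std::vector<std::string>& values)
{
  EncodedColumn out;
  out.codes.reserve(values.size());

  // Maps each distinct string to its position in `out.levels`.
  std::unordered_map<std::string, std::int32_t> index;

  for (const auto& value : values) {
    auto it = index.find(value);
    if (it == index.end()) {
      // First occurrence: register a new level.
      it = index.emplace(value,
                         static_cast<std::int32_t>(out.levels.size())).first;
      out.levels.push_back(value);
    }
    out.codes.push_back(it->second);
  }

  assert(out.codes.size() == values.size());
  return out;
}
```

With this layout only the `levels` vector has to be turned into strings, while the `codes` vector can be handled like any other integer column.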