
Fill a data.table range with specific rows from read.fst

Open MarcusKlik opened this issue 8 years ago • 0 comments

With this feature you can populate, say, rows 1001:2000 of a 1e6-row data.table with 1000 rows read using read.fst, entirely in memory. This is very useful for combining data from multiple (fst) sources into a single result table without the overhead of intermediate copies. For example, when performing the merge sort algorithm on a set of data files, you need to:

  1. read first x rows from all files
  2. sort the resulting table
  3. write some rows to disk
  4. read next x rows from the file with the smallest first chunk
  5. sort resulting table
  6. go to step 3
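
The steps above can be sketched roughly as follows. This is only an illustration under several assumptions, not an existing fst feature: `merge_sorted_fst` is a hypothetical helper, every input file is assumed to be already sorted on a numeric key column without NAs, and the output is written as a series of chunk files. It relies on `read_fst()` (the current name of `read.fst`) with its `from`/`to` arguments, on `metadata_fst()` exposing the row count as `nrOfRows`, and on data.table's `setorderv()`. Instead of tracking the "smallest first chunk" literally, it refills the file whose last loaded key is smallest, since rows up to that key are safe to write.

```r
library(data.table)
library(fst)

# Hypothetical helper: merge fst files that are each sorted by `key` into
# sorted output chunks, reading at most `chunk_rows` rows per file at a time.
merge_sorted_fst <- function(paths, key, chunk_rows, out_dir) {
  n_rows   <- vapply(paths, function(p) as.numeric(metadata_fst(p)$nrOfRows),
                     numeric(1))
  next_row <- rep(1, length(paths))     # next unread row in each file
  last_key <- rep(-Inf, length(paths))  # largest key loaded so far per file
  buffer    <- NULL                     # rows read but not yet written
  out_files <- character(0)

  repeat {
    threshold <- min(last_key)
    if (threshold == Inf) break         # every file has been fully consumed

    # Steps 1 and 4: refill the file(s) whose loaded rows were all written.
    for (i in which(last_key == threshold)) {
      if (next_row[i] > n_rows[i]) {
        last_key[i] <- Inf              # mark file as exhausted
      } else {
        to    <- min(next_row[i] + chunk_rows - 1, n_rows[i])
        chunk <- read_fst(paths[i], from = next_row[i], to = to,
                          as.data.table = TRUE)
        buffer      <- rbindlist(list(buffer, chunk))
        last_key[i] <- chunk[[key]][nrow(chunk)]
        next_row[i] <- to + 1
      }
    }
    if (is.null(buffer) || nrow(buffer) == 0L) next

    # Steps 2 and 5: sort the in-memory buffer by reference.
    setorderv(buffer, key)

    # Step 3: rows up to the smallest "last loaded key" cannot be preceded
    # by any unread row, so they can be written out.
    n_ready <- sum(buffer[[key]] <= min(last_key))
    if (n_ready > 0) {
      out_path <- file.path(out_dir,
                            sprintf("sorted_%05d.fst", length(out_files) + 1))
      write_fst(buffer[seq_len(n_ready)], out_path)
      out_files <- c(out_files, out_path)
      buffer    <- buffer[-seq_len(n_ready)]
    }
  }
  out_files
}

# Small synthetic demo: three sorted files of 50 rows, merged 10 rows at a time.
demo_dir <- tempfile("fstmerge")
dir.create(demo_dir)
set.seed(1)
inputs <- vapply(1:3, function(i) {
  p <- file.path(demo_dir, sprintf("in_%d.fst", i))
  write_fst(data.table(k = sort(runif(50)), src = i), p)
  p
}, character(1))
outputs <- merge_sorted_fst(inputs, "k", 10, demo_dir)
merged  <- rbindlist(lapply(outputs, read_fst, as.data.table = TRUE))
```

Note that `rbindlist()` allocates a new buffer on every refill, which is exactly the copy that the proposed range fill would avoid.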

This can be performed efficiently in R by using data.table's fast sorting and populating the result table in memory. With such an algorithm operating on a collection of fst files, we basically have a method of sorting arbitrarily large fst files without running out of memory (and it can be done with multiple threads!).
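
Until something like this exists in fst itself, populating a range of a preallocated table can be approximated with `data.table::set()`, which assigns by reference. The sketch below is an approximation under assumptions: `fill_range_from_fst` is a hypothetical helper name, the sizes are scaled down from the 1e6-row case, and the chunk is still materialized as a temporary before being copied into place, which is the overhead the requested feature would remove.

```r
library(data.table)
library(fst)

# Hypothetical helper: copy rows `from:to` of an fst file into rows
# `target_rows` of an existing data.table, modifying it by reference.
fill_range_from_fst <- function(dt, target_rows, path, from, to) {
  chunk <- read_fst(path, from = from, to = to, as.data.table = TRUE)
  stopifnot(nrow(chunk) == length(target_rows))
  for (col in names(chunk)) {
    # set() overwrites the selected rows in place; the other rows of `dt`
    # are not copied
    set(dt, i = target_rows, j = col, value = chunk[[col]])
  }
  invisible(dt)
}

# Small synthetic source file
src_path <- tempfile(fileext = ".fst")
write_fst(data.frame(id = 1:5000, value = as.numeric(1:5000) / 2), src_path)

# Preallocated result table; fill rows 1001:2000 with the first 1000 file rows
result <- data.table(id = integer(10000), value = numeric(10000))
fill_range_from_fst(result, 1001:2000, src_path, from = 1, to = 1000)
```

The column types of the preallocated table have to match those stored in the file, otherwise `set()` will coerce the values or raise an error.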

MarcusKlik, Feb 23 '17 20:02