Option stringsAsFactors = FALSE

Open pmakai opened this issue 8 years ago • 1 comments

The current version reads factor columns into R as factors, even if I select only 1000 rows. That means that ALL factor levels are read in, which is extremely memory inefficient.

A stringsAsFactors option might be beneficial.

pmakai avatar Aug 10 '17 10:08 pmakai

Hi @pmakai, thanks for reporting your issue. Indeed, when a factor column has e.g. 1e6 levels and only 1e3 rows are actually read, all factor levels have to be imported. So in effect, a large read is done for a small subset of rows.

For those cases, adding a parameter stringsAsFactors (or factorsAsStrings really) to fstread will be useful.

The downside is that for a column that has been serialized as a factor, fst writes the levels and the actual values to disk in separate blocks. That means that given a subset of a stored table, fst has to 'search' the stored levels for the levels that correspond to the actual values. We can make that search memory efficient, but for randomly ordered data, it still requires reading most of the level data, so it will not be very performant. In the example above, there is a high probability that nearly all 1e6 levels have to be read to look up the 1e3 values selected.
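
To make the cost of that search concrete, here is a minimal Python sketch (not fst's actual implementation). It assumes the level block is stored sequentially, so that resolving level k means scanning levels 1 to k; the helper names `levels_prefix_needed` and `factor_subset_as_strings` are hypothetical. It draws 1e3 random rows from a factor with 1e6 levels and reports how much of the level block would have to be scanned.

```python
import random

def levels_prefix_needed(codes, n_levels):
    """Fraction of a sequentially stored level block that must be scanned
    to resolve `codes` (1-based level indices, as in an R factor).
    Assumption: reaching level k requires scanning levels 1..k."""
    return max(codes) / n_levels

def factor_subset_as_strings(levels, codes):
    """Resolve a subset of factor codes to strings while keeping only
    the referenced levels in the lookup table (memory efficient)."""
    needed = {c: levels[c - 1] for c in set(codes)}
    return [needed[c] for c in codes]

random.seed(1)
n_levels, n_rows = 1_000_000, 1_000
# Randomly ordered data: each selected row refers to an arbitrary level.
codes = [random.randint(1, n_levels) for _ in range(n_rows)]
fraction = levels_prefix_needed(codes, n_levels)

print(f"distinct levels needed: {len(set(codes))}")
print(f"fraction of level block scanned: {fraction:.4f}")
```

Under these assumptions only about 1e3 levels are kept in memory, but the largest selected level index is almost always close to 1e6, so nearly the whole level block is still read from disk.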

So, the user has a few options:

  • Write the table without factor columns: reading small subsets will be fast but the fst file may be (much) larger for character columns with large elements.
  • Write the table with factor columns and read a subset with option factorsAsStrings = TRUE. The fst file will probably be small, but due to the look-ups required, reading will be slow for small subsets (although it will be memory efficient).
  • Read subsets of the table that contain at least as many rows as there are levels. That way, reading will probably be as fast as for subsets of non-factor columns.
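
The size trade-off behind the first two options can be estimated with a rough back-of-the-envelope sketch. It is an illustration only: it ignores compression and any per-string overhead, assumes 4-byte integer codes (as in an R factor), and uses made-up data and hypothetical helper names.

```python
def factor_bytes(levels, codes, code_size=4):
    """Uncompressed size if each level is stored once plus one integer code per row."""
    return sum(len(s.encode("utf-8")) for s in levels) + code_size * len(codes)

def character_bytes(levels, codes):
    """Uncompressed size if every row stores its full string."""
    return sum(len(levels[c - 1].encode("utf-8")) for c in codes)

# Hypothetical column: 1,000 distinct 16-byte strings spread over 1e6 rows.
levels = [f"customer-{i:07d}" for i in range(1, 1001)]
codes = [(i % 1000) + 1 for i in range(1_000_000)]

print(factor_bytes(levels, codes))     # 4016000
print(character_bytes(levels, codes))  # 16000000
```

In this made-up case the factor representation is roughly a quarter of the size; the gap grows with longer strings and shrinks as the number of levels approaches the number of rows.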

I have to think this one through!

MarcusKlik avatar Aug 10 '17 11:08 MarcusKlik