
read_fst should automatically read as data.table if written from data.table

Open renkun-ken opened this issue 7 years ago • 7 comments

I'm not sure whether it makes sense for read_fst to automatically read a file as a data.table if it was written from a data.table. I would certainly love this behavior, for example via something like as.data.table = "auto".

renkun-ken avatar Mar 18 '18 16:03 renkun-ken

Sounds interesting. The only downside I can think of is that it ties fst to R a bit more.

xiaodaigh avatar Mar 19 '18 07:03 xiaodaigh

Hi @renkun-ken, thanks for your feature request and yes I see how that would be a useful feature.

It's related to the discussion in issue #120. Perhaps a parameter return_type with possible values data.frame, data.table, tibble and auto would be most convenient.
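A minimal sketch of how such a parameter could behave, written as a wrapper around the existing read_fst rather than as anything fst itself provides. The names read_fst_as and as_return_type are hypothetical, and auto is left out because it would require the writer's class to be stored in the file, which this sketch does not assume.

```r
# Hypothetical helper: convert a plain data.frame to the requested container.
as_return_type <- function(df, return_type = c("data.frame", "data.table", "tibble")) {
  return_type <- match.arg(return_type)  # errors on anything else, e.g. "auto"
  switch(return_type,
    data.frame = df,
    data.table = data.table::as.data.table(df),  # needs the data.table package
    tibble     = tibble::as_tibble(df)           # needs the tibble package
  )
}

# Hypothetical wrapper: read the file as a data.frame, then convert explicitly.
read_fst_as <- function(path, return_type = "data.frame", ...) {
  as_return_type(fst::read_fst(path, ...), return_type)
}
```

With this shape the reader, not the writer, decides which container comes back, for example read_fst_as("data.fst", return_type = "tibble").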

The thing is that when you use fst files for caching data or to store your own datasets, the feature makes a lot of sense. But if fst files are used for central storage of data or to transport data from one system to the next (R to Python, for example), the feature is less obvious. Your colleague might prefer to work with the tibble container where you prefer to work with the data.table container object. Should your colleague override your custom settings (more like rds) to get what they want, or should you specify how you want to load your data from a format that is agnostic to such distinctions (more like csv)?

MarcusKlik avatar Mar 19 '18 12:03 MarcusKlik

The return_type idea looks nice. I believe this solves the problem for both sides.

renkun-ken avatar Mar 19 '18 13:03 renkun-ken

The downside of auto is obvious: the subsequent data operations are very different depending on whether the object is a data.frame, data.table, or tibble. If the author of the code is not the writer of the fst data file, the author will not know exactly what type of object it is until the file is read. In this case, if auto is used by default, the code is exposed to the risk that the writer of the data changes their mind, e.g. switches from data.frame to data.table or tibble, and the reader's code breaks silently if the code author is not aware of that change in the data file. Therefore, a responsible reader will almost surely specify a certain type in read_fst, thus avoiding auto, to ensure the subsequent operations work smoothly. This scenario makes auto less useful and quite risky.
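To illustrate the kind of silent breakage meant here, the sketch below runs the same indexing expression on a data.frame and, if the data.table package happens to be installed, on a data.table. The column names are made up for the example.

```r
df <- data.frame(x = 1:3, y = c("a", "b", "c"))

first_col <- df[, 1]           # data.frame: a single column drops to a vector
is.data.frame(first_col)       # FALSE

# The same expression on a data.table keeps the table structure.
if (requireNamespace("data.table", quietly = TRUE)) {
  dt <- data.table::as.data.table(df)
  is.data.frame(dt[, 1])       # TRUE: a one-column data.table
}
```

Code that goes on to call, say, sum(first_col) works in the first case and fails or behaves differently in the second, without any change on the reader's side.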

renkun-ken avatar Mar 22 '18 02:03 renkun-ken

Hi @renkun-ken, thanks for sharing your thoughts. I agree that using auto is risky, because the fst source can be changed or updated. The same applies when we let the user set the default behavior of the return_type parameter with an option: a specific piece of code might run differently depending on the user options that were selected, and code might break.

So, as you say, it's probably best to offer options tibble, data.frame and data.table for parameter return_type, but let go of auto.

Thanks for sorting that out.

MarcusKlik avatar Mar 23 '18 22:03 MarcusKlik

I would suggest that return_type can be defined by an option(), so it is set once and applied to every call to read.fst.
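A rough sketch of how such a default could be looked up, assuming a hypothetical option name fst.return_type (not an option fst is known to define) and a hypothetical helper default_return_type.

```r
# Hypothetical option name and helper, for illustration only.
default_return_type <- function() {
  getOption("fst.return_type", default = "data.frame")
}

options(fst.return_type = "data.table")  # set once, e.g. in .Rprofile
default_return_type()

options(fst.return_type = NULL)          # unset: falls back to "data.frame"
default_return_type()
```

A reading function could then use default_return_type() as the default value of its return_type argument, so an explicit argument still wins over the option.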

iagomosqueira avatar Aug 28 '19 09:08 iagomosqueira

Hi @iagomosqueira, yes, good idea. But it's probably best to warn the user about such a setting at package startup time. Otherwise, the option might enforce a specific table type (e.g. data.table) that might break code that depends on the default data.frame type...
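One way such a startup warning could look, sketched with R's standard .onAttach hook and packageStartupMessage. The option name fst.return_type is again hypothetical, and this is not how fst currently behaves.

```r
# Sketch of a package attach hook that reports a non-default table type.
.onAttach <- function(libname, pkgname) {
  rt <- getOption("fst.return_type")  # hypothetical option name
  if (!is.null(rt) && !identical(rt, "data.frame")) {
    packageStartupMessage(
      "Option 'fst.return_type' is set to '", rt,
      "'; read results will not be plain data.frames."
    )
  }
}
```

Using packageStartupMessage rather than warning lets users silence the notice with suppressPackageStartupMessages if they set the option deliberately.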

MarcusKlik avatar Aug 29 '19 10:08 MarcusKlik