Subset by rownames

Open: schelhorn opened this issue 7 years ago • 1 comment

Hello there,

awesome package! It appears that fst currently does not support character rownames, though. For practical and compatibility reasons it would be nice to be able to subset by rownames as well as by column names (the latter of which is already implemented, of course).

For my own use I added a simple fix: an extra column with a "magic" name stores the rownames at serialization, and a wrapper function deserializes that magic column first, translates character row name queries to integer row indexes, takes their min/max, and lets fst deserialize only the selected columns and rows (or, when the wanted rows are not adjacent, the row block that spans them, unwanted rows included).
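A minimal sketch of such a wrapper, for illustration only. The names `ROWNAME_COL`, `write_fst_rownames` and `read_fst_rownames` are made up here and are not part of fst; the sketch relies only on `write_fst()` and on the `columns`, `from` and `to` arguments of `read_fst()`, and assumes the requested row names are unique:

```r
library(fst)

# Hypothetical name of the "magic" column that holds the row names.
ROWNAME_COL <- ".rownames"

# Serialize a data frame, storing its row names in the extra magic column.
write_fst_rownames <- function(df, path) {
  stored <- df
  stored[[ROWNAME_COL]] <- rownames(df)
  rownames(stored) <- NULL
  write_fst(stored, path)
  invisible(path)
}

# Read rows selected by name (assumed unique) and, optionally, selected columns.
read_fst_rownames <- function(path, rows, columns = NULL) {
  # 1. Deserialize only the magic column.
  all_names <- read_fst(path, columns = ROWNAME_COL)[[ROWNAME_COL]]

  # 2. Translate character row names to integer row indexes.
  idx <- match(rows, all_names)
  if (anyNA(idx)) {
    stop("unknown row names: ", paste(rows[is.na(idx)], collapse = ", "))
  }

  # 3. Read the contiguous row block spanning all requested rows.
  from <- min(idx)
  block <- read_fst(path, columns = columns, from = from, to = max(idx))
  block[[ROWNAME_COL]] <- NULL  # only present when all columns were read

  # 4. Keep the wanted rows in the requested order and restore row names.
  out <- block[idx - from + 1L, , drop = FALSE]
  rownames(out) <- rows
  out
}

# Usage
df <- data.frame(x = 1:5, y = letters[1:5], row.names = paste0("r", 1:5))
path <- tempfile(fileext = ".fst")
write_fst_rownames(df, path)
read_fst_rownames(path, rows = c("r4", "r2"), columns = "x")
```

The block read in step 3 covers rows 2 to 4 in this example, so one unwanted row is deserialized and then dropped in step 4. The further apart the requested rows are, the more of the file is read.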

That approach is very fast since the single magic column can be deserialized and subset very quickly. Still, it's kind of an ugly bolt-on to an elegant package. Would it be possible to provide support for rownames in the fst core and metadata directly instead?

schelhorn, Jul 22 '18 15:07

Hi @schelhorn, thanks for your kind words and submitting your request!

In general, serializers in R take different approaches to the treatment of row names:

data.table::fwrite(df, "1.csv")  # no rownames
feather::write_feather(df, "1.fea")  # no rownames
utils::write.csv2(df, "2.csv")  # rownames
readr::write_csv(df, "3.csv")  # no rownames

The default data.table, feather and readr approaches are to disregard the row names if present (and the readr documentation explicitly says that write_csv never writes row names). The philosophy behind that is nicely put in krlmlr's comment here. There is no real advantage to using row names instead of key columns. Also, subsetting data by character row names is slow compared to subsetting by other column types.
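As a rough illustration of the key-column alternative, here is a base R sketch. The column name `id` is an arbitrary choice for this example, not a convention of any of the packages above:

```r
df <- data.frame(x = 1:3, y = c(2.5, 3.5, 4.5), row.names = c("a", "b", "c"))

# Move the row names into an ordinary key column.
keyed <- cbind(id = rownames(df), df, stringsAsFactors = FALSE)
rownames(keyed) <- NULL

# Subset by key instead of by row name.
subset_ac <- keyed[keyed$id %in% c("a", "c"), ]

# Restore the row names when a downstream function needs them.
restored <- keyed[, setdiff(names(keyed), "id"), drop = FALSE]
rownames(restored) <- keyed$id
```

The keyed table can be written with any of the serializers above without losing information, since the identifiers travel as a regular column.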

I understand your point that when you already have a data structure with row names, it takes extra effort to store (and retrieve) them. That effort could be incorporated in the fst package by allowing an argument row_names = TRUE. The downside is that not all table structures in other (non-R) languages support the concept of row names, so it might lead to compatibility problems in the future.

thanks!

MarcusKlik, Jul 22 '18 20:07