fst
Convert a csv file directly to a fst file
The conversion needs very little memory, because we can use the rbind functionality of fst to append chunks from the csv file one at a time. The resulting fst file would have random row and column access and could be used to perform calculations on data sets that are too big to fit into memory.
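To make the chunked approach concrete, here is a minimal base R sketch of the reading side only. It assumes a plain comma-separated file with a header row and no line breaks inside quoted fields. The names `process_csv_chunks` and `on_chunk` are hypothetical, and the appending step is left to a callback because the fst rbind feature is what this issue proposes, not something shown here.

```r
# Read a CSV in fixed-size chunks and hand each chunk to a callback.
# `process_csv_chunks` and `on_chunk` are hypothetical names for illustration.
# Assumptions: comma separator, one header row, no embedded newlines in fields.
process_csv_chunks <- function(csv_path, chunk_rows, on_chunk) {
  con <- file(csv_path, open = "r")
  on.exit(close(con))

  # Parse the header once so every chunk gets the same column names.
  header <- readLines(con, n = 1)
  col_names <- scan(text = header, what = "", sep = ",", quiet = TRUE)

  chunk_index <- 0L
  total_rows <- 0L
  repeat {
    # Only `chunk_rows` lines are held in memory at any time.
    lines <- readLines(con, n = chunk_rows)
    if (length(lines) == 0) break

    chunk <- read.csv(text = lines, header = FALSE, col.names = col_names,
                      stringsAsFactors = FALSE)
    chunk_index <- chunk_index + 1L
    total_rows <- total_rows + nrow(chunk)

    # The callback decides what to do with the chunk (e.g. write it out).
    on_chunk(chunk, chunk_index)
  }
  invisible(total_rows)
}
```

Until appending is available, one option would be a callback that writes each chunk to its own file, for example `function(chunk, i) fst::write_fst(chunk, sprintf("part_%04d.fst", i))`. Note that column types are guessed per chunk here, so a real converter would need to fix the types up front.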
@MarcusKlik Would it also be easy to add the ability to convert a .RData file directly to a fst file without using a large memory footprint?
That's an interesting idea. Of course, .rdata files can contain any number of objects, whereas files created with saveRDS contain only a single object, so those would be a better candidate. But that single object could be anything, not just a data.frame (or similar). Suppose you have a serialized data.frame: to avoid loading the whole .rds file into memory (and being back at square one), we would need to be able to selectively read single columns from that rds file. I see no option for that in R's unserialize method. I think it's only possible to read a complete object, at least from R.
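A small base R sketch of the distinction above, using made-up objects and temporary files. It only shows which objects would qualify for conversion. Both `load` and `readRDS` still read the complete file into memory, which is exactly the limitation described.

```r
# Illustrative only: the objects and file names are made up for this sketch.
df <- data.frame(id = 1:3, value = c(0.5, 1.5, 2.5))
other <- list(note = "not a data.frame")

rdata_path <- tempfile(fileext = ".RData")
rds_path <- tempfile(fileext = ".rds")

# save() can store any number of named objects in one file.
save(df, other, file = rdata_path)
# saveRDS() stores exactly one object, without its name.
saveRDS(df, file = rds_path)

# Load the .RData file into a separate environment to inspect its contents.
env <- new.env()
rdata_objects <- load(rdata_path, envir = env)

# Only data.frame objects are candidates for conversion to fst.
is_df <- vapply(rdata_objects, function(nm) is.data.frame(env[[nm]]), logical(1))
rdata_candidates <- rdata_objects[is_df]

# An .rds file holds a single object, but its type still has to be checked.
obj <- readRDS(rds_path)
rds_is_candidate <- is.data.frame(obj)
```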
In R's source code for unserialize (line 1730), you can see that 'list-types' (like a data.frame) get de-serialized per column. So in theory it should be possible to adapt this code to read an rds file 'per column' and convert that to fst.
Would there be a use case for converting rds files (or single-object rdata files)?
I forgot that .rdata and .rds files can contain all types of objects.
If the .fst format is really stable, I guess there is no very compelling use case for a conversion. But currently I save both the .RData and the .fst files for my data, in case the .fst file crashes or doesn't work, or the .fst file format changes in the future and I need to create a new .fst file. However, as the fst package gets more mature and stable, the need to convert .RData to fst will become less and less.
Yes, I completely understand. Unfortunately some more changes to the format are necessary to allow for multiple chunks (rbind) and adding columns (cbind), but also for storing key indexes and (custom) attributes in the fst file (all new features compared to the current version). The next milestone will have all these changes, and after that fst will keep supporting previous releases. To make that work, more detailed versioning information was added to the format. So after the next release, fst will be backwards compatible with 'older' fst files, and that will be tested with each release. That will make fst more mature and stable and hopefully convince you to erase your backups eventually :-).
Thanks, backward compatibility will help a lot.
Hi there, has this feature been introduced yet? Thanks a lot!