FST matrix format

Open asgr opened this issue 6 years ago • 5 comments

This might be a simple or complex feature request. Basically, from my testing FST seems to be a great way of storing very large (many GB) 2D images and then selecting cutout regions on the fly (we need to do this all the time in astronomy).

The obvious overhead here is explicitly converting the image matrix to a data.frame and back. This doesn't take long in general, but it might be an easy change to add an option to return a matrix rather than a table. There might even be further gains from knowing that there is no column metadata and only one data type for the whole file.

Cheers,

Aaron

asgr, Sep 11 '19 04:09

Hi @asgr, thanks for your feature request!

Your issue is related to #154 and #19: for serializing a complete matrix you can use the code posted there. The idea is that, because a matrix is internally just a vector with dimensions attached, the data can be stored in its original order as one big single column.
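To make the single-column idea concrete, here is a minimal sketch in plain Python rather than R. It is not the code from #154 or #19; the helper names `flatten_column_major` and `restore_matrix` are hypothetical. It only illustrates that a matrix stored column by column (the order R uses) can be written as one long vector and rebuilt from that vector plus its dimensions.

```python
def flatten_column_major(matrix):
    """Flatten a list-of-rows matrix into one vector, column by column.

    Hypothetical helper: mimics the order in which R keeps matrix
    elements in memory, so the result could be stored as one column.
    """
    nrow, ncol = len(matrix), len(matrix[0])
    vector = [matrix[r][c] for c in range(ncol) for r in range(nrow)]
    return vector, (nrow, ncol)


def restore_matrix(vector, dims):
    """Rebuild the list-of-rows matrix from the flat vector and its dims."""
    nrow, ncol = dims
    # Element (r, c) sits at offset c * nrow + r in column-major order.
    return [[vector[c * nrow + r] for c in range(ncol)] for r in range(nrow)]


image = [[1, 2, 3],
         [4, 5, 6]]
vector, dims = flatten_column_major(image)
restored = restore_matrix(vector, dims)
```

The dimensions have to be kept somewhere alongside the single column, since the flat vector alone does not say where one matrix column ends and the next begins.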

For random access to a matrix stored in an fst file, things get a little more complicated. To maintain performance, the data should be stored in square blocks (say of 1024 x 1024 elements). That way, we don't need to define a column for each matrix column (the fst format stores data per column, so having many columns can be expensive). That feature is not implemented yet.
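As a rough illustration of the block idea (which, as noted, fst does not implement), the sketch below works out which square tiles a cutout would touch. The function name `blocks_for_cutout`, the 0-based half-open ranges, and the tile numbering are all assumptions made for this example, not part of any fst API.

```python
def blocks_for_cutout(row_start, row_stop, col_start, col_stop, block=1024):
    """Return (block_row, block_col) indices of tiles overlapping a cutout.

    Hypothetical helper. The cutout is given as 0-based, half-open
    ranges [row_start, row_stop) and [col_start, col_stop).
    """
    first_br, last_br = row_start // block, (row_stop - 1) // block
    first_bc, last_bc = col_start // block, (col_stop - 1) // block
    return [(br, bc)
            for bc in range(first_bc, last_bc + 1)
            for br in range(first_br, last_br + 1)]


# A 2000 x 2000 cutout that straddles tile boundaries in both directions.
tiles = blocks_for_cutout(1000, 3000, 500, 2500)
```

With this layout the number of tiles read depends only on the size and position of the cutout, not on the width of the full image.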

I hope the code is useful to you, at least until the fst API gets updated with matrix capabilities!

MarcusKlik, Sep 11 '19 10:09

Of course, if your cutout regions are relatively small, you could still retrieve them by calculating which parts of the 1D stored column data you need (so a small upgrade of the posted code sample)...
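The calculation could look roughly like the following, sketched in plain Python rather than R. The helper `cutout_row_ranges` is hypothetical and assumes the matrix was written column by column into a single fst column. It returns 1-based inclusive row ranges, which is the convention used by the `from` and `to` arguments of `read_fst`.

```python
def cutout_row_ranges(nrow, row_start, row_stop, col_start, col_stop):
    """Return one 1-based inclusive (from, to) range per matrix column.

    Hypothetical helper. The cutout is given as 0-based, half-open
    ranges; nrow is the number of rows of the full matrix.
    """
    return [(c * nrow + row_start + 1, c * nrow + row_stop)
            for c in range(col_start, col_stop)]


# Rows 100..2099 of the first three columns of a 30k x 30k image.
ranges = cutout_row_ranges(nrow=30000, row_start=100, row_stop=2100,
                           col_start=0, col_stop=3)
```

Each range is one contiguous read, so a cutout that is 2k columns wide would need about 2000 separate reads; whether that is fast enough in practice would have to be measured.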

MarcusKlik, Sep 11 '19 10:09

Interesting. I’ve been saving a 12k x 12k image by converting it to a data.frame and saving that. Even very large subsets are still very rapid to access (compared to other available tools). A real-world use case for us is having 30k x 30k images on which we want full random access to subsets of up to around 2k x 2k. Both of these matrices are pretty much always square. I guess my comment is really that it is already fast enough (great if it can be faster!), but until things get overhauled, would it be possible for the data to be optionally returned as a matrix without row or column names rather than only as a data.frame? It would just remove the overhead of doing the final conversion back to a matrix (which is often the slowest bit!)

asgr, Sep 11 '19 10:09

@asgr a workaround could be the Kmisc package, which offers fast conversion from a data.frame to a matrix and vice versa (functions mat2df and df2mat).

ChristK, Sep 11 '19 11:09

It's not on CRAN though, so not an ideal solution for portable code.

asgr, Oct 22 '19 07:10