spatialdata-io

High RAM usage with the MERSCOPE reader

Open quentinblampey opened this issue 1 year ago • 2 comments

Writing MERSCOPE data may use a lot of RAM. This is because the .tif image chunks are not detected when using imread from dask_image, so the full image has to be loaded into memory when the image is written.

This issue was reported by a user of Sopa here, who is running into problems with a 2 TB image.

I don't really understand why dask_image does not take the chunks into account, and I didn't find an obvious way to make it detect them (but maybe there is one?). So I tried a bunch of different libraries, and it seems that rioxarray fixes this issue. I implemented this new function in Sopa to use rioxarray; it seems to work well, but I think it needs more tests.

If the reader seems convincing after more tests on Sopa, maybe we can consider adding it to spatialdata-io? There is of course the drawback that it brings a new dependency...
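To make the memory argument concrete, here is a minimal sketch, not the actual Sopa or spatialdata-io code. The helper names (chunk_nbytes, largest_chunk_nbytes, open_tif_lazily) and the default chunk shape are assumptions made for illustration. The only third-party call is rioxarray.open_rasterio with its chunks argument, and it is imported lazily so the helpers also work without rioxarray installed.

```python
from math import prod


def chunk_nbytes(chunk_shape, itemsize):
    """Bytes needed to hold one chunk of the given shape in memory."""
    return prod(chunk_shape) * itemsize


def largest_chunk_nbytes(array):
    """Size in bytes of the biggest chunk of a Dask-backed array.

    `array.chunks` is expected to be a tuple of tuples, one per dimension,
    as exposed by dask arrays and by xarray objects backed by Dask.
    """
    biggest = tuple(max(sizes) for sizes in array.chunks)
    return chunk_nbytes(biggest, array.dtype.itemsize)


def open_tif_lazily(path, chunks=(1, 4096, 4096)):
    """Open a .tif as a lazy, tiled array (hypothetical helper).

    The chunk shape (band, y, x) is an illustrative default,
    not a value taken from this thread.
    """
    import rioxarray  # optional dependency, imported only when needed

    return rioxarray.open_rasterio(path, chunks=chunks)
```

With 16-bit pixels, a (1, 4096, 4096) chunk takes 32 MiB, whereas a single chunk spanning a whole mosaic would take as much memory as the image itself. Calling largest_chunk_nbytes on the array returned by each reader is a quick way to check which situation you are in.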

quentinblampey avatar Mar 13 '24 16:03 quentinblampey

Hi Quentin, thanks for reporting this. I would suggest (if you haven't already) double-checking this in the dask-image repository as well.

If the issue can't be fixed upstream in dask-image, and if rioxarray (it also uses Dask, right?) doesn't create dependency problems, I would be in favor of adding it. But if we can get by with one less dependency, I'd prefer that instead.

LucaMarconato avatar Mar 13 '24 20:03 LucaMarconato

Hello @LucaMarconato, yes, rioxarray also uses Dask. I'll try to see if I can obtain the same chunks with dask-image only; maybe looking at the source code of rioxarray could help. Indeed, I would also be in favor of not adding rioxarray if possible!

quentinblampey avatar Mar 14 '24 07:03 quentinblampey

This is now addressed by https://github.com/scverse/spatialdata-io/pull/152, which gives the option to use rioxarray as a backend.
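For readers wondering how an optional backend can avoid a hard dependency, here is a minimal sketch of the general pattern. It is not the code from the PR: the function name pick_backend, the SUPPORTED_BACKENDS tuple, and the is_installed hook are hypothetical, and the "prefer rioxarray when available" rule is an assumption made for illustration.

```python
from importlib.util import find_spec

# Hypothetical names for the two backends discussed in this thread.
SUPPORTED_BACKENDS = ("dask_image", "rioxarray")


def _is_installed(name):
    """True if a top-level module with this name can be imported."""
    return find_spec(name) is not None


def pick_backend(preferred=None, is_installed=_is_installed):
    """Choose an image-reading backend without requiring rioxarray.

    An explicit choice is validated and returned as-is. Otherwise the
    function falls back to rioxarray when it is available (assumed here
    to be the lower-memory option) and to dask_image when it is not.
    """
    if preferred is not None:
        if preferred not in SUPPORTED_BACKENDS:
            raise ValueError(f"Unknown backend: {preferred!r}")
        if not is_installed(preferred):
            raise ImportError(f"Backend {preferred!r} is not installed")
        return preferred
    return "rioxarray" if is_installed("rioxarray") else "dask_image"
```

The is_installed parameter exists only so the selection logic can be exercised without touching the real environment; in normal use it would be left at its default.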

LucaMarconato avatar May 28 '24 16:05 LucaMarconato