High RAM usage with the MERSCOPE reader
Writing MERSCOPE data may use a lot of RAM. This is because the .tif image chunks are not detected when using imread from dask_image, so the full image has to be loaded into memory during image writing.
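To illustrate, here is a minimal sketch (the mosaic filename is hypothetical) of how one can inspect the chunking that dask_image's imread produces: the returned dask array has a single chunk per image plane, so any downstream computation materializes the whole full-resolution plane at once.

```python
# Hypothetical MERSCOPE mosaic; replace with a real path.
from dask_image.imread import imread

image = imread("mosaic_DAPI_z0.tif")

# imread creates one chunk per frame, ignoring the TIFF's internal tiling,
# so a single-page mosaic ends up as one chunk spanning the entire plane.
print(image.shape)   # e.g. (1, <height>, <width>)
print(image.chunks)  # one chunk covering the whole (y, x) plane
```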
This issue has been reported by a Sopa user here, who ran into it with a 2 TB image.
I don't really understand why dask_image is not considering the chunks, and I didn't find an obvious way to detect them (but maybe there is one?). So I tried a bunch of different libraries, and it seems that rioxarray was able to fix this issue. I implemented this new function in Sopa to use rioxarray; it seems to work well, but I think it needs more tests.
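A minimal sketch of the rioxarray-based approach described above (the filename is again hypothetical): passing chunks=True asks rioxarray to chunk the dask array along the TIFF's internal tile boundaries, so writing can stream tiles instead of loading whole planes.

```python
import rioxarray

# chunks=True lets the rasterio backend pick chunks from the file's
# internal block structure rather than one chunk per plane.
image = rioxarray.open_rasterio("mosaic_DAPI_z0.tif", chunks=True)

print(image.chunks)  # chunks now follow the file's internal tiling
```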
If the reader seems convincing after more tests on Sopa, maybe we can consider adding it to spatialdata-io? There is of course the drawback that it brings a new dependency...
Hi Quentin, thanks for reporting this. I would suggest (if you haven't already) double-checking this in the dask-image repository as well.
If the issue can't be fixed upstream in dask-image, and if rioxarray (it also uses Dask, right?) doesn't create dependency problems, I would be in favor of adding it. But if we can manage with one less dependency, I'd prefer that instead.
Hello @LucaMarconato, yes, rioxarray also uses Dask. I'll try to see if I can obtain the same chunks with dask-image only; maybe looking at the source code of rioxarray could help. Indeed, I would also be in favor of not adding rioxarray if that's possible!
This is now addressed by https://github.com/scverse/spatialdata-io/pull/152, which gives the option to use rioxarray as a backend.
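A hedged usage sketch of the option introduced by that PR; the backend argument name follows the linked PR, but check the current spatialdata-io API for the exact signature.

```python
from spatialdata_io import merscope

# Opt into the rioxarray backend when reading a MERSCOPE output folder.
sdata = merscope("path/to/merscope_output", backend="rioxarray")

# With tile-aligned chunks, writing no longer loads full image planes.
sdata.write("merscope.zarr")
```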