Feature request: write_element does not materialize computation in the in-memory SpatialData object
When using sdata.write_element() on a SpatialData object backed by disk, the Dask computation remains lazy in the in-memory SpatialData object.
So each time we call sdata.write_element(...), we need to call spatialdata.read_zarr(sdata.path) to get the materialized elements back.
See example:
import os
from tempfile import gettempdir
import dask.array as da
from spatialdata import SpatialData, read_zarr
from spatialdata.models import Image2DModel
# 1) create a SpatialData object backed by a Zarr store
sdata = SpatialData()
arr = da.random.random((1, 1000, 2000), chunks=(1, 500, 500))
sdata["image"] = Image2DModel.parse(arr, dims=("c", "y", "x"))
sdata.write(os.path.join(gettempdir(), "sdata.zarr"))
# 2) read it back from the Zarr store
sdata = read_zarr(sdata.path)
# 3) do something with the data, and save it to a new slot
arr = sdata["image"].data * 2
sdata["image_mul"] = Image2DModel.parse(arr, dims=("c", "y", "x"))
# this triggers the computation, similar to .to_zarr(...)
sdata.write_element("image_mul")
for _, layer in sdata["image_mul"].data.__dask_graph__().layers.items():
    if not layer.is_materialized():
        # layer 'mul-...' is not materialized, i.e. the computation is still lazy
        # in the in-memory sdata object (same behaviour as dask.array.to_zarr)
        print(layer)
# if we load back from the Zarr store, layer 'mul-...' is indeed materialized
# (same behaviour as dask.array.from_zarr)
sdata = read_zarr(sdata.path)
for _, layer in sdata["image_mul"].data.__dask_graph__().layers.items():
    if not layer.is_materialized():
        print(layer)
Although this is in line with e.g. dask.array.to_zarr and dask.array.from_zarr, it would be a nice feature to have an extra parameter, e.g. sdata.write_element(..., inplace=True), which takes care of reloading from the Zarr store. Ideally this would only reload the element that was written to disk, not the complete SpatialData Zarr store, which can be slow when there are a lot of tables or shapes, as these are always loaded in memory.
Also see https://github.com/saeyslab/harpy/issues/90 for additional context.
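A minimal sketch of what the proposed call could look like; the inplace parameter is hypothetical and does not exist in the current API:
# hypothetical API, not implemented: write "image_mul" and replace the in-memory,
# still-lazy element with the materialized version read back from the Zarr store
sdata.write_element("image_mul", inplace=True)
# roughly equivalent to the manual workaround needed today:
# sdata.write_element("image_mul")
# sdata = read_zarr(sdata.path)  # reloads the complete store, which can be slow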
Thanks for reporting. I also endorse the use case. I wonder if a more explicit approach, where the user loads the object in memory completely before calling write_element(), should be preferred.
Loading the object in memory before calling write_element() indeed solves the problem: the user could trigger the computation with .persist(); see the example workflow below. But for large images you don't want to do this; instead you want to give Dask the option to write to the Zarr store without loading the image into memory.
import os
from tempfile import gettempdir
import dask.array as da
from spatialdata import SpatialData, read_zarr
from spatialdata.models import Image2DModel
# 1) create a SpatialData object backed by a Zarr store
sdata = SpatialData()
arr = da.random.random((1, 1000, 2000), chunks=(1, 500, 500))
sdata["image"] = Image2DModel.parse(arr, dims=("c", "y", "x"))
sdata.write(os.path.join(gettempdir(), "sdata.zarr"))
# 2) read it back from the Zarr store
sdata = read_zarr(sdata.path)
# 3) do something with the data, and save it to a new slot
arr = sdata["image"].data * 2
sdata["image_mul"] = Image2DModel.parse(arr, dims=("c", "y", "x"))
# trigger the computation via .persist() -> pulls "image_mul" into memory
sdata["image_mul"] = sdata["image_mul"].persist()
sdata.write_element("image_mul")
for _, layer in sdata["image_mul"].data.__dask_graph__().layers.items():
    if not layer.is_materialized():
        print(layer)  # layer 'mul-...' is materialized, so nothing is printed
I think pipelines using a SpatialData object backed by disk should ideally support idempotent functions, e.g. operation(sdata, ..., inplace=True), where the user does not have to worry about loading back from the Zarr store or about layers not being materialized.
I see this issue is already mentioned here https://github.com/scverse/spatialdata/issues/521.
I think reloading/materializing computation could be implemented as follows:
import os
from tempfile import gettempdir
import dask.array as da
from spatialdata import SpatialData, read_zarr
from spatialdata.models import Image2DModel
# 1) create a SpatialData object backed by a Zarr store
sdata = SpatialData()
arr = da.random.random((1, 1000, 2000), chunks=(1, 500, 500))
sdata["image"] = Image2DModel.parse(arr, dims=("c", "y", "x"))
sdata.write(os.path.join(gettempdir(), "sdata.zarr"))
# 2) read it back from the Zarr store
sdata = read_zarr(sdata.path)
# 3) do something with the data, and save it to a new slot
arr = sdata["image"].data * 2
sdata["image_mul"] = Image2DModel.parse(arr, dims=("c", "y", "x"))
sdata.write_element("image_mul")
for _, layer in sdata["image_mul"].data.__dask_graph__().layers.items():
    if not layer.is_materialized():
        print(layer)  # layer 'mul-...' is not materialized
# waiting for support for reading particular elements; for now reload per element type
sdata_temp = read_zarr(sdata.path, selection=["images"])
del sdata["image_mul"]
# here we get a warning that 'image_mul' already exists; we should probably suppress it
sdata["image_mul"] = sdata_temp["image_mul"]
del sdata_temp
for _, layer in sdata["image_mul"].data.__dask_graph__().layers.items():
    if not layer.is_materialized():
        print(layer)  # nothing is printed: layer 'mul-...' is now materialized
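Packaged as a reusable helper, the reload step above could look roughly like the sketch below; the function name write_element_inplace is hypothetical, and it still re-reads the whole store since element-level reading is not available:
from spatialdata import SpatialData, read_zarr

def write_element_inplace(sdata: SpatialData, name: str) -> None:
    # hypothetical helper, not part of the spatialdata API:
    # write the element to the backing Zarr store ...
    sdata.write_element(name)
    # ... and swap in the materialized element read back from disk
    sdata_reloaded = read_zarr(sdata.path)
    del sdata[name]
    sdata[name] = sdata_reloaded[name]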
Thanks for sharing!
I wonder if a more explicit approach, where the user loads the object in memory completely before calling write_element(), should be preferred.
I'll expand on the above: we would like to allow the data model to also accept non-lazy objects whenever a lazy option is available. This is tracked, for instance, here: https://github.com/scverse/spatialdata/issues/153.
For instance, right now it is not possible to do
sdata['my_image'] = sdata['my_image'].compute()
but we want to allow for this.
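For reference, a workaround that seems possible today is to compute eagerly and re-parse the result through the model (assuming a single-scale 2D image named "my_image" with dims c, y, x, as in the earlier examples); note that this only approximates a true in-memory element:
from spatialdata.models import Image2DModel

# compute eagerly (returns a plain numpy array) ...
arr_in_memory = sdata["my_image"].data.compute()
# ... and re-parse it through the model; parse will likely wrap the array back
# into a Dask-backed DataArray, so this is not a true in-memory element
sdata["my_image"] = Image2DModel.parse(arr_in_memory, dims=("c", "y", "x"))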
Do you think the approach you described based on .persist() could be replaced by what is shown above, or do you still see value in implementing both?
I think adding support for in-memory equivalents of elements would be a nice feature, but I would restrict it to the true in-memory counterparts (Dask array -> NumPy array, Dask DataFrame -> pandas DataFrame, etc.).
If a .to_memory() method is implemented, it should be made clear that it converts Dask to NumPy etc., and does not merely pull Dask arrays or Dask DataFrames into memory (via .persist()).
In short, I don't think spatialdata should call persist() internally, because it would probably confuse users.
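To illustrate the distinction with plain Dask (independent of spatialdata): .persist() keeps the Dask container but loads its chunks, while .compute() returns the actual in-memory equivalent:
import dask.array as da

arr = da.random.random((1, 1000, 2000), chunks=(1, 500, 500)) * 2
persisted = arr.persist()  # still a dask array, but its chunks now live in memory
computed = arr.compute()   # a plain numpy array, the true in-memory equivalent
print(type(persisted))     # <class 'dask.array.core.Array'>
print(type(computed))      # <class 'numpy.ndarray'>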
Ok, thanks for the info. Then I think a good way forward would be to have to_memory() implemented and documented, and in addition to explain the .persist() use case to users, so they can choose what fits them best.
Indeed, to_memory() would be a nice feature, but I think it is somewhat unrelated to a possible inplace or reload parameter for write_element(). I see two separate use cases:
- sdata not backed by disk. For me there are currently no issues here. Having to_memory() would be nice, but I would personally seldom use this feature, since Dask arrays and Dask DataFrames are already in memory in this case. Maybe .compute() would be a clearer name for what to_memory() is intended to do. One could argue that, for consistency with Dask, it would then also be a good idea to implement sdata.persist().
- sdata backed by disk. Having a reload or in_place option in write_element(), so we do not have to read from the Zarr store each time we write to it. I would understand if you prefer not to implement this feature, since the current behaviour is in line with how Dask writes to Zarr stores (.to_zarr()), and we can always reload downstream. Having the option to read specific elements (e.g. read_zarr(element="images/image")) would already be a nice step forward for this use case (see the sketch below).
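As a sketch of that second use case: the selection argument of read_zarr used earlier in this thread already allows a coarser, per-element-type reload that avoids pulling in all tables and shapes; per-element selection would make this finer-grained (same sdata and element names as in the examples above):
# re-read only the image elements instead of the complete store
sdata_images = read_zarr(sdata.path, selection=["images"])
del sdata["image_mul"]
sdata["image_mul"] = sdata_images["image_mul"]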