Let `PointsModel` be a dask or pandas dataframe
For many use cases (like Xenium), points can be handled completely in memory without issue. Given that, and given that the first "best practice" in the Dask DataFrame documentation is "use pandas", I think it's worth supporting pandas dataframes.
They have matching APIs, so many functions would likely not need modification to support both. For functions where we require one or the other, we can use a `sklearn.utils.check_array`-style utility to do the coercion.
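For illustration, a minimal sketch of what such a coercion utility could look like; the name `_check_points_dataframe` and its exact behaviour are hypothetical, not part of spatialdata's current API.

```python
import dask.dataframe as dd
import pandas as pd


def _check_points_dataframe(df, require_dask: bool = False):
    """Accept a pandas or dask dataframe and coerce it to the required type."""
    if isinstance(df, dd.DataFrame):
        return df
    if isinstance(df, pd.DataFrame):
        # only wrap in dask when the caller actually needs a dask dataframe
        return dd.from_pandas(df, npartitions=1) if require_dask else df
    raise TypeError(f"Expected a pandas or dask DataFrame, got {type(df)}")
```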
While I see the argument in favor, I think we should keep only DaskDataFrame for the moment, for the following reasons:
- the Xenium points fit in memory but loading is slow (as discussed here https://github.com/scverse/spatialdata/pull/141), same for CosMx etc. I like the fact that `read_zarr()` returns immediately. Most of the time the user may not need the points.
- we may anyway replace Dask `DataFrame` with `GeoDataFrame`
- I think it would be useful to have an API like `read_zarr(load_images_in_memory=True, load_points_in_memory=True)` etc., or a function `sdata.load_element_in_memory('element_type', 'element_name')` (see the sketch after this list), but we need to understand the implications of this for the io. Since the workflow with io is still to be discussed (as we saw in this PR https://github.com/scverse/spatialdata/pull/138), I would wait until we have that clear before making the change described in this issue.
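As a rough sketch of that helper, under the assumption that elements live in dict-like containers keyed by name (e.g. `sdata.points["transcripts"]`); `load_element_in_memory` does not exist in spatialdata today.

```python
import dask.dataframe as dd


def load_element_in_memory(sdata, element_type: str, element_name: str) -> None:
    """Replace a lazy element with its in-memory counterpart, in place."""
    elements = getattr(sdata, element_type)  # e.g. "points", "images"
    element = elements[element_name]
    if isinstance(element, dd.DataFrame):
        # points: dask dataframe -> pandas dataframe
        elements[element_name] = element.compute()
    elif hasattr(element, "compute"):
        # images/labels backed by dask arrays can be materialized the same way
        elements[element_name] = element.compute()
```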
Developing this point further, I think the best approach would be to have everything lazy by default, even the shapes and the table, and then load things into memory only if explicitly asked by the user, or if needed by an algorithm.
@gtca in postdata everything is lazy by default, right? If I remember correctly, once the user loads something it persists in memory, is that correct?
> the xenium points fit in memory but loading is slow
This doesn't match my experience. The `_get_points` function from spatialdata-io is fairly slow, but `read_parquet` itself seems quite fast.
```python
%%time
transcripts_pq = pd.read_parquet(path / XeniumKeys.TRANSCRIPTS_FILE)
```
```
CPU times: user 6.37 s, sys: 1.7 s, total: 8.07 s
Wall time: 1.88 s
```
```python
%%time
transcripts = _get_points(path, specs)
```
```
CPU times: user 20.8 s, sys: 2.64 s, total: 23.4 s
Wall time: 17.6 s
```
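For comparison, and assuming the same `path` and `XeniumKeys` as in the snippets above, a lazy read with dask defers the actual I/O: only the parquet metadata is touched, so the call returns almost immediately and the data is materialized later on demand.

```python
import dask.dataframe as dd

# only reads parquet metadata here; returns a lazy dask dataframe
transcripts_lazy = dd.read_parquet(path / XeniumKeys.TRANSCRIPTS_FILE)

# the actual I/O happens only when the data is needed
transcripts_df = transcripts_lazy.compute()
```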
> we may anyway replace Dask `DataFrame` with `GeoDataFrame`
I don't think this is a realistic option unless GeoDataFrame starts supporting 3D, which is not on any roadmap as far as I can tell.
> `read_zarr(load_images_in_memory=True, load_points_in_memory=True)`
This is what we're trying to do with `read_dispatched` in anndata. It could be useful to link spatialdata up with the anndata IO system at some point.
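A rough sketch of how `read_dispatched` could skip reading points eagerly; this is based on the `anndata.experimental` API, the exact callback signature may differ between anndata versions, and the `"points"` element name is purely illustrative.

```python
import zarr
from anndata.experimental import read_dispatched


def callback(func, elem_name, elem, iospec):
    if "points" in elem_name:
        # return a lazy placeholder instead of reading the element into memory
        return None
    # fall back to the default reader for everything else
    return func(elem)


z = zarr.open("data.zarr", mode="r")
adata = read_dispatched(z, callback)
```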
> I think it would be useful to have an API like `read_zarr(load_images_in_memory=True, load_points_in_memory=True)`
AnnData has `.to_memory` as well as an equivalent internal singledispatched `to_memory` function which we could export.
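A sketch of what a singledispatch-based `to_memory` could look like for the element types discussed here; the set of registered types is illustrative, not a proposal for the final API.

```python
from functools import singledispatch

import dask.array as da
import dask.dataframe as dd


@singledispatch
def to_memory(element):
    # objects that are already in memory pass through unchanged
    return element


@to_memory.register
def _(element: dd.DataFrame):
    return element.compute()


@to_memory.register
def _(element: da.Array):
    return element.compute()
```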
I see the speed improvements, but I think that 1.88 seconds is still too slow, especially with datasets that can have many more points (like MERSCOPE, see the Zulip message; roughly 10x more points per dataset) or in the future when people have multiple Xenium slices. I think we want the SpatialData object to load immediately (lazy loading).
The `.to_memory()` function would be great; it would be worth trying an implementation soon, so the user can load large portions of the data before long processing computations.
The entire wall time for this is in `PointsModel.parse()` here: https://github.com/scverse/spatialdata/blob/872f8c219fda0e348af056143af9be61a34e8ca5/src/spatialdata/models/models.py#L871
I would suggest allowing a mapping of column names to categories as input to `PointsModel.parse()` and reading the unique values with:
```python
import pyarrow.parquet as pq

table = pq.read_table(path / XeniumKeys.TRANSCRIPTS_FILE, columns=["feature_name"])
unique_values = table.column("feature_name").unique()
```
which took about 2 seconds on my machine. It's not lazy, but it's a very minor change and saves roughly 13 seconds per categorical column when passing any dask dataframe to `PointsModel.parse()`. Basically it just allows the user to precompute this if they want.
On a somewhat related note, the pandas dataframe input path for `PointsModel.parse()` could also be dramatically improved by doing all the formatting in pandas before converting to a dask dataframe; most of the wall time there is spent adding columns back to a predefined dask dataframe.
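To illustrate the suggested pattern (the helper name, the `feature_key` argument, and the `x`/`y` column names are just assumptions for this sketch): do the column formatting on the pandas dataframe first and convert to dask once at the end, instead of assigning columns onto an already-created dask dataframe.

```python
import dask.dataframe as dd
import pandas as pd


def parse_points_from_pandas(df: pd.DataFrame, feature_key: str) -> dd.DataFrame:
    df = df.copy()
    # cheap, vectorized formatting while still in pandas
    df[feature_key] = df[feature_key].astype("category")
    df[["x", "y"]] = df[["x", "y"]].astype(float)
    # single conversion at the end, no column assignment on the dask side
    return dd.from_pandas(df, npartitions=1)
```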