Categories missing with highly partitioned dask dataframes in PointsModel

Open jonas2612 opened this issue 2 months ago • 0 comments

Reproduce using the blobs dataset

import numpy as np
import pandas as pd
import dask.dataframe as dd
from spatialdata.datasets import blobs

s = blobs()
tbl = next(iter(s.tables.values()))
df = tbl.obs.copy()
n_cats = 15
cats = pd.Index([f"G{i:04d}" for i in range(n_cats)], dtype="string")

rng = np.random.default_rng(0)
n = len(df)
k_front = min(10_000, n)
front = rng.choice(cats[:20], size=k_front)
back  = rng.choice(cats, size=n - k_front)
df["gene"] = pd.Index(np.concatenate([front, back]), dtype="string")

ddf_many = dd.from_pandas(df, npartitions=217)

c1 = ddf_many["gene"].astype(str).astype("category").head(1).cat.categories
print("many partitions, categories seen via head(1):", len(c1))  # at most the categories present in the first partition

ddf_as_known = ddf_many["gene"].astype("category").cat.as_known()
print("with .cat.as_known(), categories:", len(ddf_as_known._meta.cat.categories))

Describe the bug
The categories set at L888 of src/spatialdata/models/models.py do not cover all categories present in the data. Because the categories are determined per partition, categories that do not occur in the inspected partition are never registered. This leads to an inadvertent filtering of the points dataframe and a loss of features.
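
The effect can be illustrated without dask. The following is a minimal sketch, assuming the categories end up being taken from a single partition; it does not reproduce the actual code at L888. The data, the chunk size of 50 rows, and all variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 50 gene labels, each repeated 20 times in sorted order.
genes = [f"G{i:04d}" for i in range(50)]
s = pd.Series(np.repeat(genes, 20), name="gene")

# Emulate partitions by cutting the series into chunks of 50 rows.
chunks = [s.iloc[i:i + 50] for i in range(0, len(s), 50)]

# Categories inferred from the first chunk only.
first_cats = chunks[0].astype("category").cat.categories

# Forcing those categories onto the full column turns every label
# outside the first chunk into NaN.
forced = s.astype("category").cat.set_categories(first_cats)
n_lost = int(forced.isna().sum())

# Collecting labels over all chunks recovers the full set.
seen = set()
for chunk in chunks:
    seen.update(chunk.unique())
all_cats = sorted(seen)

print(len(first_cats), n_lost, len(all_cats))
```

Rows whose label is missing from the first chunk become NaN, which matches the kind of silent feature loss described here.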

To Reproduce
See the example above on the blobs dataset, or use datasets of the Allen Brain Atlas (https://knowledge.brain-map.org/abcatlas#AQEBSzlKTjIzUDI0S1FDR0s5VTc1QQACSFNZWlBaVzE2NjlVODIxQldZUAADAAQBAAKEUL8fg4IJfwOFLj12hMQ92QQyTlFUSUU3VEFNUDhQUUFITzRQAAWBr6ZKgemsDoGggUeAktXoBgAHAAAFAAYBAQJGUzAwRFhWMFQ5UjFYOUZKNFFFAAN%2BAAAABAAACFZGT0ZZUEZRR1JLVURRVVozRkYACUxWREJKQVc4Qkk1WVNTMVFVQkcACgALAVRMT0tXQ0w5NVJVMDNEOVBFVEcAAjczR1ZURFhERUdFMjdNMlhKTVQAAwEEAQACIzAwMDAwMAADyAEABQEBAiMwMDAwMDAAA8gBAAAAAgEA). As far as we know, the error occurs on all datasets there.

According to https://docs.dask.org/en/stable/dataframe-categoricals.html a solution could be to use

data[c] = data[c].cat.as_known()

to make the categories known, and then, to register them,

data[c] = data[c].cat.set_categories(data[c]._meta.cat.categories)

The last step could probably be skipped. This implementation is a bit slower than the previous one, since .cat.as_known() has to go through the data to collect the categories.

I'll open a pull request with these changes later.

Expected behavior
All categories in the points dataframe are registered.

OS: Linux Ubuntu
Version 0.5.0

Additional context
Reading datasets from the Allen Brain Atlas with spatialdata_io.readers.merscope results in a transcript dataframe with ~20-30 genes instead of ~500 genes.

Oct 30 '25 08:10 jonas2612