Categories missing with highly partitioned dask dataframes in PointsModel

Open jonas2612 opened this issue 2 months ago • 0 comments

  1. Reproduce using the blobs dataset

    import numpy as np
    import pandas as pd
    import dask.dataframe as dd
    from spatialdata.datasets import blobs
    
    s = blobs()
    tbl = next(iter(s.tables.values()))
    df = tbl.obs.copy()
    n_cats = 500  # assumed value; must exceed the 20 categories sampled for the front rows below
    cats = pd.Index([f"G{i:04d}" for i in range(n_cats)], dtype="string")
    
    rng = np.random.default_rng(0)
    n = len(df)
    k_front = min(10_000, n)
    front = rng.choice(cats[:20], size=k_front)
    back  = rng.choice(cats, size=n - k_front)
    df["gene"] = pd.Index(np.concatenate([front, back]), dtype="string")
    
    ddf_many = dd.from_pandas(df, npartitions=217)
    
    c1 = ddf_many["gene"].astype(str).astype("category").head(1).cat.categories
    print("many partitions, categories seen via head(1):", len(c1))  # typically ~20
    
    ddf_as_known = ddf_many["gene"].astype("category").cat.as_known()
    print("with .cat.as_known(), categories:", len(ddf_as_known._meta.cat.categories))
    

Describe the bug The categories are set in L888 of src/spatialdata/models/models.py without taking all categories into account. When the categories are determined per partition, any category that does not appear in the inspected partition is not registered. This silently filters the points dataframe and drops features.
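A minimal pandas-only sketch of the effect, under stated assumptions: the gene labels and sizes are synthetic, and contiguous row chunks stand in for dask partitions. It only illustrates why categories inferred from a single partition can miss most values; it is not the code in models.py.

```python
import numpy as np
import pandas as pd

# Synthetic labels (assumption): 50 genes, but the first rows use only 5 of them.
all_genes = [f"G{i:04d}" for i in range(50)]
rng = np.random.default_rng(0)
front = rng.choice(all_genes[:5], size=200)
back = rng.choice(all_genes, size=5_000)
genes = pd.Series(np.concatenate([front, back]), name="gene")

# Emulate partitions by splitting the rows into contiguous chunks of 200 rows.
n_parts = 26
bounds = np.linspace(0, len(genes), n_parts + 1, dtype=int)
partitions = [genes.iloc[a:b] for a, b in zip(bounds[:-1], bounds[1:])]

# Categories inferred from the first partition only.
first_cats = set(partitions[0].astype("category").cat.categories)

# Categories collected over all partitions.
all_cats = set()
for part in partitions:
    all_cats.update(part.astype("category").cat.categories)

# Labels that would be lost if only the first partition defined the categories.
missing = all_cats - first_cats
print("first partition:", len(first_cats))
print("all partitions:", len(all_cats))
print("missing:", len(missing))
```

Any row whose label falls in `missing` would end up without a valid category if the first partition's categories were applied to the whole column.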

To Reproduce See the example above on the blobs dataset, or use datasets of the Allen Brain Atlas (https://knowledge.brain-map.org/abcatlas#AQEBSzlKTjIzUDI0S1FDR0s5VTc1QQACSFNZWlBaVzE2NjlVODIxQldZUAADAAQBAAKEUL8fg4IJfwOFLj12hMQ92QQyTlFUSUU3VEFNUDhQUUFITzRQAAWBr6ZKgemsDoGggUeAktXoBgAHAAAFAAYBAQJGUzAwRFhWMFQ5UjFYOUZKNFFFAAN%2BAAAABAAACFZGT0ZZUEZRR1JLVURRVVozRkYACUxWREJKQVc4Qkk1WVNTMVFVQkcACgALAVRMT0tXQ0w5NVJVMDNEOVBFVEcAAjczR1ZURFhERUdFMjdNMlhKTVQAAwEEAQACIzAwMDAwMAADyAEABQEBAiMwMDAwMDAAA8gBAAAAAgEA). As far as we know, the error occurs on all datasets there.

According to https://docs.dask.org/en/stable/dataframe-categoricals.html, one solution is to call

    data[c] = data[c].cat.as_known()

to make the categories known, and then, to register them,

    data[c] = data[c].cat.set_categories(data[c]._meta.cat.categories)

The last step can probably be skipped, though. This implementation is a bit slower than the existing one, because as_known() has to compute the categories over all partitions.

I'll open a pull request with these changes later.

Expected behavior All categories are registered in the points dataframe.

  • OS: Linux Ubuntu
  • Version 0.5.0

Additional context Reading datasets from the Allen Brain Atlas with spatialdata_io.readers.merscope results in a transcript dataframe with ~20-30 genes instead of ~500 genes.

jonas2612, Oct 30 '25 08:10