(Semi-)automatic conversion of nullable columns to the appropriate pandas arrays

Open grst opened this issue 1 year ago • 9 comments

Please describe your wishes and possible alternatives to achieve the desired result.

Since #504, AnnData supports nullable int and bool columns in obs. Support for strings is planned in #679.

However, this only works if the nullable columns are represented as the appropriate pandas extension array type (for example Int64 or boolean).

For instance, this:

import anndata
import numpy as np
import pandas as pd

adata = anndata.AnnData(
    X=None,
    obs=pd.DataFrame().assign(
        # The None entries force both columns to object dtype
        test_int=np.array([1, 2, None, 3]),
        test_bool=[True, False, None, False],
    ),
)
adata.write_h5ad("test.h5ad")

fails with TypeError: Can't implicitly convert non-string objects to strings.

After converting the columns to pandas arrays, the object can be saved:

# pd.array infers the nullable extension dtype (Int64, boolean) from the values
for c in adata.obs.columns:
    adata.obs[c] = pd.array(adata.obs[c].values)
adata.write_h5ad("test.h5ad")

Unfortunately, pandas extension arrays are little known, and None values can end up in adata.obs for various reasons (for instance, https://github.com/scverse/scirpy/issues/434).

I was wondering if such columns should be automatically converted to the appropriate pandas array, e.g. on save? Or maybe there should be an equivalent to AnnData.strings_to_categoricals that can be called to sanitize such columns?
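
To make the second option more concrete, here is a rough sketch of what such a sanitizing helper could look like. The name sanitize_nullable_columns is hypothetical and not part of AnnData; the sketch works on a plain DataFrame such as adata.obs and assumes that only object-dtype columns whose non-missing values are all integers or all booleans should be converted.

```python
import pandas as pd


def sanitize_nullable_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper (not an AnnData API).

    Returns a copy of `df` in which object-dtype columns that hold only
    integers or only booleans (plus missing values) are converted to the
    matching nullable pandas extension array. Other columns are untouched.
    """
    out = df.copy()
    for col in out.columns:
        if not pd.api.types.is_object_dtype(out[col]):
            continue
        # infer_dtype ignores missing values when skipna=True
        inferred = pd.api.types.infer_dtype(out[col], skipna=True)
        if inferred == "integer":
            out[col] = pd.array(out[col].to_numpy(), dtype="Int64")
        elif inferred == "boolean":
            out[col] = pd.array(out[col].to_numpy(), dtype="boolean")
    return out


# Example input mirroring the obs columns from above
obs = pd.DataFrame(
    {
        "test_int": pd.Series([1, 2, None, 3], dtype=object),
        "test_bool": pd.Series([True, False, None, False], dtype=object),
    }
)
clean = sanitize_nullable_columns(obs)
print(clean.dtypes)
```

With something like this, a user could run adata.obs = sanitize_nullable_columns(adata.obs) before writing. Whether this should instead happen automatically on save is exactly the open question here.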

grst, Jul 23 '23 18:07