ehrapy

Support for more data types when writing to .h5ad files

Open Imipenem opened this issue 3 years ago • 4 comments

Is your feature request related to a problem? Please describe.

Currently, when using caching, we cannot write some data types, such as booleans in obs.

Workarounds are currently implemented and should be removed once this is resolved (see the comments in the source code).

Describe the solution you would like

See https://github.com/theislab/anndata/issues/662 for more information on fixing this one.

Imipenem, Dec 12 '21 14:12

You can't write bools? I expect this to work:

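import anndata as ad
import numpy as np
import pandas as pd

# Write an AnnData object whose obs holds a plain NumPy bool column.
(
    ad.AnnData(
        np.ones((5, 5)),
        obs=pd.DataFrame(
            {"bool": np.random.randint(0, 2, size=5, dtype=bool)},
            index=[f"cell{i}" for i in range(5)],
        ),
    )
    .write_h5ad("w_bool.h5ad")
)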

ivirshup, Dec 21 '21 15:12

This works. It does not work in cases where I have a bool dtype column in obs and some of the values are missing. In general, I cannot write .h5ad files when there are missing values (like nans). Any ideas why this is especially a problem with bool and non-numerical dtype in general? Currently, I'm forced to replace every NaN (or other missing value) that ends up in a "non-numerical" dtype column of X with an empty string; otherwise I am not able to write to .h5ad files.
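For readers following along, the workaround described above could look roughly like the sketch below. It is not the actual ehrapy code: the helper name `fill_non_numeric_missing` and the example frame are made up for illustration, and it only assumes that missing values in non-numerical columns are swapped for an empty string before writing.

```python
import numpy as np
import pandas as pd


def fill_non_numeric_missing(df: pd.DataFrame, fill_value: str = "") -> pd.DataFrame:
    """Hypothetical helper: replace missing values in non-numerical columns."""
    out = df.copy()
    for col in out.columns:
        series = out[col]
        is_bool = pd.api.types.is_bool_dtype(series)
        is_numeric = pd.api.types.is_numeric_dtype(series)
        # Leave real numeric columns alone; NaN is representable there.
        if is_numeric and not is_bool:
            continue
        # Cast to object so that bools/strings and the fill value can coexist.
        out[col] = series.astype(object).where(series.notna(), fill_value)
    return out


example = pd.DataFrame(
    {
        "num": [1.0, np.nan],
        "flag": [True, None],   # bool column with a missing value
        "name": ["a", np.nan],  # string column with a missing value
    }
)
filled = fill_non_numeric_missing(example)
```

The numeric column keeps its NaN, while the other two columns end up as object columns without missing values. The obvious downside is that an empty string is no longer distinguishable from a genuinely empty value.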

Imipenem, Dec 22 '21 20:12

Any ideas why this is especially a problem with bool and non-numerical dtype in general?

This is mostly around dtypes and what numpy supports (as well as hdf5 and zarr) vs pandas.

import pandas as pd, numpy as np

pd.DataFrame({
    "np-bool": np.ones(10, dtype=bool),
    "pd-bool": pd.array(np.zeros(10, dtype=bool))
}).dtypes
np-bool       bool
pd-bool    boolean
dtype: object

A column with np.bool_ dtype can currently be written. A pd.BooleanDtype column can't. In addition, neither h5py nor zarr has a native representation for "null" values for integers or booleans.

Internally, pandas stores a mask to denote which values in a pd.arrays.BooleanArray or pd.arrays.IntegerArray are missing.
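To make the mask idea concrete, here is a minimal sketch using only public pandas calls. It splits a nullable boolean array into two plain NumPy boolean arrays (the kind of data hdf5 and zarr can hold) and rebuilds it afterwards. This only illustrates the concept; it is not a claim about how anndata stores these arrays on disk, and the choice of `False` as the placeholder for missing entries is arbitrary.

```python
import numpy as np
import pandas as pd

# A nullable boolean array: the second entry is missing (pd.NA).
nullable = pd.array([True, None, False], dtype="boolean")

# Split into two plain NumPy bool arrays: which entries are missing,
# and the values with an arbitrary placeholder in the missing slots.
mask = np.asarray(nullable.isna())
values = nullable.to_numpy(dtype=bool, na_value=False)

# Rebuild the nullable array from the two pieces.
restored = pd.arrays.BooleanArray(values, mask)
```

Both `values` and `mask` are ordinary `np.bool_` arrays, so each could be written on its own; the missing-value information survives only if the two are kept together.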

In general, I cannot write .h5ad files when there are missing values (like nans).

There will be support for more kinds of values in anndata soon. For specifically nullable integer and boolean support, you could try out https://github.com/theislab/anndata/pull/669.

Exactly how pandas is going to handle nullable string arrays seems likely to change soon, and anndata largely casts these to categorical anyway, so I'm thinking I'll leave that one for now.
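As a small illustration of why casting to categorical sidesteps the missing-string problem, the sketch below shows how pandas itself represents a missing entry in a categorical: as the integer code -1, so no "null string" is needed. The example data is made up, and this shows pandas behavior only, not anndata's on-disk format.

```python
import pandas as pd

# A string column with one missing entry.
labels = pd.Series(["a", None, "b", "a"])

# Cast to categorical: values become integer codes into a category list.
categorical = labels.astype("category")

codes = categorical.cat.codes.tolist()            # missing entry gets code -1
categories = categorical.cat.categories.tolist()  # only the observed labels
```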

ivirshup, Dec 25 '21 19:12

A pd.BooleanDtype column can't.

I see, thanks for the clarification. Yes, we're using the built-in read_csv from pandas when parsing .csv files, and that explains the issues here.
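For anyone reproducing this, a minimal sketch of how the dtype of a boolean CSV column with a missing entry depends on how read_csv is called. The CSV content is made up for illustration. With default settings the column does not come back as a plain NumPy bool column (it is typically an object column), while passing `dtype={"flag": "boolean"}` requests the nullable pd.BooleanDtype explicitly.

```python
import io

import pandas as pd

# Hypothetical CSV with a boolean column where the second row is empty.
CSV_TEXT = "id,flag\n1,True\n2,\n3,False\n"

# Default parsing: the missing entry prevents a plain NumPy bool column.
df_default = pd.read_csv(io.StringIO(CSV_TEXT))

# Explicitly request the nullable boolean dtype for that column.
df_nullable = pd.read_csv(io.StringIO(CSV_TEXT), dtype={"flag": "boolean"})

print(df_default.dtypes)
print(df_nullable.dtypes)
```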

For specifically nullable integer and boolean support, you could try out theislab/anndata#669.

Will check this out, might ease our parsing code here a bit, thanks.

Imipenem, Dec 25 '21 21:12