ehrapy
Support for more data types when writing to .h5ad files
Is your feature request related to a problem? Please describe.
Currently, when using caching, we cannot write some data types to .h5ad files, for example booleans in obs.
Workarounds are currently implemented and should be removed once this is resolved (see the comments in the source code).
Describe the solution you would like
See https://github.com/theislab/anndata/issues/662 for more information on fixing this one.
You can't write bools? I expect this to work:
import anndata as ad
import numpy as np
import pandas as pd

(
    ad.AnnData(
        np.ones((5, 5)),
        obs=pd.DataFrame(
            {"bool": np.random.randint(0, 2, size=5, dtype=bool)},
            index=[f"cell{i}" for i in range(5)],
        ),
    )
    .write_h5ad("w_bool.h5ad")
)
This works. It does not work in cases where I have a bool dtype column in obs and some of the values are missing. In general, I cannot write .h5ad files when there are missing values (like NaNs).
Any ideas why this is especially a problem with bool and non-numerical dtypes in general? Currently, I'm forced to replace every parsed NaN (or other missing value) in a "non-numerical" dtype column of X with an empty string; otherwise I am not able to write to .h5ad files.
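For readers following along, the workaround described above could look roughly like the sketch below. It is only an illustration using plain pandas: the helper name fill_missing_non_numeric and the example columns are hypothetical and are not taken from the ehrapy source code.

```python
import numpy as np
import pandas as pd


def fill_missing_non_numeric(df: pd.DataFrame, fill_value: str = "") -> pd.DataFrame:
    """Hypothetical helper: replace missing values in non-numeric columns only."""
    out = df.copy()
    for col in out.columns:
        # Numeric columns keep their NaNs; only non-numeric columns are touched.
        if not pd.api.types.is_numeric_dtype(out[col]):
            as_object = out[col].astype(object)
            # Keep existing values, substitute fill_value where a value is missing.
            out[col] = as_object.where(as_object.notna(), fill_value)
    return out


# Made-up example data: one numeric column and one object column, each with a gap.
frame = pd.DataFrame(
    {
        "age": [31.0, np.nan, 58.0],
        "sex": ["f", np.nan, "m"],
    }
)
cleaned = fill_missing_non_numeric(frame)
```

The obvious downside is that the empty string is no longer distinguishable from a genuinely empty value, which is why this is a workaround rather than a fix.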
> Any ideas why this is especially a problem with bool and non-numerical dtype in general?
This mostly comes down to dtypes: what numpy (as well as hdf5 and zarr) supports versus what pandas supports.
import pandas as pd, numpy as np

pd.DataFrame({
    "np-bool": np.ones(10, dtype=bool),
    "pd-bool": pd.array(np.zeros(10, dtype=bool))
}).dtypes

np-bool       bool
pd-bool    boolean
dtype: object
A column with np.bool_ dtype can currently be written. A pd.BooleanDtype column can't. In addition, neither h5py nor zarr has a native representation for "null" values for integers or booleans. Internally, pandas stores a mask to denote which values in a pd.arrays.BooleanArray or pd.arrays.IntegerArray are missing.
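To make the values-plus-mask idea concrete, here is a small sketch that splits a nullable boolean array into two plain numpy arrays and puts it back together. It uses only public pandas calls and is meant as an illustration of the concept, not as a description of how anndata or the linked pull request stores such columns. The placeholder value False for missing entries is an arbitrary choice.

```python
import numpy as np
import pandas as pd

# A nullable boolean array with one missing entry.
arr = pd.array([True, None, False], dtype="boolean")

# The mask marks which positions are missing.
mask = np.asarray(arr.isna())

# Plain numpy bools; missing positions get an arbitrary placeholder (False here).
values = arr.to_numpy(dtype=bool, na_value=False)

# Both pieces are ordinary numpy arrays, which formats without a native
# "null" for booleans can hold. Rebuilding the nullable array from them:
restored = pd.array(values, dtype="boolean")
restored[mask] = pd.NA
```

Without the mask, the placeholder at a missing position cannot be told apart from a real False, so both arrays have to be kept together.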
> In general, I cannot write .h5ad files when there are missing values (like nans).
There will be support for more kinds of values in anndata soon. Specifically for nullable integer and boolean support, you could try out https://github.com/theislab/anndata/pull/669.
Exactly how pandas is going to handle nullable string arrays seems likely to change soon, and anndata largely casts these to categorical anyway, so I'm thinking I'll leave that one for now.
> A pd.BooleanDtype column can't.
I see, thanks for the clarification. Yes, we're using the built-in read_csv from pandas when parsing .csv files and that explains the issues here.
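As a rough illustration of how read_csv can end up with a boolean column that is not a plain numpy bool, the sketch below parses a tiny made-up CSV twice: once with default settings and once explicitly requesting the nullable "boolean" dtype. The column names and data are invented for the example, and the exact default dtype may depend on the pandas version.

```python
import io

import pandas as pd

# Made-up CSV with a missing value in the boolean column.
csv_text = "id,flag\na,True\nb,\nc,False\n"

# Default parsing: the gap prevents a plain numpy bool column
# (in the pandas versions I am aware of, the column becomes object dtype).
default = pd.read_csv(io.StringIO(csv_text))

# Explicitly requesting the nullable boolean dtype keeps the gap as pd.NA.
nullable = pd.read_csv(io.StringIO(csv_text), dtype={"flag": "boolean"})

print(default.dtypes)
print(nullable.dtypes)
```

Either way, the resulting column is not an np.bool_ column, which matches the case described above that cannot currently be written.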
> For specifically nullable integer and boolean support, you could try out theislab/anndata#669.
I'll check this out. It might simplify our parsing code a bit, thanks.