Nullable string columns

Open ivirshup opened this issue 2 years ago • 12 comments

Split off from #504

It would be nice to have support for nullable string arrays. It would be good to have a consistent in-memory representation for these so we can reason about performance. However, this does not currently exist in our dependency stack. I currently think this feature will depend on upstream developments in the pandas StringArray type.

This is less urgent than nullable integers and booleans, since we already have nullable categorical arrays and we currently aggressively cast strings to categorical for performance reasons.

ivirshup avatar Jan 11 '22 21:01 ivirshup

Pandas now has multiple ways of doing strings, custom and via arrow.

I think we could still handle both these cases by just storing a mask. This would just be a little inefficient, but we can always update.

Maybe we could even handle arrow bit masks if those seem to be the path forward (docs for bit masks, docs for np.packbits)
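A minimal sketch of what an Arrow-style bit mask could look like with np.packbits, assuming the Arrow convention of a validity bitmap (1 = valid, 0 = null) with least-significant-bit-first ordering. The helper names pack_validity and unpack_validity are hypothetical and not part of anndata:

```python
import numpy as np


def pack_validity(is_null: np.ndarray) -> np.ndarray:
    """Pack a boolean null mask into a validity bitmap (1 = valid, LSB first)."""
    valid = ~np.asarray(is_null, dtype=bool)
    return np.packbits(valid, bitorder="little")


def unpack_validity(bitmap: np.ndarray, length: int) -> np.ndarray:
    """Recover the boolean null mask; `length` drops the padding bits."""
    valid = np.unpackbits(bitmap, count=length, bitorder="little").astype(bool)
    return ~valid


# Six entries, two of them valid: one byte is enough to store the mask.
is_null = np.array([True, True, False, True, True, False])
bitmap = pack_validity(is_null)
restored = unpack_validity(bitmap, len(is_null))
```

The packed form needs one byte per eight entries instead of one byte per entry, at the cost of having to store the original length alongside it.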

ivirshup avatar Sep 16 '22 12:09 ivirshup

This is also relevant for the output of sc.tl.filter_rank_genes_groups in scanpy, which makes some of the genes in the newly created uns part nan:

adata.uns['rank_genes_groups_filtered']['names'][0]
(nan, nan, 'NKG7', nan, nan, 'PPBP')

I'm adding it here since this seems to be related to the issue of h5py > 3.0 not being happy with casting non-strings to strings:

TypeError: Can't implicitly convert non-string objects to strings

Above error raised while writing key 'names' of <class 'h5py._hl.group.Group'> to /

Let me know if you prefer me to open a new issue on Scanpy.

pcm32 avatar Dec 01 '22 10:12 pcm32

This issue wouldn't apply for rank genes groups because that object is a record array, while this issue addresses dataframe columns specifically.

ivirshup avatar Dec 01 '22 16:12 ivirshup

No problem, should I open a new one here or on scanpy?

pcm32 avatar Dec 01 '22 16:12 pcm32

@pcm32, sorry for the late response here. Came at a busy time of year.

I believe there will already be issues open on scanpy for this.

ivirshup avatar Feb 28 '23 14:02 ivirshup

About implementation for nullable string support:

This is somewhat complicated by pandas having multiple backends for nullable string arrays (pyarrow and pd.StringDtype).

We probably want to go with an on disk representation of arrays similar to the arrow in memory representation, but it seems configurable in pandas whether we get the pyarrow representation or the pandas rep. We also don't want to add a hard dependency on pyarrow.

I'm also not sure how we can go from the pandas representation to something writable. We can easily get the masks (.isna()), but I don't know how to handle the "data" containing pd.NA values. I think we probably just make a copy of the array and replace the NA entries with some other string, but idk that there's a great choice here.
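A minimal sketch of the mask-plus-filled-copy idea, assuming the empty string as the fill value (an arbitrary choice; the mask is what marks an entry as missing). The helpers split_nullable_strings and join_nullable_strings are hypothetical names, not anndata API:

```python
import numpy as np
import pandas as pd


def split_nullable_strings(col: pd.Series, fill: str = ""):
    """Return (values, mask): a plain object array with NA replaced by `fill`,
    plus a boolean mask that is True where the original entry was missing."""
    mask = col.isna().to_numpy()
    values = col.to_numpy(dtype=object, na_value=fill)
    return values, mask


def join_nullable_strings(values: np.ndarray, mask: np.ndarray) -> pd.Series:
    """Rebuild a nullable string column from the two writable arrays."""
    out = pd.array(values, dtype="string")
    out[mask] = pd.NA
    return pd.Series(out)


col = pd.Series(["NKG7", pd.NA, "PPBP"], dtype="string")
values, mask = split_nullable_strings(col)
restored = join_nullable_strings(values, mask)
```

Because the mask is stored separately, a genuine empty string in the data stays distinguishable from a missing value after the round trip.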

ivirshup avatar Feb 28 '23 14:02 ivirshup

This is starting to come up more frequently, and will likely be even more of an issue with the next release of pandas (which is coming soon).

To head that off, I think I'm going to add this feature for 0.9

ivirshup avatar Mar 01 '23 13:03 ivirshup

Thank you for this update. Is there an estimated timeline for this issue to be patched? I'm facing it exactly where you explained, due to nan in ranked gene lists.

nroak avatar Mar 08 '23 19:03 nroak

OK, to be clear: This issue means support for pandas.core.arrays.string_.StringArray as described in #963, right?

flying-sheep avatar Jun 12 '23 09:06 flying-sheep

I had a similar issue when trying to concatenate objects. One dataset has a boolean obs column, the other does not. When they are combined it becomes True/False/NaN. I agree that it's up to the user to determine and explicitly define how they want to handle this.

For me, this is 1 step in a longer pipeline for data exploration. It would be nice to have a lazy save here where I concatenate multiple datasets without explicitly resolving which columns will be informative later on.
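For illustration, the boolean column case described above can be reproduced with plain pandas. This is a sketch under assumptions: the column names is_doublet and batch are made up, and pd.concat on obs-like dataframes stands in for the anndata concatenation:

```python
import pandas as pd

# Two obs-like tables; only the first one has the boolean column.
obs_a = pd.DataFrame({"is_doublet": [True, False]}, index=["a1", "a2"])
obs_b = pd.DataFrame({"batch": ["x"]}, index=["b1"])

# An outer join keeps every column; missing entries become NaN, so the
# boolean column ends up holding True/False/NaN.
combined = pd.concat([obs_a, obs_b], join="outer")

# One explicit way to resolve it: cast to the nullable boolean dtype,
# which represents the missing entries as pd.NA.
resolved = combined["is_doublet"].astype("boolean")
```

Whether to cast, fill, or drop such a column is the explicit decision the comment refers to; the cast shown here is only one of the options.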

MishalAshraf avatar Jul 17 '23 17:07 MishalAshraf

milestone was missed, bumping to 0.10.0

flying-sheep avatar Jul 21 '23 09:07 flying-sheep