Support coloring by boolean variables

Open fbnrst opened this issue 3 years ago • 14 comments

[x] I have checked that this issue has not already been reported.
[x] I have confirmed this bug exists on the latest version of scanpy.
[ ] (optional) I have confirmed this bug exists on the master branch of scanpy.

Minimal code sample (that we can copy&paste without having any data)

import scanpy as sc
adata = sc.datasets.blobs()
sc.pp.pca(adata)

adata.obs['boolean'] = True

sc.pl.pca(adata, color='boolean')

... storing 'blobs' as categorical

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-11-1415b8dea7b8> in <module>
      5 adata.obs['boolean'] = True
      6 
----> 7 sc.pl.pca(adata, color='boolean')

/opt/conda/lib/python3.7/site-packages/scanpy/plotting/_tools/scatterplots.py in pca(adata, annotate_var_explained, show, return_fig, save, **kwargs)
    727     if not annotate_var_explained:
    728         return embedding(
--> 729             adata, 'pca', show=show, return_fig=return_fig, save=save, **kwargs
    730         )
    731     else:

/opt/conda/lib/python3.7/site-packages/scanpy/plotting/_tools/scatterplots.py in embedding(adata, basis, color, gene_symbols, use_raw, sort_order, edges, edges_width, edges_color, neighbors_key, arrows, arrows_kwds, groups, components, layer, projection, scale_factor, color_map, cmap, palette, na_color, na_in_legend, size, frameon, legend_fontsize, legend_fontweight, legend_loc, legend_fontoutline, vmax, vmin, add_outline, outline_width, outline_color, ncols, hspace, wspace, title, show, save, ax, return_fig, **kwargs)
    257         if sort_order is True and value_to_plot is not None and categorical is False:
    258             # Higher values plotted on top, null values on bottom
--> 259             order = np.argsort(-color_vector, kind="stable")[::-1]
    260         elif sort_order and categorical:
    261             # Null points go on bottom

TypeError: The numpy boolean negative, the `-` operator, is not supported, use the `~` operator or the logical_not function instead.
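
The failure can be reproduced with NumPy alone, without scanpy. The sketch below is not the scanpy fix: it only shows why unary minus fails on a boolean array and one way to get the same "higher values on top" ordering by casting first. The name `color_vector` is borrowed from the traceback; the sample values are made up.

```python
import numpy as np

# Stand-in for the boolean colour vector from the traceback (values are illustrative).
color_vector = np.array([True, False, True, False])

negation_error = None
try:
    # Same expression as scatterplots.py line 259: unary minus on a bool array.
    np.argsort(-color_vector, kind="stable")
except TypeError as err:
    negation_error = err

# Casting to a numeric dtype first makes the unary minus well defined.
order = np.argsort(-color_vector.astype(float), kind="stable")[::-1]

# Points are drawn in this order, so True values come last and end up on top.
sorted_values = color_vector[order]
```

Whether booleans should be ordered numerically at all is a separate design question, discussed further down the thread.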

Versions

scanpy==1.7.0 anndata==0.7.5 umap==0.5.1 numpy==1.20.0 scipy==1.6.0 pandas==1.2.1 scikit-learn==0.24.1 statsmodels==0.12.2 python-igraph==0.8.3 louvain==0.7.0 leidenalg==0.8.3

Feb 11 '21 18:02 fbnrst

I think this is more of an enhancement than a bug, though an error message saying that we don't have a way to color by boolean values would be clearer.

What would you expect this to look like? Which styling options apply here?

Feb 12 '21 00:02 ivirshup

I came across this when I wanted to plot the predicted doublets from scrublet. predicted_doublet is stored as boolean. So, I would like to have this plotted like a categorical. I realized that plotting actually works when using pl.scatter:

import scanpy as sc
adata = sc.datasets.blobs()
sc.pp.pca(adata)

adata.obs['boolean'] = True

sc.pl.scatter(adata, color='boolean', basis='pca')

Feb 12 '21 09:02 fbnrst

I have noticed the same issue. As a workaround you can do

adata.obs['boolean'] = adata.obs['boolean'].astype(str).astype('category')

Feb 12 '21 14:02 fidelram

Thanks. I tried this as well. The problem that I had was that going back to a Boolean for subsetting was not easy:

adata.obs['boolean'] = adata.obs['boolean'].astype(str).astype('category')
adata[adata.obs['boolean'].astype(bool)]

This throws a key error:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-26-3fef793fe5bd> in <module>
----> 1 adata[adata.obs['boolean'].astype(bool)]

/opt/conda/lib/python3.7/site-packages/anndata/_core/anndata.py in __getitem__(self, index)
   1085     def __getitem__(self, index: Index) -> "AnnData":
   1086         """Returns a sliced view of the object."""
-> 1087         oidx, vidx = self._normalize_indices(index)
   1088         return AnnData(self, oidx=oidx, vidx=vidx, asview=True)
   1089 

/opt/conda/lib/python3.7/site-packages/anndata/_core/anndata.py in _normalize_indices(self, index)
   1066 
   1067     def _normalize_indices(self, index: Optional[Index]) -> Tuple[slice, slice]:
-> 1068         return _normalize_indices(index, self.obs_names, self.var_names)
   1069 
   1070     # TODO: this is not quite complete...

/opt/conda/lib/python3.7/site-packages/anndata/_core/index.py in _normalize_indices(index, names0, names1)
     32             index = index[0].values, index[1]
     33     ax0, ax1 = unpack_index(index)
---> 34     ax0 = _normalize_index(ax0, names0)
     35     ax1 = _normalize_index(ax1, names1)
     36     return ax0, ax1

/opt/conda/lib/python3.7/site-packages/anndata/_core/index.py in _normalize_index(indexer, index)
     99                 not_found = indexer[positions < 0]
    100                 raise KeyError(
--> 101                     f"Values {list(not_found)}, from {list(indexer)}, "
    102                     "are not valid obs/ var names or indices."
    103                 )

KeyError: 'Values [True, True, True, (.... I shorten this part....) True, True, True, True, True, True, True, True, True, True, True, True, True, True], are not valid obs/ var names or indices.'

while this works:

adata[adata.obs['boolean'].astype(bool) == True]

Feb 12 '21 17:02 fbnrst

You can make a new column to avoid overwriting the boolean:

adata.obs['boolean_cat'] = adata.obs['boolean'].astype(str).astype('category')
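
A minimal sketch of that two-column approach, using a plain pandas DataFrame as a stand-in for `adata.obs` (an assumption made so the snippet runs without any data; the cell names are made up). The boolean column stays available for subsetting, and a mask can also be recovered from the string labels by comparing against the string `"True"`.

```python
import pandas as pd

# Plain DataFrame standing in for adata.obs (illustrative values).
obs = pd.DataFrame(
    {"boolean": [True, False, True]},
    index=["cell_0", "cell_1", "cell_2"],
)

# Separate categorical column for plotting; the boolean column is left untouched.
obs["boolean_cat"] = obs["boolean"].astype(str).astype("category")

# Option 1: subset with the original boolean column.
mask_from_bool = obs["boolean"]

# Option 2: rebuild a boolean mask from the labels by comparing to the string "True".
mask_from_labels = obs["boolean_cat"] == "True"

kept_cells = list(obs.index[mask_from_labels.to_numpy()])
```

Comparing against the string is deliberate: in Python any non-empty string is truthy, so `bool("False")` is `True`, and a plain truthiness cast cannot tell the two labels apart.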

Feb 12 '21 17:02 fidelram

Thanks. Agreed. This is a workaround for sure.

Feb 12 '21 17:02 fbnrst

Possible solution: just treat boolean as categorical

- This would also handle nullable booleans (though we currently cannot write those through anndata).
- Color palettes would just be palette={True: "#xxxxxx", False: "#xxxxxx"}.
- The default color palette (the current behavior for categoricals with two groups) is orange for True, blue for False.
  - We could change this, but I'm not sure what a good default is if lightgray is null.
- Ordering would be done as categorical, not numeric (would it make more sense for True to show up on top of False?)
  - Maybe ordering would be good: https://github.com/theislab/scanpy/issues/490#issuecomment-768282049
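
As a rough user-side illustration of the "treat boolean as categorical" idea, and not how scanpy implements or would implement it, the conversion could look like the sketch below. The helper `bool_to_categorical` and the `BOOL_PALETTE` dict are hypothetical names, and the hex codes are an assumption (matplotlib's default blue and orange) chosen to match the colors described above.

```python
import pandas as pd

# Hypothetical palette keyed by label; hex codes are matplotlib's default blue and orange.
BOOL_PALETTE = {"False": "#1f77b4", "True": "#ff7f0e"}


def bool_to_categorical(values: pd.Series) -> pd.Series:
    """Map True/False to string labels with a fixed category order (hypothetical helper)."""
    labels = values.map({True: "True", False: "False"})
    # Fixing the category order keeps "False" first even if a value is absent from the data.
    return labels.astype(pd.CategoricalDtype(categories=["False", "True"]))


flags = pd.Series([True, False, True], name="boolean")
flags_cat = bool_to_categorical(flags)
colors = [BOOL_PALETTE[label] for label in flags_cat]
```

Fixing the categories explicitly means a column that happens to be all True, as in the original report, still carries both categories, so the color assigned to True does not depend on which values are present.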

Feb 18 '21 05:02 ivirshup

Guess we also ran into this https://github.com/theislab/ehrapy/issues/373#issuecomment-1108907602

Possible to handle internally in ehrapy, but maybe a solution in scanpy would be nicer.

Apr 25 '22 18:04 Zethson

I remember liking the color palette used in some slides from @yugeji. I think it might have been blue/red?

Apr 25 '22 19:04 ivirshup

@ivirshup are you looking for default colors for boolean variables?

Jun 15 '22 16:06 Zethson

Not particularly actively. Suggestions welcome

Jun 15 '22 17:06 ivirshup

Yellow & blue, as in viridis, is the gold standard, I think. The default of orange & blue should work as well, though.

Jun 15 '22 18:06 Zethson

I don't love using viridis for this since:

* One color is going to be much closer to the background
* Libraries are pretty inconsistent about whether purple or yellow is the high value

Right now, I'm leaning blue for False, red for True. It's a common divergent palette and people will be used to this from DE plots.

Jun 15 '22 21:06 ivirshup

> I don't love using viridis for this since:
>
> * One color is going to be much closer to the background
> * Libraries are pretty inconsistent about whether purple or yellow is the high value
>
> Right now, I'm leaning blue for False, red for True. It's a common divergent palette and people will be used to this from DE plots.

Interesting, thanks. Gets a +1 from me :) We're running into this issue very regularly with ehrapy where such boolean columns are very common.

Jun 15 '22 21:06 Zethson

Having the same issue and appreciate the posted workarounds. I found that if True and False are categorical, it is easy to accidentally treat them as booleans in e.g. a filtering operation. In that case, pandas will silently produce unexpected behavior. Whatever the change, the code should probably account for this e.g.

adata.obs['predicted_doublets'] == True would return False for every value since "True" != True
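
The pitfall is easy to reproduce with pandas alone. The sketch below first shows the silent all-False comparison, then a small guard, `as_bool_mask`, that is a hypothetical helper (not part of scanpy or anndata) which refuses anything other than real booleans or the two expected labels.

```python
import pandas as pd
from pandas.api.types import is_bool_dtype

predicted = pd.Series([True, False, True], name="predicted_doublets")
as_labels = predicted.astype(str).astype("category")

# Comparing string labels with the boolean True matches nothing, and raises no error.
silent_miss = as_labels == True  # noqa: E712


def as_bool_mask(column: pd.Series) -> pd.Series:
    """Return a boolean mask from a bool column or a "True"/"False" label column."""
    if is_bool_dtype(column):
        return column.astype(bool)
    labels = column.astype(str)
    unexpected = set(labels.unique()) - {"True", "False"}
    if unexpected:
        raise ValueError(f"Expected only 'True'/'False' labels, got {sorted(unexpected)}")
    return labels == "True"


mask = as_bool_mask(as_labels)
```

Routing filters through a check like this turns the silent mismatch into either a correct mask or an explicit error.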

Feb 20 '23 00:02 michael-swift
