scanpy

Support coloring by boolean variables

Open fbnrst opened this issue 3 years ago • 14 comments

  • [x] I have checked that this issue has not already been reported.
  • [x] I have confirmed this bug exists on the latest version of scanpy.
  • [ ] (optional) I have confirmed this bug exists on the master branch of scanpy.

Minimal code sample (that we can copy&paste without having any data)

import scanpy as sc
adata = sc.datasets.blobs()
sc.pp.pca(adata)

adata.obs['boolean'] = True

sc.pl.pca(adata, color='boolean')
... storing 'blobs' as categorical

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-11-1415b8dea7b8> in <module>
      5 adata.obs['boolean'] = True
      6 
----> 7 sc.pl.pca(adata, color='boolean')

/opt/conda/lib/python3.7/site-packages/scanpy/plotting/_tools/scatterplots.py in pca(adata, annotate_var_explained, show, return_fig, save, **kwargs)
    727     if not annotate_var_explained:
    728         return embedding(
--> 729             adata, 'pca', show=show, return_fig=return_fig, save=save, **kwargs
    730         )
    731     else:

/opt/conda/lib/python3.7/site-packages/scanpy/plotting/_tools/scatterplots.py in embedding(adata, basis, color, gene_symbols, use_raw, sort_order, edges, edges_width, edges_color, neighbors_key, arrows, arrows_kwds, groups, components, layer, projection, scale_factor, color_map, cmap, palette, na_color, na_in_legend, size, frameon, legend_fontsize, legend_fontweight, legend_loc, legend_fontoutline, vmax, vmin, add_outline, outline_width, outline_color, ncols, hspace, wspace, title, show, save, ax, return_fig, **kwargs)
    257         if sort_order is True and value_to_plot is not None and categorical is False:
    258             # Higher values plotted on top, null values on bottom
--> 259             order = np.argsort(-color_vector, kind="stable")[::-1]
    260         elif sort_order and categorical:
    261             # Null points go on bottom

TypeError: The numpy boolean negative, the `-` operator, is not supported, use the `~` operator or the logical_not function instead.
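
For context, the failure can be reproduced with NumPy alone, independent of scanpy. The sketch below is illustrative and is not a proposed patch: the `color_vector` name is borrowed from the traceback, and casting to `int` before negating is just one assumed way to make the sort well-defined for booleans.

```python
import numpy as np

# Stand-in for the per-cell values scanpy would sort before plotting.
color_vector = np.array([True, False, True, False])

# Unary minus is not defined for boolean arrays, which is the error above.
raised = False
try:
    -color_vector
except TypeError:
    raised = True

# Assumption: casting to an integer dtype first keeps the original sorting
# logic intact (higher values are drawn last, i.e. on top).
order = np.argsort(-color_vector.astype(int), kind="stable")[::-1]
```

With this ordering the `False` points come first and the `True` points are drawn on top of them.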

Versions

scanpy==1.7.0 anndata==0.7.5 umap==0.5.1 numpy==1.20.0 scipy==1.6.0 pandas==1.2.1 scikit-learn==0.24.1 statsmodels==0.12.2 python-igraph==0.8.3 louvain==0.7.0 leidenalg==0.8.3

fbnrst avatar Feb 11 '21 18:02 fbnrst

I think this is more of an enhancement than a bug, though an error message saying we don't have a way to color by boolean values would be clearer.

What would you expect this to look like? Which styling options apply here?

ivirshup avatar Feb 12 '21 00:02 ivirshup

I came across this when I wanted to plot the predicted doublets from scrublet. predicted_doublet is stored as boolean. So, I would like to have this plotted like a categorical. I realized that plotting actually works when using pl.scatter:

import scanpy as sc
adata = sc.datasets.blobs()
sc.pp.pca(adata)

adata.obs['boolean'] = True

sc.pl.scatter(adata, color='boolean', basis='pca')

fbnrst avatar Feb 12 '21 09:02 fbnrst

I have noticed the same issue. As a workaround you can do

adata.obs['boolean'] = adata.obs['boolean'].astype(str).astype('category')

fidelram avatar Feb 12 '21 14:02 fidelram

Thanks. I tried this as well. The problem that I had was that going back to a Boolean for subsetting was not easy:

adata.obs['boolean'] = adata.obs['boolean'].astype(str).astype('category')
adata[adata.obs['boolean'].astype(bool)]

This raises a KeyError:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-26-3fef793fe5bd> in <module>
----> 1 adata[adata.obs['boolean'].astype(bool)]

/opt/conda/lib/python3.7/site-packages/anndata/_core/anndata.py in __getitem__(self, index)
   1085     def __getitem__(self, index: Index) -> "AnnData":
   1086         """Returns a sliced view of the object."""
-> 1087         oidx, vidx = self._normalize_indices(index)
   1088         return AnnData(self, oidx=oidx, vidx=vidx, asview=True)
   1089 

/opt/conda/lib/python3.7/site-packages/anndata/_core/anndata.py in _normalize_indices(self, index)
   1066 
   1067     def _normalize_indices(self, index: Optional[Index]) -> Tuple[slice, slice]:
-> 1068         return _normalize_indices(index, self.obs_names, self.var_names)
   1069 
   1070     # TODO: this is not quite complete...

/opt/conda/lib/python3.7/site-packages/anndata/_core/index.py in _normalize_indices(index, names0, names1)
     32             index = index[0].values, index[1]
     33     ax0, ax1 = unpack_index(index)
---> 34     ax0 = _normalize_index(ax0, names0)
     35     ax1 = _normalize_index(ax1, names1)
     36     return ax0, ax1

/opt/conda/lib/python3.7/site-packages/anndata/_core/index.py in _normalize_index(indexer, index)
     99                 not_found = indexer[positions < 0]
    100                 raise KeyError(
--> 101                     f"Values {list(not_found)}, from {list(indexer)}, "
    102                     "are not valid obs/ var names or indices."
    103                 )

KeyError: 'Values [True, True, True, (.... I shorten this part....) True, True, True, True, True, True, True, True, True, True, True, True, True, True], are not valid obs/ var names or indices.'

while this works:

adata[adata.obs['boolean'].astype(bool) == True]

fbnrst avatar Feb 12 '21 17:02 fbnrst

You can make a new column to avoid overwriting the boolean:

adata.obs['boolean_cat'] = adata.obs['boolean'].astype(str).astype('category')
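
If only the string-valued categorical is available, a boolean mask can still be recovered by comparing against the label rather than casting. This is a minimal pandas-only sketch; the `obs` frame here is a stand-in for `adata.obs`, not real data.

```python
import pandas as pd

# Stand-in for adata.obs with a boolean column and its categorical copy.
obs = pd.DataFrame({"boolean": [True, False, True]})
obs["boolean_cat"] = obs["boolean"].astype(str).astype("category")

# Truthiness is not parsing: any non-empty string is truthy in Python,
# so casting the labels to bool cannot distinguish "True" from "False".
false_label_is_truthy = bool("False")

# Comparing against the label gives a proper boolean mask.
mask = (obs["boolean_cat"] == "True").to_numpy()
```

The resulting `mask` is a plain NumPy boolean array, which is the kind of indexer used for subsetting, e.g. `adata[mask]`.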

fidelram avatar Feb 12 '21 17:02 fidelram

Thanks. Agreed. This is a workaround for sure.

fbnrst avatar Feb 12 '21 17:02 fbnrst

Possible solution: just treat boolean as categorical

  • This would also handle nullable booleans (though we currently cannot write those through anndata).
  • Color palettes would just be palette={True: "#xxxxxx", False: "#xxxxxx"}.
  • The default color palette (current behavior for categoricals with 2 groups) would be orange for True, blue for False
    • We could change this, but I'm not sure what a good default is if lightgray is null
  • Ordering would be done as categorical, not numeric (would it make more sense for True to show up on top of False?)
    • Maybe ordering would be good: https://github.com/theislab/scanpy/issues/490#issuecomment-768282049
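
The bullets above could be sketched roughly as follows. This is not how scanpy implements anything: `boolean_to_colors` is a hypothetical helper, the matplotlib color names `tab:orange` and `tab:blue` are assumed stand-ins for the orange/blue defaults, and drawing `True` on top is only one of the ordering options raised.

```python
import numpy as np
import pandas as pd


def boolean_to_colors(values, palette=None, na_color="lightgray"):
    """Hypothetical helper: map (nullable) booleans to colors and a draw order."""
    if palette is None:
        # Assumed defaults mirroring "orange for True, blue for False".
        palette = {True: "tab:orange", False: "tab:blue"}
    # The nullable "boolean" dtype keeps missing values as <NA>.
    values = pd.Series(values, dtype="boolean")
    colors = np.array(
        [na_color if pd.isna(v) else palette[bool(v)] for v in values],
        dtype=object,
    )
    # Draw order: nulls at the bottom, then False, then True on top.
    is_null = values.isna().to_numpy()
    as_int = values.fillna(False).to_numpy(dtype=int)
    rank = np.where(is_null, -1, as_int)
    order = np.argsort(rank, kind="stable")
    return colors, order


colors, order = boolean_to_colors([True, False, None, True])
```

Passing `palette={True: "#xxxxxx", False: "#xxxxxx"}` with real hex codes would override the defaults, matching the palette bullet.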

ivirshup avatar Feb 18 '21 05:02 ivirshup

Guess we also ran into this https://github.com/theislab/ehrapy/issues/373#issuecomment-1108907602

Possible to handle internally in ehrapy, but maybe a solution in scanpy would be nicer.

Zethson avatar Apr 25 '22 18:04 Zethson

I remember liking the color palette used in some slides from @yugeji. Think it might have been blue/red?

ivirshup avatar Apr 25 '22 19:04 ivirshup

@ivirshup are you looking for default colors for boolean variables?

Zethson avatar Jun 15 '22 16:06 Zethson

Not particularly actively. Suggestions welcome

ivirshup avatar Jun 15 '22 17:06 ivirshup

Yellow & blue like viridis is the gold standard, I think. The default of orange & blue should work as well though.

Zethson avatar Jun 15 '22 18:06 Zethson

I don't love using viridis for this since:

  • One color is going to be much closer to the background
  • Libraries are pretty inconsistent about whether purple or yellow is the high value

Right now, I'm leaning blue for False, red for True. It's a common divergent palette and people will be used to this from DE plots.

ivirshup avatar Jun 15 '22 21:06 ivirshup

Interesting, thanks. Gets a +1 from me :) We're running into this issue very regularly with ehrapy where such boolean columns are very common.

Zethson avatar Jun 15 '22 21:06 Zethson

Having the same issue and appreciate the posted workarounds. I found that if True and False are categorical, it is easy to accidentally treat them as booleans in e.g. a filtering operation. In that case, pandas will silently produce unexpected behavior. Whatever the change, the code should probably account for this e.g.

adata.obs['predicted_doublets'] == True would return False for every value since "True" != True
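
A small pandas-only sketch of that pitfall, with one possible guard. `as_bool_mask` is a hypothetical helper, not part of scanpy or anndata, and the plain `pd.Series` objects stand in for a column of `adata.obs`.

```python
import pandas as pd


def as_bool_mask(column):
    """Hypothetical guard: accept real booleans or "True"/"False" labels only."""
    if pd.api.types.is_bool_dtype(column):
        return column.to_numpy(dtype=bool)
    labels = column.astype(str)
    unexpected = set(labels.unique()) - {"True", "False"}
    if unexpected:
        raise ValueError(f"Not a boolean-like column: {sorted(unexpected)}")
    return (labels == "True").to_numpy(dtype=bool)


flags = pd.Series([True, False, True])
as_category = flags.astype(str).astype("category")

# The silent failure: comparing string labels with the Python bool True
# matches nothing, so every entry is False and no error is raised.
silent = as_category == True  # noqa: E712

# The guard recovers the intended mask from either representation.
mask_from_category = as_bool_mask(as_category)
mask_from_bool = as_bool_mask(flags)
```

Raising on unexpected labels is a deliberate choice here: a filter that quietly selects nothing is harder to notice than an exception.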

michael-swift avatar Feb 20 '23 00:02 michael-swift