scanpy
Support coloring by boolean variables
- [x] I have checked that this issue has not already been reported.
- [x] I have confirmed this bug exists on the latest version of scanpy.
- [ ] (optional) I have confirmed this bug exists on the master branch of scanpy.
Minimal code sample (that we can copy&paste without having any data)
import scanpy as sc
adata = sc.datasets.blobs()
sc.pp.pca(adata)
adata.obs['boolean'] = True
sc.pl.pca(adata, color='boolean')
... storing 'blobs' as categorical
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-11-1415b8dea7b8> in <module>
5 adata.obs['boolean'] = True
6
----> 7 sc.pl.pca(adata, color='boolean')
/opt/conda/lib/python3.7/site-packages/scanpy/plotting/_tools/scatterplots.py in pca(adata, annotate_var_explained, show, return_fig, save, **kwargs)
727 if not annotate_var_explained:
728 return embedding(
--> 729 adata, 'pca', show=show, return_fig=return_fig, save=save, **kwargs
730 )
731 else:
/opt/conda/lib/python3.7/site-packages/scanpy/plotting/_tools/scatterplots.py in embedding(adata, basis, color, gene_symbols, use_raw, sort_order, edges, edges_width, edges_color, neighbors_key, arrows, arrows_kwds, groups, components, layer, projection, scale_factor, color_map, cmap, palette, na_color, na_in_legend, size, frameon, legend_fontsize, legend_fontweight, legend_loc, legend_fontoutline, vmax, vmin, add_outline, outline_width, outline_color, ncols, hspace, wspace, title, show, save, ax, return_fig, **kwargs)
257 if sort_order is True and value_to_plot is not None and categorical is False:
258 # Higher values plotted on top, null values on bottom
--> 259 order = np.argsort(-color_vector, kind="stable")[::-1]
260 elif sort_order and categorical:
261 # Null points go on bottom
TypeError: The numpy boolean negative, the `-` operator, is not supported, use the `~` operator or the logical_not function instead.
Versions
scanpy==1.7.0 anndata==0.7.5 umap==0.5.1 numpy==1.20.0 scipy==1.6.0 pandas==1.2.1 scikit-learn==0.24.1 statsmodels==0.12.2 python-igraph==0.8.3 louvain==0.7.0 leidenalg==0.8.3
I think this is more of an enhancement than a bug, though an error message saying that coloring by boolean values is not supported would be clearer.
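For context, the TypeError can be reproduced with NumPy alone: unary minus is not defined for boolean arrays, so the negation inside the argsort call in the traceback fails before anything is drawn. The sketch below uses a made-up three-element color_vector and shows one possible way around it, casting to float before negating. It is only an illustration, not necessarily how scanpy should fix this.

```python
import numpy as np

# Hypothetical stand-in for the boolean obs column used as color vector.
color_vector = np.array([True, False, True])

# Unary minus on a boolean array raises the same TypeError as in the traceback.
error_message = None
try:
    np.argsort(-color_vector, kind="stable")
except TypeError as err:
    error_message = str(err)

# Casting to a numeric dtype first makes the negation well defined.
# The expression mirrors the one in the traceback; False points come
# first in the resulting order, True points last (i.e. drawn on top).
order = np.argsort(-color_vector.astype(float), kind="stable")[::-1]
print(order.tolist())  # [1, 2, 0]
```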
What would you expect this to look like? Which styling options apply here?
I came across this when I wanted to plot the predicted doublets from scrublet. predicted_doublet is stored as boolean. So, I would like to have this plotted like a categorical. I realized that plotting actually works when using pl.scatter:
import scanpy as sc
adata = sc.datasets.blobs()
sc.pp.pca(adata)
adata.obs['boolean'] = True
sc.pl.scatter(adata, color='boolean', basis='pca')
I have noticed the same issue. As a workaround, you can convert the column to a string categorical:
adata.obs['boolean'] = adata.obs['boolean'].astype(str).astype('category')
Thanks. I tried this as well. The problem that I had was that going back to a Boolean for subsetting was not easy:
adata.obs['boolean'] = adata.obs['boolean'].astype(str).astype('category')
adata[adata.obs['boolean'].astype(bool)]
This throws a key error:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-26-3fef793fe5bd> in <module>
----> 1 adata[adata.obs['boolean'].astype(bool)]
/opt/conda/lib/python3.7/site-packages/anndata/_core/anndata.py in __getitem__(self, index)
1085 def __getitem__(self, index: Index) -> "AnnData":
1086 """Returns a sliced view of the object."""
-> 1087 oidx, vidx = self._normalize_indices(index)
1088 return AnnData(self, oidx=oidx, vidx=vidx, asview=True)
1089
/opt/conda/lib/python3.7/site-packages/anndata/_core/anndata.py in _normalize_indices(self, index)
1066
1067 def _normalize_indices(self, index: Optional[Index]) -> Tuple[slice, slice]:
-> 1068 return _normalize_indices(index, self.obs_names, self.var_names)
1069
1070 # TODO: this is not quite complete...
/opt/conda/lib/python3.7/site-packages/anndata/_core/index.py in _normalize_indices(index, names0, names1)
32 index = index[0].values, index[1]
33 ax0, ax1 = unpack_index(index)
---> 34 ax0 = _normalize_index(ax0, names0)
35 ax1 = _normalize_index(ax1, names1)
36 return ax0, ax1
/opt/conda/lib/python3.7/site-packages/anndata/_core/index.py in _normalize_index(indexer, index)
99 not_found = indexer[positions < 0]
100 raise KeyError(
--> 101 f"Values {list(not_found)}, from {list(indexer)}, "
102 "are not valid obs/ var names or indices."
103 )
KeyError: 'Values [True, True, True, (.... I shorten this part....) True, True, True, True, True, True, True, True, True, True, True, True, True, True], are not valid obs/ var names or indices.'
while this works:
adata[adata.obs['boolean'].astype(bool) == True]
You can make a new column to avoid overwriting the boolean:
adata.obs['boolean_cat'] = adata.obs['boolean'].astype(str).astype('category')
Thanks. Agreed. This is a workaround for sure.
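If only the string categorical is at hand, comparing against the string label is one way to get a boolean mask back, since bool("False") is True in Python and a plain cast cannot tell the two labels apart. A minimal pandas-only sketch, where the small DataFrame and its cell names are made up and stand in for adata.obs:

```python
import pandas as pd

# Hypothetical stand-in for adata.obs with a boolean annotation.
obs = pd.DataFrame(
    {"boolean": [True, False, True]},
    index=["cell0", "cell1", "cell2"],
)

# Keep the original boolean and add a categorical copy for plotting.
obs["boolean_cat"] = obs["boolean"].astype(str).astype("category")

# Recover a boolean mask from the categorical by comparing with the label.
mask = (obs["boolean_cat"] == "True").to_numpy()
subset = obs[mask]

print(mask.tolist())       # [True, False, True]
print(list(subset.index))  # ['cell0', 'cell2']
```

The same mask could then be used for subsetting, e.g. adata[mask], assuming the rows are in the same order as adata.obs.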
Possible solution: just treat boolean as categorical
- This would also handle nullable booleans (though we currently cannot write those through anndata).
- Color palettes would just be palette={True: "#xxxxxx", False: "#xxxxxx"}.
- Default color palette (current behavior for 2 group categories) is orange for true, blue for false
  - We could change this, but I'm not sure what a good default is if lightgray is null
- Ordering would be done as categorical, not numeric (would it make more sense for True to show up on top of False?)
  - Maybe ordering would be good: https://github.com/theislab/scanpy/issues/490#issuecomment-768282049
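Until something like this exists in scanpy, the idea in the list above can be approximated on the user side. The sketch below is an assumption-laden illustration: bool_to_categorical is a hypothetical helper, not a scanpy or anndata function, and the palette uses string keys (because the categories are strings) with the default matplotlib orange and blue. Missing values in a nullable boolean column stay missing.

```python
import pandas as pd


def bool_to_categorical(values: pd.Series) -> pd.Series:
    """Hypothetical helper: map booleans to the labels "False"/"True".

    The category order is fixed to ["False", "True"]; missing values
    (None, NaN, pd.NA) are kept as missing.
    """
    labels = [None if pd.isna(v) else str(bool(v)) for v in values]
    cat = pd.Categorical(labels, categories=["False", "True"])
    return pd.Series(cat, index=values.index, name=values.name)


# Nullable boolean column, e.g. a doublet call with one missing entry.
calls = pd.Series([True, None, False], dtype="boolean", name="predicted_doublet")
converted = bool_to_categorical(calls)

# Assumed colors: matplotlib's default orange for True, blue for False.
palette = {"True": "#ff7f0e", "False": "#1f77b4"}

print(list(converted.cat.categories))  # ['False', 'True']
print(converted.isna().tolist())       # [False, True, False]

# Possible use with scanpy (not run here; assumes palette accepts a dict):
# adata.obs["predicted_doublet_cat"] = bool_to_categorical(adata.obs["predicted_doublet"])
# sc.pl.pca(adata, color="predicted_doublet_cat", palette=palette)
```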
Guess we also ran into this https://github.com/theislab/ehrapy/issues/373#issuecomment-1108907602
Possible to handle internally in ehrapy, but maybe a solution in scanpy would be nicer.
I remember liking the color palette used in some slides from @yugeji. I think it might have been blue/red?
@ivirshup are you looking for default colors for boolean variables?
Not particularly actively. Suggestions welcome
yellow & blue like viridis is the gold standard I think. The default of orange & blue should work as well though.
I don't love using viridis for this since:
- One color is going to be much closer to the background
- Libraries are pretty inconsistent about whether purple or yellow is the high value
Right now, I'm leaning blue for False, red for True. It's a common divergent palette and people will be used to this from DE plots.
Interesting, thanks. Gets a +1 from me :) We're running into this issue very regularly with ehrapy where such boolean columns are very common.
Having the same issue and appreciate the posted workarounds. I found that if True and False are stored as categorical strings, it is easy to accidentally treat them as booleans, e.g. in a filtering operation. In that case, pandas silently returns an unexpected result instead of raising an error. Whatever the change, the code should probably account for this. For example,
adata.obs['predicted_doublets'] == True
would return False for every value since "True" != True
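The pitfall can be shown with pandas alone, together with one possible guard. In this sketch as_bool_mask is a hypothetical helper (not part of pandas, anndata, or scanpy) that refuses to use a non-boolean column as a filter, so the mistake fails loudly instead of silently selecting nothing.

```python
import pandas as pd
from pandas.api.types import is_bool_dtype


def as_bool_mask(column: pd.Series) -> pd.Series:
    """Hypothetical guard: only accept columns with a boolean dtype."""
    if is_bool_dtype(column):
        return column
    raise TypeError(
        f"Expected a boolean column, got dtype {column.dtype!r}; "
        "compare against the label (e.g. == 'True') instead."
    )


# String categorical as produced by the workaround in this thread.
labels = pd.Series([True, False, True]).astype(str).astype("category")

# Comparing with the boolean True matches nothing and raises no error.
silent = labels == True  # noqa: E712
print(silent.tolist())  # [False, False, False]

# The guard turns the silent mismatch into an explicit error.
guard_error = None
try:
    as_bool_mask(labels)
except TypeError as err:
    guard_error = str(err)
print(guard_error is not None)  # True
```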