cellxgene
Density / Z-Stack Planning
Put together by @liaprins-czi @sidneymbell @colinmegill
Problem: cell render overlap prevents user from quickly gauging cell count and comparison
Sources of problems
- Cells are rendered too large
- Cells are rendered on top of each other
- Cells are not transparent
- Cells are not rendered randomly but in layers which may completely obscure clusters
- (variable) DATASET SCALE - getting to 10 million cells
Non-mutually exclusive solutions
✅ For continuous data:
✅ Dot size - i.e., smaller and adaptively so
- Figma mocks; GitHub #759; Bruce to implement a general formula, and we can play around with these constants:
- Max and min optimal screen sizes
- Max and min optimal cell counts
- Max and min optimal dot diameter
- Shuffle points to avoid draw order overlap issues (insufficient)
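One way the adaptive dot size could work is sketched below. This is only an illustration, not the formula from #759: the function name, the log-scale interpolation, and every default constant are assumptions, and it uses cell count only (screen size is left out for brevity).

```python
import math

def adaptive_dot_diameter(n_cells,
                          min_cells=1_000, max_cells=1_000_000,
                          min_diameter=1.0, max_diameter=8.0):
    """Return a dot diameter in pixels that shrinks as the cell count grows.

    All defaults are hypothetical tuning constants, not cellxgene values.
    """
    # Clamp the count into the range we have tuned for.
    n = min(max(n_cells, min_cells), max_cells)
    # Position of n between the two bounds, on a log scale (0.0 to 1.0).
    t = (math.log10(n) - math.log10(min_cells)) / (
        math.log10(max_cells) - math.log10(min_cells))
    # Few cells -> max diameter, many cells -> min diameter.
    return max_diameter - t * (max_diameter - min_diameter)
```

With these placeholder constants, datasets at or below 1,000 cells get 8 px dots and datasets at or above 1,000,000 cells get 1 px dots, with a smooth transition in between.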
✅ Z-Stack cleverness
- https://github.com/chanzuckerberg/cellxgene/issues/817
- Insufficient by itself but helpful
✅ Contour
- https://github.com/chanzuckerberg/cellxgene/issues/597
- If exists, should be a toggle. Could be done such that it was a topological map that didn't interfere with coloring.
- https://github.com/d3/d3-contour
✅ Alpha
- https://github.com/chanzuckerberg/cellxgene/issues/657
- Used to exist, should exist again?
🤔 Binning, like hexbins
- https://github.com/chanzuckerberg/cellxgene/issues/861
- https://github.com/theislab/paga
- Big tradeoff / loss of information (and requires pre-computation of the minimum spanning tree)
🤔 Blending (see Figma)
🤔 Heat like https://github.com/chanzuckerberg/cellxgene/issues/597
- Interferes with color model
💖 Cloud, like https://github.com/chanzuckerberg/cellxgene/issues/861
- I.e., grouping like-cells together into 'metacells' (analogous to plotting quintiles to surface relationships between noisy variables). This could be cool and Sidney would have fun, but it would be more complicated because it entails a lot of decisions about the data (which may or may not generalize in a dataset-agnostic way)
- HCA scale data?
- Neighbor graph precomputed by users, traversal of graph is very fast
- Could be implemented as a 'granularity' slider
Figma boards
We also had a user request the sort_order=False behavior from scanpy. The behavior is described here:
"For continuous annotations used as color parameter, (do not) plot data points with higher values on top of others."
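To make the two behaviors concrete, here is a small sketch of how a draw order could be derived from a continuous annotation. It mimics the behavior described above and is not scanpy's implementation; the function name `draw_order` is made up for this example.

```python
def draw_order(values, sort_order=True):
    """Return point indices in the order they would be drawn.

    Points drawn later end up on top. With sort_order=True the points
    with the highest values are drawn last; with sort_order=False the
    original data order is kept.
    """
    indices = list(range(len(values)))
    if sort_order:
        # Stable sort: ties keep their original relative order.
        indices.sort(key=lambda i: values[i])
    return indices

expression = [0.5, 2.0, 0.1]
on_top_by_value = draw_order(expression)               # [2, 0, 1]
original_order = draw_order(expression, sort_order=False)  # [0, 1, 2]
```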
@colinmegill and I discussed a few additional potential mitigations to embedding density issues. Some issues can be mitigated by tuning umap and tsne parameters (see e.g. this blog post about tsne). Mitigations of this flavor are not relevant to users who are already using these algorithms optimally.
Potential mitigations:
- Teach scientists to build manifolds that are better structured in the spaces they're projected.
- Improve manifold projections defaults so scientists don't need to learn (estimate manifold volume, base parameters on total points, other).
- Give scientists more space to interact with manifolds to increase the size of data needed to reach the failure point (3d plotting)
- Let scientists iteratively zoom into manifolds with subsetting (inadequate if there are depth imperfections, as @colinmegill notes)
Hello, is there any update on this? I think of the many mentioned approaches, providing users a slider or a value input to adjust the dot size might be relatively simple and effective? Combined with the zoom function already there, it should handle both numerical and categorical coloring well.
Hi @nh3 — yes, a slider for size and a slider for alpha is one of the lowest barriers to tech implementation as well, I agree. We might try to get that in sooner @ambrosejcarr as it's a relatively small change to the codebase.
@mukamel-lab thanks for your comments about opportunities to improve the quality of the visual by addressing crowding in large datasets (> 100,000 cells). @signechambers1 and team are currently thinking about implementing an alpha slider that will enable the user to select a level of transparency for points in the scatterplot. We think this will help provide more insight into dense regions of the embedding. Do you think this capability addresses the issues you were observing?
Hi @ambrosejcarr - yes, that could be a good solution. Downsampling cells might also be useful; once the number of cells is greater than can be appreciated discretely (because of pixel size or because of human visual cognitive capacity), it becomes difficult to use a scatter plot.
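A minimal sketch of the downsampling idea, assuming a uniform random subsample with a fixed seed so the same cells are shown every time. The function name and the `max_points` threshold are hypothetical. Note that uniform sampling can drop rare populations entirely, so a real version would probably need something smarter.

```python
import random

def downsample_indices(n_cells, max_points, seed=0):
    """Pick at most max_points cell indices, reproducibly."""
    if n_cells <= max_points:
        # Nothing to do: every cell can be drawn.
        return list(range(n_cells))
    rng = random.Random(seed)  # local generator, does not touch global state
    # Sample without replacement, then restore the original data order.
    return sorted(rng.sample(range(n_cells), max_points))

subset = downsample_indices(n_cells=1_000_000, max_points=50_000)
```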
https://www.r-bloggers.com/2012/10/from-holey-polygons-to-convex-hulls/
These are issues we've also discussed around scanpy a few times (https://github.com/theislab/scanpy/issues/1263#issuecomment-761745895). Definitely a hard problem with no obvious one-size-fits-all solution! Some thoughts I've had that might be useful here:
Transparency
It's very hard to have a sense of equivalent transparency between different hues. Overlapping hues with varying transparency can also create ambiguity. Because of this, I don't really like alpha levels as a solution.
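The ambiguity can be illustrated with standard "source over" alpha compositing. This is a sketch of the arithmetic only, not of how cellxgene blends points; both helper names are invented for the example.

```python
def stacked_opacity(alpha, n_layers):
    """Combined opacity of n same-colored points stacked on one pixel."""
    return 1.0 - (1.0 - alpha) ** n_layers

def over(top_rgb, bottom_rgb, alpha):
    """Blend a semi-transparent top color onto an opaque bottom color."""
    return tuple(alpha * t + (1.0 - alpha) * b
                 for t, b in zip(top_rgb, bottom_rgb))

white, red, blue = (1.0, 1.0, 1.0), (1.0, 0.0, 0.0), (0.0, 0.0, 1.0)

# Same two points, same alpha, different draw order -> different pixel.
blue_on_red = over(blue, over(red, white, 0.5), 0.5)
red_on_blue = over(red, over(blue, white, 0.5), 0.5)
```

Ten stacked points at alpha 0.1 already reach about 0.65 combined opacity, and the two blended colors above differ even though they contain the same two points, so a given pixel color does not map back to a unique mix of categories.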
Point size
I think smaller points are quite effective, especially with embeddings like UMAP where there should be a minimum distance between points anyway. This can work well up until you start hitting pixel densities. It would be nice to have a bit more manual control over this in the front end. Points tend to overlap when "selected" (and all points are selected by default).

However, small points in sparse areas (especially with light hues) can be quite difficult to see. Some sort of shadowing around borders of points and canvas can help here. I saw some very nice examples of this from the Trapnell lab recently.
Binning (like datashader)
Binning and summarizing is an okay solution, but I have no idea what the right way to summarize categoricals is. See the linked scanpy issue for much discussion.
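To show why categoricals are awkward here, the sketch below bins points on a square grid and summarizes each bin by majority vote. This is just one possible summary, chosen for illustration; it does not use datashader's API, and the function name and return shape are assumptions.

```python
import math
from collections import Counter, defaultdict

def bin_majority(points, labels, bin_size):
    """Map each occupied grid bin to (majority label, cell count)."""
    bins = defaultdict(Counter)
    for (x, y), label in zip(points, labels):
        # Integer grid coordinates of the bin containing this point.
        key = (math.floor(x / bin_size), math.floor(y / bin_size))
        bins[key][label] += 1
    return {key: (counts.most_common(1)[0][0], sum(counts.values()))
            for key, counts in bins.items()}

points = [(0.1, 0.2), (0.3, 0.4), (0.6, 0.1), (1.5, 1.5)]
labels = ["T", "T", "B", "B"]
summary = bin_majority(points, labels, bin_size=1.0)
```

In this toy input the bin at (0, 0) is reported as "T" with 3 cells, so the single "B" cell in that bin disappears from the picture, which is exactly the kind of information loss a majority summary causes.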
Small multiples
Small multiples are very useful to make sure nothing is hidden, though you can lose a fair amount of information density. You can sort of get this effect by mousing over the labels (for multiples faceted by category), but the points become very large at the moment.

Small multiples can also be a good solution for large numbers of categories, since you can get around needing all colors to be discernible. Also a boon for color blind users.
Plus Tufte has a whole chapter on it, so can't go wrong 😉
Here's my stance on everything mentioned here. I may add more to this if I notice any issues I've left out.
For continuous data:
- #1020
- I like the idea of randomly shuffling the cells, although one obvious caveat is to make sure that the shuffling is seeded so that the plots look the same across CXG instances.
- I don't think sorting by expression is crucial to more easily visualize cells expressing a particular gene because that's solved by the brushable histograms (which are amazing, btw).
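A seeded shuffle of the draw order could look like the sketch below. The function name and default seed are hypothetical, and since the cellxgene client is not written in Python this only illustrates the idea of using a fixed seed so every instance draws cells in the same order.

```python
import random

def shuffled_draw_order(n_cells, seed=0):
    """Return a reproducible random permutation of cell indices."""
    order = list(range(n_cells))
    # A dedicated generator with a fixed seed gives the same permutation
    # on every call and leaves the global random state untouched.
    random.Random(seed).shuffle(order)
    return order

order_a = shuffled_draw_order(10)
order_b = shuffled_draw_order(10)  # identical to order_a
```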
Dot size - i.e., smaller and adaptively so
- Great idea for making zooming in/out more sensible and seems like it's already implemented.
Z-stack / Density issues (Addresses #817 #657 #597 #861)
Rank clusters by size ( #817 )
- Hovering over a cluster on the sidebar already pops the cluster to the front to make it easily visible.
- An edge case is if a group of cells erroneously looks like it is only one color, which prevents a user from zooming in and identifying a mixture of multiple clusters.
- Ranking by size is insufficient if you have multiple clusters of similar sizes overlapping each other.
- It would be amazing if hovering over a cell in the scatterplot would show a tooltip with the currently-selected categorical/continuous label. Very beneficial for colorblind users too, as they may not be able to tell which colors in the scatterplot correspond to which colors on the categorical sidebar.
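For reference, ranking by cluster size could be sketched as below, assuming the intent is to draw the largest clusters first so that smaller clusters land on top. That direction, and the function name, are assumptions rather than something #817 specifies.

```python
from collections import Counter

def draw_order_by_cluster_size(labels):
    """Order cell indices so cells of larger clusters are drawn first."""
    sizes = Counter(labels)
    # Sort by descending cluster size; the label itself breaks ties so
    # that equally sized clusters are not interleaved.
    return sorted(range(len(labels)),
                  key=lambda i: (-sizes[labels[i]], labels[i]))

labels = ["a", "b", "a", "c", "a", "b"]
order = draw_order_by_cluster_size(labels)  # [0, 2, 4, 1, 5, 3]
```

Two clusters of equal size still end up in an arbitrary (here alphabetical) order, which is the weakness noted above.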
Alpha / Density / Densmap ( #657 #597 )
- As @ivirshup mentioned, using alpha can non-uniformly skew the colors for categorical labels.
- In general, the density only tells you how much of your dataset concentrates in a particular part of the manifold. But that's already communicated by the cell count next to each label in the categorical sidebar.
- Personally, I think Densmap makes UMAPs look worse by accentuating potentially unimportant information in the data (e.g. how much of a particular cell type you have). I think this information can be more cleanly communicated by the categorical sidebar. Plus, Densmap is much slower than UMAP as its objective function is more complex.
- Overall, adding an alpha slider is the easiest solution and allows users to quickly visualize how cells are concentrated across the manifold without needing to cluster/label the data. But I don't think anything more than that needs to be done.
Coarse graining (PAGA/MetaCell) ( #861 )
- An unparalleled strength of CXG in the space of single-cell explorers is its ability to scale to millions of cells. It's a shame to sacrifice this in favor of looking at coarse-grained representations.
- Coarse-graining comes with its own set of UX problems. Graph partition methods rely heavily on the clustering methods used, which are almost always parameter-intensive. You can never be sure that every rare cell type properly segregated into its own cluster.
- Most graph partition methods suffer from class imbalances (cell types that are sampled more heavily tend to split up into more subclusters than cell types that are sampled less).
- Complex manifolds with differentiation trajectories are especially sensitive to the chosen level of graininess.
- Overlaying the coarse-grained representation on top of the UMAP makes it too cluttered.
TL;DR - In my opinion, most of the issues and sub-issues aggregated here are completely solved by interactivity and thus should be low priority if CXG's primary use case is not to generate static, publication-quality images. The only item on my wishlist here about improving interactivity is to add a plotly-esque hover tooltip to show the continuous/categorical label for cells I'm hovering over. This is important as with many clusters it may not be clear which color corresponds to which label.
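The lookup behind such a tooltip could be as simple as the sketch below: find the cell nearest to the cursor and return its label if it is close enough. This is a brute-force illustration with hypothetical names; at millions of cells a spatial index would be needed instead of a linear scan.

```python
def label_under_cursor(cursor, points, labels, max_distance):
    """Return the label of the nearest point within max_distance, else None."""
    cx, cy = cursor
    best_label = None
    best_d2 = max_distance ** 2  # compare squared distances, no sqrt needed
    for (x, y), label in zip(points, labels):
        d2 = (x - cx) ** 2 + (y - cy) ** 2
        if d2 <= best_d2:
            best_label, best_d2 = label, d2
    return best_label

points = [(0.0, 0.0), (10.0, 10.0)]
labels = ["T cell", "B cell"]
hit = label_under_cursor((1.0, 1.0), points, labels, max_distance=5.0)
miss = label_under_cursor((5.0, 5.0), points, labels, max_distance=2.0)
```

Here `hit` is "T cell" and `miss` is None, because no cell lies within 2 units of the second cursor position.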