Interactively rendering ~10 million points using datashader
🚀 Feature
Hello everyone, I'm not sure what work has been done on optimizing the Points layer, but I thought I'd have a go at using datashader to render into an image that napari can handle.
Motivation
We work with large segmented images that have millions of cells, and visualizing them somehow would be great.
Pitch
I threw together a script as a proof of concept, with the gif below:
Here's the script. I'm not really familiar with vispy or napari Layers, so please let me know if you see something that doesn't fit the intended API.
Edit: I moved the script to this gist
A few notes:
- The gif above shows 1e7 points, but it gets a little laggy; 1e6 points is pretty close to 60 fps.
- I have also tried rasterizing polygons with datashader, but it is significantly slower.
- datashader's shade and dynspread can improve the interpretation of dense parts of the image, but I didn't include them here because they ended up being too slow (a minimal sketch of the pipeline follows this list).
- Although datashader uses numba-optimized python code, maybe some better algorithms/dask parallelization on the spatialpandas/datashader side can improve performance.
- At high resolution, the individual points appear as single pixels; maybe it'd be possible to seamlessly switch between datashader's downsampled view and an alternative view with better rendering of points at certain zoom levels.
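For reference, a rough sketch of this kind of datashader pipeline (illustrative sizes and column names, not the gist itself):

```python
import datashader as ds
import datashader.transfer_functions as tf
import numpy as np
import pandas as pd

# 1e7 random points, analogous to the gif above
rng = np.random.default_rng(seed=0)
df = pd.DataFrame(rng.standard_normal(size=(10_000_000, 2)), columns=["x", "y"])

# aggregate into a fixed-size raster: each pixel counts the points inside it
cvs = ds.Canvas(plot_width=800, plot_height=600)
agg = cvs.points(df, "x", "y", agg=ds.count())

# shade/dynspread improve legibility of dense regions, but were too slow
# for interactive use in my tests
img = tf.dynspread(tf.shade(agg))
```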
I'd love to hear your thoughts!
Hi @ahnsws,
I'm sorry that we haven't replied yet. Looks awesome, thank you! Would you be willing to join one of the community meetings sometime to discuss? You can find the schedule here. We also have a Zulip, where I post the agenda just before the meeting and people can freely add items.
To echo what @melonora said: I'm one of the rocket emojis, but forgot to follow up! I hope you can make a community meeting to give us more context. Looking at the datashader docs a smidge, I almost see this as a different layer type than Points? More of a heat map? Could be really amazing to represent certain features, like classifier output, particularly on multiscale images?
Hi @melonora, I'd be happy to! Although the one tomorrow at 3am EST (my timezone) might be hard to join, so I will plan on joining the one next week (on 9/27).
Hi @psobolewskiPhD, yes exactly, this layer is kind of in between an Image and a Points layer: because there are so many more points than pixels when zoomed out, datashader acts as a 2D histogram, and I believe this is actually the intended use case. But I think it would be very useful if it acted as a true Points layer when zoomed in enough to see individual points.
I don't see why you couldn't use this to visualize other features computed dynamically from the viewport dimensions and position; basically, anything that has the signature of the _update_draw method. Actually, I had hoped that millions of polygons (and not just points) could be visualized, which datashader does support, but it is too slow for the time being, even with spatialpandas and its fancy spatial indexing. I wonder what QuPath is doing there, since it is capable of buttery-smooth interactivity even with millions of polygons.
Here are some docs that inspired me:
@ahnsws Awesome as I won't be able to join today either haha. I will add you to the agenda for next week.
I am familiar with datashader from interactive plotting libraries like bokeh. I noticed the same thing there: it can get laggy when going above 10 million points. I could be wrong here, but I believe this is because rendering the pixel buffer still depends on the whole dataset. I see the point of it being in between Image and Points, though ultimately the data source is still points. @andy-sweet @kephale I think it would be interesting to see how this compares to tiled rendering / async. In any case, the datashader approach wouldn't require a chunked data format. What do you think?
Hi @ahnsws, just as a reminder, tomorrow at 8:30 am PT. Hope to see you there!
Agenda: https://hackmd.io/BXWDZ3i8Q6OAEASrkaSNIQ Zoom link: https://czi.zoom.us/j/759708263
Hi @melonora, yes planning on it, thank you for the links!
@melonora thanks for inviting me to the community meeting.
Here is a gist for reproducing the polygons example, using the NYC buildings dataset (zip; download warning).
Edit: the dataset has around 1 million polygons, and performance seems to depend on the datashader aggregation method.
And to clarify in general, the code just asks napari for the canvas dimensions, then asks datashader to make a 2D histogram of the geometries, and then asks napari to render that histogram directly, without any transforms.
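In sketch form (with placeholder names: `df` holds the points, `img_layer` is the Image layer that receives the raster; the translate/scale bookkeeping that keeps the raster aligned with the view is omitted):

```python
import datashader as ds
import numpy as np

def redraw(event=None):
    # canvas size in screen pixels (exact attribute varies by napari version)
    w, h = viewer.window.qt_viewer.canvas.size
    # visible world extent; corner_pixels is (2, ndim) in data coordinates
    (ymin, xmin), (ymax, xmax) = img_layer.corner_pixels
    cvs = ds.Canvas(plot_width=int(w), plot_height=int(h),
                    x_range=(xmin, xmax), y_range=(ymin, ymax))
    agg = cvs.points(df, "x", "y", agg=ds.count())  # the 2D histogram
    img_layer.data = np.asarray(agg)                # napari renders it as-is

viewer.camera.events.connect(redraw)  # recompute on pan/zoom
```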
Thanks @ahnsws for showing this off at the community meeting today and nice to meet you.
I am curious where the bottleneck is in napari, so I started looking into it (the Points layer specifically). I can't even create a points layer in reasonable time with 10M points 😅. However, just playing around, it seems that vispy can handle it.
For example try this - first create a viewer and a layer with only the first 10k points:
import napari
import numpy as np

viewer = napari.Viewer()
qt_viewer = viewer.window.qt_viewer

rng = np.random.default_rng(seed=0)
N = 10_000_000
points = rng.standard_normal(size=(N, 2)) * 1000

# only tell napari about the first 0.1% of points (10k)
layer = viewer.add_points(points[: int(0.001 * N)])
then in the napari console update the visual directly to show all the points:
visual = qt_viewer.layer_to_visual[layer]
# vispy expects (x, y) order while napari stores (row, column), hence the ::-1
visual.node._subvisuals[0].set_data(points[:, ::-1])
This remains about as fast as the datashader example for me and suggests there may be a way to achieve this level of interactivity with fewer changes. I think some more tweaks (perhaps in vispy) could also improve the appearance.
Anyway, as mentioned above, adding some type of support for the (very nice looking) datashader representations has other benefits/applications, but I found this interesting.
I played around a bit with the image layer data protocol approach and got something kind of working. It's a little shorter than the approach above, but has some issues and also needs to access parts of the private API, so it's not really any better (maybe actually worse!).
Agreed with @aganders3 that there's something suspicious about the surprisingly slow behavior in napari. I poked around a bit more, did some profiling, and found the main offending line in the symbols setter, which is called in Points.__init__. The setter took an average of 11.2 seconds for 10^6 points (I didn't try 10^7). Replacing that line with just self._symbol = symbol makes the setter take an average of 11.8 microseconds with the same number of points. I'll write up a separate issue to track that.
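For reference, roughly how such a hot spot can be surfaced (a sketch, not necessarily the exact profiling setup I used):

```python
import cProfile

import numpy as np
import napari

points = np.random.default_rng(seed=0).standard_normal((1_000_000, 2)) * 1000

# sort by cumulative time; the symbol setter shows up near the top
cProfile.run("napari.layers.Points(points)", sort="cumtime")
```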
After changing that line, interacting with 10^6 points is pretty good on my machine (2020 Intel macbook pro). 10^7 kinda works, but it's a real struggle to do much. Whereas the example above works pretty well for me!
Just out of curiosity, I tested 10M points in 0.5.6rc0. My script:
import napari
import numpy as np
import time
NUM_POINTS = 10_000_000
rng = np.random.default_rng(seed=0)
points = rng.standard_normal(size=(NUM_POINTS, 2)) * 1000
viewer = napari.Viewer()
start = time.time()
layer = napari.layers.Points(points)
create = time.time()
print("creating layer:", create - start)
viewer.add_layer(layer)
end = time.time()
print("Added layer:", end - create)
napari.run()
results on my 2020 M1:
creating layer: 1.658944845199585
Added layer: 6.519549131393433
Not bad at all!
Pan is a bit slow, zoom is pretty fine. Selection is fine. But moving a point blocks.
(If you change to 1M points then everything is instant except moving a point, which is pretty laggy but doesn't block)
I've had some time to revisit this, especially now that we are attempting some spatial transcriptomics analysis with napari, and I was able to adapt the above code to visualize ~530M spots from a Nanostring CosMx run we did a while ago. It's eye-wateringly slow (~10 seconds per frame, dropping to ~2 seconds when zoomed in), even with dask + parquet + spatialpandas, and despite my best efforts it still blocks the main thread, but it's an interesting way to visualize the data nonetheless, especially on top of registered image data. There's probably low-hanging fruit with regard to a more efficient method of spatial indexing somewhere. However, I'm honestly not sure of the utility of interacting with ~1e8 points this way...
Zoomed in:
I was inspired by this post from Holoviz.
I think for ~1e7 points, the approach @psobolewskiPhD demonstrated above sounds perfect; perhaps having a "non-editable" mode to avoid inadvertently moving a point would be a nice-to-have.
Pinging @melonora as much of his focus right now is in this area, and I'd guess he has some thoughts. Also maaaybe @brisvag if I understand some of what he talked about today at the community meeting correctly.
This looks like a tough challenge that I have little knowledge on, so sorry I can't say much more! Besides woohoo science!
@ahnsws Very awesome! I have end users I support at my day job that would be very interested in this. When you say:
> I was able to adapt the above code to visualize ~530M spots from a Nanostring CosMx
which code do you mean?
@psobolewskiPhD ah I see, I was referring to this gist, but it's a bit outdated now.
Edit: here is the updated gist using the Marine AIS dataset, just to recreate a very common workflow I encounter, where I get a very large dataset as one or more CSV files. One issue I had was that the y axis is positive down in napari, so all the routes are upside down (clearly) (edit: I tried playing around with the vispy scene transform and was able to invert the scene, but the mouse interactions remained un-inverted). If anyone spots any inefficiencies, please do let me know!
> One issue I had was that the y axis is positive down in napari, so all the routes are upside down
Ah, something I can help with. There is wonderfully great news for you that makes this WAY easier. Thanks to @jni, napari 0.6.0 ships with camera orientation for exactly this reason. Check out the alpha to play with it (0.6.0a0 just contains the API); 0.6.0a1 is coming soon.
You are also welcome to play with all the orientation stuff on main. #7663 adds the API (which is in 0.6.0a0), #7686 a GUI to play with this in the viewer, #7770 the handedness property, and #7784 adds the orientation to preferences.
For you, you'll want viewer.camera.orientation = ('towards', 'up', 'right'), I'm pretty sure, though we've flipped things a bit since 0.5.6.
Hi @TimMonko, thanks for the info! I just tried viewer.camera.orientation = ('towards', 'up', 'right') with 0.6.0a0 and unfortunately the scene goes purple (with viridis applied). I'll play around with it more and open a separate issue if needed; it's likely some bounds are just swapped (edit: for me specifically, not in general). Not a big deal though since this specific dataset was just for testing. Thanks!
I think the colormap issue is due to the calculation of x_range and y_range. It works with
x_range = (min(minx, maxx), max(minx, maxx))
y_range = (min(miny, maxy), max(miny, maxy))
but the data itself doesn't flip with the camera orientation (or rather, the data flips, but so does the entire canvas, I think). It's something that likely isn't hooked up like in other layers. A bit above me though ❤ I can look into it as well if you have any specific questions.
Partly just wanted to drop in that color fix so Juan didn't panic that orientation is broken. (or maybe it is, but like you said, probably the layer)
Yeah I agree, the way I've coded the boundaries in physical/world coordinates is a bit hacky. The geometries are already specified in physical/world coordinates, so I think the way I obtain the corner_pixels needs to change to match how napari computes them.
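One possible workaround (untested): leave the camera alone and flip the rasterized histogram vertically before handing it to napari, since napari draws row 0 at the top while the aggregation's y axis ascends. Assuming `agg` is the datashader aggregation and `img_layer` the Image layer:

```python
import numpy as np

# flip rows so the routes appear right-side up under napari's y-down convention
img_layer.data = np.flipud(np.asarray(agg))
```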
> Pinging @melonora as much of his focus right now is in this area, and I'd guess he has some thoughts. Also maaaybe @brisvag if I understand some of what he talked about today at the community meeting correctly.

> This looks like a tough challenge that I have little knowledge on, so sorry I can't say much more! Besides woohoo science!
Mmm, not sure it is, though I admit I didn't follow the whole conversation closely and I'm not familiar with datashader and why it's being used here :P For context, @TimMonko is referring to my planned work on instanced rendering (which should speed up rendering of a lot of similar objects), but this does not apply to points, which are effectively already using a similar system.
Thanks Lorenzo, I was more referring to "trying to render lots of objects with applied transforms", which I know datashader isn't doing, but maybe you have advice or ideas for trying to render effectively 530M objects on the canvas.
Both instanced rendering and datashading have their different use cases. Here I would say that datashading is more suitable if the person viewing the data knows that datashading is being applied. A colormapping indicating how many gene targets are aggregated in a particular area is more informative than trying to show each particular gene target.
Maybe the broader discussion should be whether this belongs in a plugin or in core. I don't have time to play around with it now, but very cool! In case people think it should live in a plugin, I could incorporate it in napari-spatialdata, with credit going to you @ahnsws of course. Personally I still have some doubts, as datashading does alter the way we actually see the data. It is not like the work of @Czaki on shapes, which is a great speed improvement while still showing the data the same way.
@ahnsws if you would like to give SpatialData (https://spatialdata.scverse.org/en/stable/) a try for your data let me know! Would be happy to help out and would also appreciate any feedback
Hello all!
I did this initially out of curiosity, since when I first opened this issue it wasn't possible to plot 10M points in napari; as @psobolewskiPhD showed, this can now be done out of the box. However, now that we are regularly working with 100M to 1B points, and since it looked like datashader could manage this, I thought I'd give it a go.
Hi @melonora, I think this would work well as a plugin. I'd be more than happy to continue working on making this more efficient, as being able to render 100M-1B points this way is just so incredibly useful to us. If there are any suitable alternatives, please let me know (since I'd rather be working on the spatial biology itself haha...)
As for interpreting the datashaded image, it is effectively a distributed 2D histogram, and so I'd imagine it comes with the benefits and pitfalls of one. This post from datashader goes into some of the downsides of plotting so many points that they overlap each other, hiding the distribution. As seen in @psobolewskiPhD's image with 10M points, it is unclear just how dense things are in the middle of the cluster, and without the code, I wouldn't have known that it was normally distributed. I agree with @melonora that for our particular use case, the distribution of points is indeed more informative than the individual points themselves.
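To make the "2D histogram" point concrete: a count aggregation is essentially what numpy computes below (up to bin-edge conventions, which I haven't checked exactly; `x`, `y`, the ranges, and the raster size are placeholders):

```python
import numpy as np

counts, _, _ = np.histogram2d(
    y, x,                               # rows bin the y values, columns the x
    bins=(600, 800),                    # (plot_height, plot_width)
    range=((ymin, ymax), (xmin, xmax)),
)
# each pixel of the datashaded image is then the point count of its bin
```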
Yes, I have heard of spatialdata! I haven't had a chance to use it yet but it looks very promising now that we have a few dozen CosMx datasets so far (with more on the way).
Great! Let me know if you would like to set up a short meeting to hear about some of the developments we are planning, ask questions in general, or discuss the uptake of this in a plugin!
I think that since the core of this issue has been resolved (for me, ~50M points using v0.5.6 is slightly sluggish but still absolutely works), I'll close this issue. I think if there is possible future work on dedicated out-of-memory scatter-plot rendering, that should deserve its own issue. Thank you everyone for the discussion!
I'm going to reopen because the datashader approach is cool and workable. Open issue is easier to find!
Agreed. The minimum thing that would close this for me is a PR adding a napari-datashader example to the gallery.
This issue has been mentioned on Image.sc Forum. There might be relevant details there:
https://forum.image.sc/t/visualisation-of-images-and-million-of-dots-in-napari/45512/8
Hello, I thought I'd revisit this. With some help from Claude Code, I was able to get 2D point clouds of up to 3e8 points working in napari, without using datashader. Here is a video of the marine AIS dataset (~212M points):
https://github.com/user-attachments/assets/ed49b58e-61e3-4241-915e-8e0e6793efa3
Here is a standalone gist which has 3e8 points.
The GPU I'm using is an NVIDIA GeForce RTX 3070 Ti (8192 MiB). The number of points that can be rendered seems to be determined by the amount of GPU memory, which makes total sense. Essentially, we had to use minimal shaders to fit the dataset onto the GPU; if certain features are removed (e.g. borders, symbols, selecting/editing points), some performance can be gained. In the gist, changing face color, opacity, and marker size are supported. I suspect different colors per point could be done efficiently using lookup tables, but I haven't investigated that further; for now, I suppose I'd just add another layer with a different color for another class of points. Edit: this would still require all points across classes to be on the GPU, so the point is moot, hm.
I should note that there is some slight jitter in the points at high zoom in the video, which I don't see in the standalone gist.
I also tried rendering ~5e8 and ~1e9 points, but it resulted in a significantly laggier machine; maybe the dataset ended up being stored in CPU memory.
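That would be consistent with a back-of-the-envelope estimate (my arithmetic, assuming two float32 coordinates per point and ignoring color/index buffers and driver overhead):

```python
bytes_per_point = 2 * 4  # (x, y) as float32
for n in (3e8, 5e8, 1e9):
    print(f"{n:.0e} points -> {n * bytes_per_point / 2**30:.1f} GiB")
# 3e8 points -> 2.2 GiB (fits in 8 GiB)
# 5e8 points -> 3.7 GiB
# 1e9 points -> 7.5 GiB (essentially the whole card)
```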
It's not buttery smooth (~100-500 ms per frame, depending on the size of the napari window), but it is much faster than the previous ~10 seconds per frame using datashader (I tried incorporating cupy/cudf, but it didn't really result in substantial performance gains).
Nice! Thanks for the update! @brisvag @aganders3 I wonder if any of that can be applied to napari/vispy? I think a programmatic-only, read-only-ish (no selection/highlights) layer like this makes some sense. Apropos, I'm fighting highlight performance in Shapes and now wondering the same thing there...
Edit: Also, super cool visualization. Took me a good long while to realize what it was showing, but once I got it... 🤯