datashader
Improve graph line rendering
Add line functionality that accepts input in the form (x_src, y_src, x_dst, y_dst) and runs on the GPU
Codecov Report
Merging #1100 (f826f4e) into master (8f9ec7c) will decrease coverage by 0.04%. The diff coverage is n/a.
@@ Coverage Diff @@
## master #1100 +/- ##
==========================================
- Coverage 85.07% 85.02% -0.05%
==========================================
Files 34 34
Lines 7516 7515 -1
==========================================
- Hits 6394 6390 -4
- Misses 1122 1125 +3
| Impacted Files | Coverage Δ | |
|---|---|---|
| datashader/core.py | 88.11% <ø> (ø) | |
| datashader/glyphs/polygon.py | 95.33% <0.00%> (-0.67%) | :arrow_down: |
| datashader/glyphs/line.py | 93.43% <0.00%> (-0.22%) | :arrow_down: |
| datashader/macros.py | 92.92% <0.00%> (-0.08%) | :arrow_down: |
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 8f9ec7c...f826f4e.
@sagfox I have been thinking about this since we talked, and I think the first task is to identify exactly what your preferred data format is. If that is a DataFrame with 4 columns, x_src, y_src, x_dst, y_dst, then Datashader already supports that via the LinesAxis1 class:
import pandas as pd
import datashader as ds
import datashader.transfer_functions as tf
cvs = ds.Canvas(plot_width=100, plot_height=100)
df = pd.DataFrame(dict(
x_src=[2, 9, 5, 3], y_src=[1, 9, 1, 9],
x_dst=[1, 3, 7, 9], y_dst=[5, 6, 5, 2],
))
agg = cvs.line(source=df, x=["x_src", "x_dst"], y=["y_src", "y_dst"], axis=1, agg=ds.count())
im = tf.shade(agg)
ds.utils.export_image(im, "lines_2pts", background="white")
But if your motivation is graphs with nodes and edges, where each node can potentially have many incident edges, then this format duplicates your node coordinates. A better approach may be to support indexed lines, i.e. the nodes as a sequence of (x, y) coordinates and the edges as a sequence of (start_index, end_index) pairs that index into the nodes sequence. This would seem to imply two separate DataFrames of different lengths, which Datashader doesn't yet support, but that should not be insurmountable.
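To make the indexed-lines idea concrete, here is a minimal sketch of how such a pair of tables could be expanded into the 4-column format today, outside of Datashader. The table layout, the column names `start_index` and `end_index`, and the helper `expand_edges` are assumptions for illustration only; they are not an existing Datashader API.

```python
import pandas as pd

# Hypothetical node table: one row per node. Extra columns such as
# "label" hold per-node metadata that is stored only once.
nodes = pd.DataFrame({
    "x": [2.0, 9.0, 5.0, 3.0],
    "y": [1.0, 9.0, 1.0, 9.0],
    "label": ["a", "b", "c", "d"],
})

# Hypothetical edge table: each row refers to two nodes by position.
edges = pd.DataFrame({
    "start_index": [0, 0, 1, 2],
    "end_index": [1, 2, 3, 3],
})


def expand_edges(nodes, edges):
    """Look up the coordinates of both end points of every edge.

    Returns one row per edge with columns x_src, y_src, x_dst, y_dst.
    """
    xy = nodes[["x", "y"]].to_numpy()
    src = xy[edges["start_index"].to_numpy()]
    dst = xy[edges["end_index"].to_numpy()]
    return pd.DataFrame({
        "x_src": src[:, 0], "y_src": src[:, 1],
        "x_dst": dst[:, 0], "y_dst": dst[:, 1],
    })


segments = expand_edges(nodes, edges)
```

The resulting `segments` frame has the same shape as the `df` in the example above, so it could be passed to `cvs.line` in the same way. Only the coordinates are duplicated per edge; the metadata stays in `nodes`.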
> This would seem to imply two separate DataFrames of different lengths, which Datashader doesn't yet support, but that should not be insurmountable.
That would be a really useful format to support, because it would facilitate associating other columns of data with each node for use with inspect_points hovering or drilldown. Duplicating the coordinates themselves isn't too bad, but duplicating the associated metadata can get extremely expensive.
@ianthomas23 that is interesting; I have never come across examples that demonstrate that. This could be all we need, but I was wondering about the graph edge implementation in datashader here: why does it use a path, with NaN after every 2 rows to break the path so that each edge renders as a separate line, instead of the (x_src, y_src, x_dst, y_dst) approach? Or is it the same thing underneath?
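For comparison, the two layouts being discussed can be sketched side by side. The helper below is hypothetical (not part of Datashader) and assumes the 4-column frame from the earlier example; it interleaves source, destination, and a NaN row for each edge to produce the NaN-separated path layout.

```python
import numpy as np
import pandas as pd


def segments_to_nan_path(df):
    """Convert one-row-per-edge data into a single NaN-separated path.

    Each edge becomes three rows: source point, destination point, NaN.
    """
    n = len(df)
    x = np.full(3 * n, np.nan)
    y = np.full(3 * n, np.nan)
    x[0::3] = df["x_src"].to_numpy()
    x[1::3] = df["x_dst"].to_numpy()
    y[0::3] = df["y_src"].to_numpy()
    y[1::3] = df["y_dst"].to_numpy()
    # Rows 2, 5, 8, ... stay NaN and act as breaks between edges.
    return pd.DataFrame({"x": x, "y": y})


df = pd.DataFrame(dict(
    x_src=[2, 9, 5, 3], y_src=[1, 9, 1, 9],
    x_dst=[1, 3, 7, 9], y_dst=[5, 6, 5, 2],
))
path = segments_to_nan_path(df)
```

A frame like `path` would be drawn as a single line along the default axis, e.g. `cvs.line(path, x="x", y="y", agg=ds.count())`, with the NaN rows breaking it into separate segments. It needs three rows per edge, whereas the 4-column layout needs one.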
> for the graph edge implementation in datashader here, why is it using a path, with NaN after every 2 rows, to break the path to render it as a separate line, instead of the (x_src, y_src, x_dst, y_dst) approach? Or is it the same thing underneath?
Good question! I believe the answer is simply that edge bundling was implemented by Ian Calvert before the other line formats were implemented by Jon Mease. It does sound like it would be much more efficient to update bundling to use the newer format.
Datashader suffers a bit from a history of "drive-by" contributions made by specific people at specific times. Because of a lack of core maintainer staffing, those contributions are not always fully integrated with the rest of the codebase as new features are introduced. I'm hoping that with Ian now on staff we can address issues like that, but that will only happen gradually as he starts to work on topics that overlap with a particular area of the code. So if we have a project involving edge bundling (which is on our list but quite low in priority right now), then we can revisit and update that.