datashader
datashader copied to clipboard
Speed up connect_edges
Currently, connect_edges
is pretty slow. This speeds it up a lot
Here's an example of the speed up
import numba
import pandas as pd
import numpy as np
from datashader.bundling import connect_edges
New implementation
@numba.njit
def _edge_val_to_segments(vals):
edge_vals = np.full(vals.size * 3, np.nan, np.dtype("float"))
for i in numba.prange(len(vals)):
pos = int(i * 3)
edge_vals[pos:pos+2] = vals[i]
return edge_vals
@numba.njit
def _format_graph_segments(nodes, edges):
edge_table = np.full((edges.shape[0] * 3, 2), np.nan)
for i in numba.prange(edges.shape[0]):
pos = int(i * 3)
s, t = edges[i]
edge_table[pos , 0] = nodes[s, 0]
edge_table[pos , 1] = nodes[s, 1]
edge_table[pos+1, 0] = nodes[t, 0]
edge_table[pos+1, 1] = nodes[t, 1]
return edge_table
def _connect_edges(
nodes,
edges,
source="source",
target="target",
weight=None,
x="x",
y="y",
include_edge_id=False
):
e = pd.DataFrame(
_format_graph_segments(
nodes[[x, y]].values,
edges[[source, target]].values),
columns=[x, y]
)
if weight is not None:
e[weight] = _edge_val_to_segments(edges[weight].values)
if include_edge_id:
e["edge_id"] = _edge_val_to_segments(edges["id"])
return e
# Data gen
def gen_graph(nnodes=100000, nedges=1000000):
nodes = pd.DataFrame(
np.random.random((nnodes, 2)),
columns=list("xy")
)
edges = pd.DataFrame(
np.random.randint(0, nnodes, size=(nedges, 2), dtype=int),
columns=["source", "target"]
)
return nodes, edges
nodes, edges = gen_graph()
# Running:
# Jit warmup
new = _connect_edges(nodes, edges)
orig = connect_edges(nodes, edges)
assert np.allclose(new, orig, equal_nan=True)
%timeit _connect_edges(nodes, edges)
# 58.5 ms ± 1.27 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%time connect_edges(nodes, edges)
# CPU times: user 53.2 s, sys: 630 ms, total: 53.9 s
# Wall time: 54 s
I've labelled this as a work in progress for a few reasons:
- [ ] I currently can't run all the tests locally (#724), so opening this pull request will let me see if anything else fails
- [ ] This could be sped up even more with
njit(parallel=True)
ornjit(cache=True)
, but I don't see those being used a lot in this codebase. Should I set one of those? Is there a preference for one over the other? - [ ]
test_directly_connect_with_weights[True]
andtest_directly_connect_without_weights[True]
undertest_bundling.py
are failing, but the tests might be wrong. Here's what I mean:
Given an edge data frame that look like this:
id source target weight
0 0 0 1 1.0
1 1 0 2 1.0
2 2 0 3 1.0
3 3 0 4 1.0
I would think the output should have edge ids [0, 1, 2, 3]
. However, the test says they should have ids [1, 2, 3, 4]
(link). It looks to me like the edge_id
is being set by id of the target node because the node dataframe that's passed also has an id
field.
Thanks for the contribution! For the WIP items:
- Hopefully if you use
pytest ==3.9.3
andpytest-benchmark ==3.0.0
you can run the tests locally, or you can look at the results from CI above. - The datashader codebase was written for an early version of Numba, and while we've updated some parts, there are surely areas where we can make use of newer Numba features. So feel free to use anything now available in Numba!
- ids starting at 0 look correct to me, but I haven't studied the code. Maybe @philippjfr remembers the details here and can comment?
Maybe @philippjfr remembers the details here and can comment?
Don't think I've ever looked at the internals of that code although we do expose the functionality in holoviews.
@philippjfr could you point out to me where that's exposed? I'd like to see an example of this argument being used.
I found the holoviews function connect_edges_pd
, which does what I had previously assumed connect_edges(..., include_edge_id=True)
would do: always making the index of the edge dataframe the edge ids, instead of assuming there would be an id
field.
While CI and installation stuff gets worked out, do I have the devs blessing to change the edge id tests?
I wish I could give you that blessing, but I won't have a chance to look at this code until Tuesday to see whether it was the code or the tests that was wrong before. Sorry about that!
@jbednar, any chance you've been able to look at those tests?
Sorry for the delay! I did look at them, but it's a mess! it looks like different decisions were made between HoloViews and Datashader about what to do when there is an id field available, and we haven't yet worked out what to do about that, how to maintain backwards compatibility , and how to deal with it. I had hoped we could address this before our release today, but other deadlines prevented us from fully working that out, so we have had to postpone it again to next week, after the release. If we find a problem we'll fix it and make a new release, merging your improvements either way. So we haven't forgotten about this; it's just delayed more...
No problem! Yeah, the current implementation was much more complicated than I was expecting.
I would note the current datashader function does return faulty values if the data doesn't match up exactly with the expectations.
For example, include_edge_id=True
without having an id
field messes up the layout.
import pandas as pd
import numpy as np
import datashader as ds
from datashader.layout import circular_layout
from datashader.bundling import connect_edges
n, m = 100, 1000
nodes = pd.DataFrame(["node"+str(i) for i in range(n)], columns=['name'])
edges = pd.DataFrame(np.random.randint(0,len(nodes), size=(m, 2)),
columns=['source', 'target'])
circular = circular_layout(nodes, uniform=False)
print(connect_edges(circular, edges).head())
print(connect_edges(circular, edges, include_edge_id=True).head())
# x y
# 0 0.149448 0.143471
# 1 0.037398 0.310265
# 2 NaN NaN
# 3 0.999698 0.517384
# 4 0.002216 0.452981
# edge_id x y
# 0 0.149448 0.143471 3.739811e-02
# 1 0.149448 0.310265 4.594811e-322
# 2 NaN NaN NaN
# 3 0.999698 0.517384 2.215678e-03
# 4 0.999698 0.452981 1.185758e-322
Additionally, it doesn't throw an error if you don't provide information it needs (like node positions).
I would've opened an issue, but figured this PR fixes it anyways.
I reran the Appveyor tests on this PR to try to see what is failing. Some old failures have gone away (presumably issues fixed on master) but the tests that are failing do seem to relate to this PR. In particulartest_directly_connect_with_weights
seems to be failing due to unexpected results.
It is odd that this issue only appears on Windows but on that platform at least, either the unit test definition or the data generated using the code in this PR is incorrect. That is a shame as I would love to see this PR merged: those are some very nice speedups!
Actually, I'm not convinced that unit test is being on travis which would explain why only appveyor is failing right now!
IIRC, different environments ran different sets of tests.
There was some discussion of this in https://github.com/holoviz/datashader/issues/724, but I'm not sure if the issue has been solved since.
Sorry for the long delay, we're interested in this work, could you push an empty commit or a trivial change to get the test suite to run again? Thanks !
Sorry for the long delay, we're interested in this work
Ha, no worries. I may not be able to promise much time to this now, but took a quick look over the code again.
Looking at the CI, it seems like those tests still need to be modified. It's clearly wrong that there are edges with id
s [0, 1, 2, 3]
in the input, but edge_ids
[1, 2, 3, 4]
are referenced in the output, right?
@ivirshup I believe that you will be attending the NumFOCUS Project Summit next week so perhaps we can have a chat about this work then?
Sure. Did we get a list of people who are going to be there?
There is a summit-2023
channel in the NumFOCUS slack and I am assuming that everyone in the channel will be attending.