datashader Speed up connect

Speed up connect_edges

Open ivirshup opened this issue 5 years ago • 16 comments

Currently, connect_edges is pretty slow. This speeds it up a lot

Here's an example of the speed up

import numba
import pandas as pd
import numpy as np
from datashader.bundling import connect_edges

New implementation

@numba.njit
def _edge_val_to_segments(vals):
    edge_vals = np.full(vals.size * 3, np.nan, np.dtype("float"))
    for i in numba.prange(len(vals)):
        pos = int(i * 3)
        edge_vals[pos:pos+2] = vals[i]
    return edge_vals

@numba.njit
def _format_graph_segments(nodes, edges):
    edge_table = np.full((edges.shape[0] * 3, 2), np.nan)
    for i in numba.prange(edges.shape[0]):
        pos = int(i * 3)
        s, t = edges[i]
        edge_table[pos  , 0] = nodes[s, 0]
        edge_table[pos  , 1] = nodes[s, 1]
        edge_table[pos+1, 0] = nodes[t, 0]
        edge_table[pos+1, 1] = nodes[t, 1]
    return edge_table

def _connect_edges(
    nodes,
    edges,
    source="source",
    target="target",
    weight=None,
    x="x",
    y="y",
    include_edge_id=False
):
    e = pd.DataFrame(
        _format_graph_segments(
            nodes[[x, y]].values,
            edges[[source, target]].values),
        columns=[x, y]
    )
    if weight is not None:
        e[weight] = _edge_val_to_segments(edges[weight].values)
    if include_edge_id:
        e["edge_id"] = _edge_val_to_segments(edges["id"])
    return e

# Data gen
def gen_graph(nnodes=100000, nedges=1000000):
    nodes = pd.DataFrame(
        np.random.random((nnodes, 2)),
        columns=list("xy")
    )
    edges = pd.DataFrame(
        np.random.randint(0, nnodes, size=(nedges, 2), dtype=int),
        columns=["source", "target"]
    )
    return nodes, edges

nodes, edges = gen_graph()

# Running:

 # Jit warmup
new = _connect_edges(nodes, edges)
orig = connect_edges(nodes, edges)
assert np.allclose(new, orig, equal_nan=True)


%timeit _connect_edges(nodes, edges)
# 58.5 ms ± 1.27 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%time connect_edges(nodes, edges)
# CPU times: user 53.2 s, sys: 630 ms, total: 53.9 s
# Wall time: 54 s

I've labelled this as a work in progress for a few reasons:

[ ] I currently can't run all the tests locally (#724), so opening this pull request will let me see if anything else fails
[ ] This could be sped up even more with njit(parallel=True) or njit(cache=True), but I don't see those being used a lot in this codebase. Should I set one of those? Is there a preference for one over the other?
[ ] test_directly_connect_with_weights[True] and test_directly_connect_without_weights[True] under test_bundling.py are failing, but the tests might be wrong. Here's what I mean:

Given an edge data frame that look like this:

   id  source  target  weight
0   0       0       1     1.0
1   1       0       2     1.0
2   2       0       3     1.0
3   3       0       4     1.0

I would think the output should have edge ids [0, 1, 2, 3]. However, the test says they should have ids [1, 2, 3, 4] (link). It looks to me like the edge_id is being set by id of the target node because the node dataframe that's passed also has an id field.

Mar 12 '19 02:03 ivirshup

Thanks for the contribution! For the WIP items:

Hopefully if you use pytest ==3.9.3 and pytest-benchmark ==3.0.0 you can run the tests locally, or you can look at the results from CI above.
The datashader codebase was written for an early version of Numba, and while we've updated some parts, there are surely areas where we can make use of newer Numba features. So feel free to use anything now available in Numba!
ids starting at 0 look correct to me, but I haven't studied the code. Maybe @philippjfr remembers the details here and can comment?

Mar 12 '19 21:03 jbednar

Maybe @philippjfr remembers the details here and can comment?

Don't think I've ever looked at the internals of that code although we do expose the functionality in holoviews.

Mar 13 '19 00:03 philippjfr

@philippjfr could you point out to me where that's exposed? I'd like to see an example of this argument being used.

I found the holoviews function connect_edges_pd, which does what I had previously assumed connect_edges(..., include_edge_id=True) would do: always making the index of the edge dataframe the edge ids, instead of assuming there would be an id field.

Mar 13 '19 03:03 ivirshup

While CI and installation stuff gets worked out, do I have the devs blessing to change the edge id tests?

Mar 15 '19 12:03 ivirshup

I wish I could give you that blessing, but I won't have a chance to look at this code until Tuesday to see whether it was the code or the tests that was wrong before. Sorry about that!

Mar 20 '19 04:03 jbednar

@jbednar, any chance you've been able to look at those tests?

Apr 01 '19 12:04 ivirshup

Sorry for the delay! I did look at them, but it's a mess! it looks like different decisions were made between HoloViews and Datashader about what to do when there is an id field available, and we haven't yet worked out what to do about that, how to maintain backwards compatibility , and how to deal with it. I had hoped we could address this before our release today, but other deadlines prevented us from fully working that out, so we have had to postpone it again to next week, after the release. If we find a problem we'll fix it and make a new release, merging your improvements either way. So we haven't forgotten about this; it's just delayed more...

Apr 03 '19 17:04 jbednar

No problem! Yeah, the current implementation was much more complicated than I was expecting.

I would note the current datashader function does return faulty values if the data doesn't match up exactly with the expectations.

For example, include_edge_id=True without having an id field messes up the layout.

import pandas as pd
import numpy as np
import datashader as ds
from datashader.layout import circular_layout
from datashader.bundling import connect_edges

n, m = 100, 1000

nodes = pd.DataFrame(["node"+str(i) for i in range(n)], columns=['name'])
edges = pd.DataFrame(np.random.randint(0,len(nodes), size=(m, 2)),
                     columns=['source', 'target'])

circular = circular_layout(nodes, uniform=False)

print(connect_edges(circular, edges).head())
print(connect_edges(circular, edges, include_edge_id=True).head())
#           x         y
# 0  0.149448  0.143471
# 1  0.037398  0.310265
# 2       NaN       NaN
# 3  0.999698  0.517384
# 4  0.002216  0.452981
#     edge_id         x              y
# 0  0.149448  0.143471   3.739811e-02
# 1  0.149448  0.310265  4.594811e-322
# 2       NaN       NaN            NaN
# 3  0.999698  0.517384   2.215678e-03
# 4  0.999698  0.452981  1.185758e-322

Additionally, it doesn't throw an error if you don't provide information it needs (like node positions).

I would've opened an issue, but figured this PR fixes it anyways.

Apr 05 '19 01:04 ivirshup

I reran the Appveyor tests on this PR to try to see what is failing. Some old failures have gone away (presumably issues fixed on master) but the tests that are failing do seem to relate to this PR. In particulartest_directly_connect_with_weights seems to be failing due to unexpected results.

It is odd that this issue only appears on Windows but on that platform at least, either the unit test definition or the data generated using the code in this PR is incorrect. That is a shame as I would love to see this PR merged: those are some very nice speedups!

Jan 28 '20 16:01 jlstevens

Actually, I'm not convinced that unit test is being on travis which would explain why only appveyor is failing right now!

Jan 28 '20 17:01 jlstevens

IIRC, different environments ran different sets of tests.

There was some discussion of this in https://github.com/holoviz/datashader/issues/724, but I'm not sure if the issue has been solved since.

Jan 29 '20 03:01 ivirshup

Sorry for the long delay, we're interested in this work, could you push an empty commit or a trivial change to get the test suite to run again? Thanks !

Nov 29 '21 16:11 maximlt

Sorry for the long delay, we're interested in this work

Ha, no worries. I may not be able to promise much time to this now, but took a quick look over the code again.

Looking at the CI, it seems like those tests still need to be modified. It's clearly wrong that there are edges with ids [0, 1, 2, 3] in the input, but edge_ids [1, 2, 3, 4] are referenced in the output, right?

Dec 10 '21 20:12 ivirshup

@ivirshup I believe that you will be attending the NumFOCUS Project Summit next week so perhaps we can have a chat about this work then?

Sep 06 '23 14:09 ianthomas23

Sure. Did we get a list of people who are going to be there?

Sep 06 '23 16:09 ivirshup

There is a summit-2023 channel in the NumFOCUS slack and I am assuming that everyone in the channel will be attending.

Sep 06 '23 16:09 ianthomas23

datashader datashader copied to clipboard

Speed up connect_edges

datashader
datashader copied to clipboard