[FEA]: nx-cugraph should intercept built-in constructors like `from_pandas_edgelist` if `NETWORKX_BACKEND_PRIORITY=cugraph`

Open · beckernick opened this issue 1 year ago · 5 comments

Is this a new feature, an improvement, or a change to existing functionality?

New Feature

How would you describe the priority of this feature request

High

Please provide a clear description of problem this feature solves

When we use nx-cugraph, we currently need to create the NetworkX graph on the CPU regardless of whether every algorithm we intend to use is supported by the cuGraph backend. As a result, we pay a non-trivial performance penalty converting between CPU and GPU graphs.

The new caching mechanism configurable via CACHE_CONVERTED_GRAPH=True was designed to address this problem, making it possible to only pay this cost once per graph if you're going to run multiple algorithms.
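To make the "pay once per graph" idea concrete, here is a minimal pure-Python sketch. It is not NetworkX or nx-cugraph code: `CpuGraph`, `convert_to_gpu`, `run_algorithm`, and the `stats` counter are all hypothetical stand-ins, and the "conversion" is just a tuple copy.

```python
# Toy illustration only: none of these names come from NetworkX or nx-cugraph.
stats = {"conversions": 0}  # counts how often the (pretend) expensive conversion runs


class CpuGraph:
    """Stand-in for a CPU graph that can remember converted copies of itself."""

    def __init__(self, edges):
        self.edges = list(edges)
        self._backend_cache = {}  # hypothetical per-graph cache, keyed by backend name


def convert_to_gpu(graph, use_cache=True):
    """Pretend CPU -> GPU conversion; returns a tuple standing in for a GPU graph."""
    if use_cache and "gpu" in graph._backend_cache:
        return graph._backend_cache["gpu"]  # cache hit: no conversion cost
    stats["conversions"] += 1
    converted = tuple(graph.edges)  # placeholder for the real device copy
    if use_cache:
        graph._backend_cache["gpu"] = converted
    return converted


def run_algorithm(graph, use_cache=True):
    """Every algorithm call needs a GPU copy of the graph first."""
    gpu_graph = convert_to_gpu(graph, use_cache)
    return len(gpu_graph)


G = CpuGraph([(0, 1), (1, 2)])
run_algorithm(G)  # first call pays for the conversion
run_algorithm(G)  # second call reuses the cached copy
```

With caching on, two algorithm calls trigger a single conversion. The CPU graph still has to be built first, which is the cost this issue is about.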

But it would be great to avoid this cost in the first place by dispatching on the graph construction operators in addition to the algorithms. In the example below, we spend significant time in from_pandas_edgelist and _convert_graph (the latter of which is only a one-time cost if we use caching).

If I've already committed to using the cuGraph backend as the top-priority backend, I'd ideally just create the graph on the GPU and only pay the CPU/GPU conversion cost if I need to fall back to the CPU.
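The behavior I have in mind can be sketched with a toy dispatcher. This is an assumption-laden illustration, not how NetworkX dispatching is implemented: `Graph`, `IMPLEMENTATIONS`, `pick_backend`, `construct`, `run`, and `rare_algorithm` are all hypothetical names, and "gpu"/"cpu" are just labels.

```python
from dataclasses import dataclass

# Hypothetical table of which functions each backend implements.
IMPLEMENTATIONS = {
    "gpu": {"from_edgelist", "pagerank"},
    "cpu": {"from_edgelist", "pagerank", "rare_algorithm"},
}

CONVERSIONS = []  # records every (from_device, to_device) conversion that is paid


@dataclass
class Graph:
    device: str
    edges: tuple


def pick_backend(name, priority):
    """Return the first backend in priority order (CPU last) that implements `name`."""
    for backend in [*priority, "cpu"]:
        if name in IMPLEMENTATIONS[backend]:
            return backend
    raise NotImplementedError(name)


def construct(edges, priority):
    """Dispatch the constructor too, so the graph is born on the preferred backend."""
    backend = pick_backend("from_edgelist", priority)
    return Graph(backend, tuple(edges))


def run(name, graph, priority):
    """Run an algorithm, converting the graph only if the chosen backend differs."""
    backend = pick_backend(name, priority)
    if graph.device != backend:
        CONVERSIONS.append((graph.device, backend))
        graph = Graph(backend, graph.edges)  # stand-in for the real conversion
    return backend


G = construct([(0, 1), (1, 2)], priority=["gpu"])
ran_pagerank_on = run("pagerank", G, priority=["gpu"])    # supported: stays on GPU
ran_rare_on = run("rare_algorithm", G, priority=["gpu"])  # unsupported: falls back
```

Here the graph is created directly on the preferred backend, `pagerank` runs without any conversion, and the only conversion recorded is the one forced by the CPU fallback.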

# !wget https://data.rapids.ai/cugraph/datasets/cit-Patents.csv

%env NETWORKX_BACKEND_PRIORITY=cugraph

import pandas as pd
import networkx as nx

df = pd.read_csv("cit-Patents.csv", sep=" ", names=["src", "dst"], dtype="int32")

%%snakeviz

G = nx.from_pandas_edgelist(df.head(1000000), source="src", target="dst")
pr = nx.pagerank(G, alpha=0.9)
[Screenshot: snakeviz profile of the cell above, 2024-05-31]

But cuGraph supports from_pandas_edgelist directly, and it's much faster (roughly 100 ms vs. 8 s in this case):

import cugraph

%timeit -n3 -r3 G_gpu = cugraph.from_pandas_edgelist(df.head(1000000), source="src", destination="dst")
71.2 ms ± 11.8 ms per loop (mean ± std. dev. of 3 runs, 3 loops each)

Describe your ideal solution

The following code should dispatch to the cuGraph backend for from_pandas_edgelist in addition to pagerank.

# !wget https://data.rapids.ai/cugraph/datasets/cit-Patents.csv

%env NETWORKX_BACKEND_PRIORITY=cugraph

import pandas as pd
import networkx as nx

df = pd.read_csv("cit-Patents.csv", sep=" ", names=["src", "dst"], dtype="int32")

G = nx.from_pandas_edgelist(df.head(1000000), source="src", target="dst")
pr = nx.pagerank(G, alpha=0.9)

Describe any alternatives you have considered

No response

Additional context

No response

Code of Conduct

  • [X] I agree to follow cuGraph's Code of Conduct
  • [X] I have searched the open feature requests and have found no duplicates for this feature request

beckernick · May 31 '24 14:05