Add option to load graph from CSV edgelist

Open marcoSanti opened this issue 8 months ago • 9 comments

What is the expected enhancement?

Hi everybody, I was wondering whether to ask here or on the petgraph repo: is there any plan to add the option to load a graph from a CSV adjacency list?

While this might not seem useful, once the graph gets big, Python is not fast enough to load the data in an acceptable time.

For example, I have a weighted adjacency list of ~4M edges, which takes half an hour to load using Python. This is far from the biggest network I need to analyze (which is in the neighborhood of ~470M edges), and having the option to load it directly from within a Rust-wrapped method would improve performance.

marcoSanti, Apr 03 '25 12:04

A method for reading an edge list file already exists in rustworkx: https://www.rustworkx.org/apiref/rustworkx.PyDiGraph.read_edge_list.html#rustworkx.PyDiGraph.read_edge_list (with an identical method on PyGraph). Are you looking for additional functionality beyond what this method offers?

mtreinish, Apr 03 '25 14:04

Thanks, I think I missed it while looking through the documentation... That being said, is there a way to also pass a callback to the function, so that it can be integrated with libraries such as alive-bar to show progress while loading the graph into memory?
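
A minimal sketch of one possible workaround, assuming the file is read from Python in batches rather than through read_edge_list: each batch is handed to a consumer, and a progress callback is called with the running edge count. The names iter_edge_batches, load_with_progress, consume and on_progress are hypothetical and not part of rustworkx.

```python
import csv
import io
from itertools import islice


def iter_edge_batches(lines, batch_size=100_000, delimiter=","):
    """Yield lists of (source, target, weight) string tuples, batch_size rows at a time."""
    reader = csv.reader(lines, delimiter=delimiter)
    while True:
        chunk = list(islice(reader, batch_size))
        if not chunk:
            break  # end of input
        batch = [tuple(row[:3]) for row in chunk if row]  # skip blank lines
        if batch:
            yield batch


def load_with_progress(lines, consume, on_progress, batch_size=100_000):
    """Feed each batch to consume(), then report the running edge count."""
    total = 0
    for batch in iter_edge_batches(lines, batch_size):
        consume(batch)
        total += len(batch)
        on_progress(total)
    return total


# Toy input (hypothetical data); real code would pass an open file object instead.
sample = io.StringIO("a,b,1\nb,c,2\nc,a,3\n")
edges, seen = [], []
total = load_with_progress(
    sample, consume=edges.extend, on_progress=seen.append, batch_size=2
)
```

Here consume is a plain list.extend; in practice it could be whatever adds a batch of edges to the graph, and on_progress could be the update function of a progress-bar library. The batch size trades callback overhead against how often the progress display refreshes.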

marcoSanti, Apr 04 '25 08:04

See #1033 and #1066 for some related issues.

Ideally, we could use Polars' very fast CSV parsing, cast the result to Arrow data and read that.

Also, for reference, in a benchmark we ran a while ago, loading a graph of all the USA roads (~23.9M nodes, ~58.3M edges) took around a minute: https://www.rustworkx.org/benchmarks.html. So once you get the data into Python, we should be able to process it fast.

IvanIsCoding, Apr 06 '25 05:04

Thank you very much! Just one last thing that I cannot understand: the edge list I have is weighted (the graph itself is undirected), but when I try to import it, the import fails because it finds an integer. Is there a way to import a weighted edge list?

Edit: I looked at the Rust source code for read_edge_list and saw that it can support weighted edge lists. However, when I try to import mine, which is in this form:

A5095008984,A5114004474,1
A5027623414,A5091452978,1
A5094817169,A5094700000,1
A5112796227,A5095008930,1
A5088113121,A5032942346,1
A5094388832,A5094785058,1
A5113839106,A5095008938,1
A5111637342,A5104056478,1
A5086587888,A5112796220,1
A5110885190,A5112796217,1

I get a TypeError, saying that it found an unexpected number in str...
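
A fallback sketch in plain Python, assuming the rows look like the sample above (two string labels and a numeric weight): it maps each label to an integer index in order of first appearance and converts the weight to a float. The helper names index_edge_list and build_graph are hypothetical. build_graph relies on rustworkx's PyGraph.add_nodes_from and PyGraph.extend_from_weighted_edge_list and is only defined here, not called.

```python
import csv
import io


def index_edge_list(lines, delimiter=","):
    """Return (labels, edges) where edges are (source_idx, target_idx, weight)."""
    labels = []    # labels[i] is the label of node index i
    index_of = {}  # label -> node index
    edges = []
    for row in csv.reader(lines, delimiter=delimiter):
        if not row:
            continue  # skip blank lines
        source, target, weight = row[0], row[1], float(row[2])
        for label in (source, target):
            if label not in index_of:
                index_of[label] = len(labels)
                labels.append(label)
        edges.append((index_of[source], index_of[target], weight))
    return labels, edges


def build_graph(labels, edges):
    """Build an undirected graph; assumes rustworkx is installed."""
    import rustworkx as rx

    graph = rx.PyGraph()
    # On an empty graph, nodes get indices 0..n-1 in insertion order,
    # which is what index_edge_list assumed when numbering the labels.
    graph.add_nodes_from(labels)
    graph.extend_from_weighted_edge_list(edges)
    return graph


# Toy rows with made-up labels in the same shape as the sample above.
sample = io.StringIO("A1,A2,1\nA2,A3,1\n")
labels, edges = index_edge_list(sample)
```

The parsing still runs in Python, so this is slower than a Rust-side reader, but it keeps full control over how labels and weights are interpreted.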

marcoSanti, Apr 08 '25 07:04

The USA roads benchmark was written so that all of the file processing is done in Python: a for loop over the file calls add_node and add_edge on the library under test as it iterates. This was partly to have a consistent baseline and partly because the DIMACS format used in the file isn't a commonly used standard. The methods implemented in rustworkx that do the file I/O and iteration in Rust will be a lot faster (as a rule of thumb, at least 10x faster).

mtreinish, Apr 08 '25 10:04

I agree. I think the catch is that a CSV file can hold a graph in several formats:

  • edge list format (in example)
  • adjacency matrix format (requested in a bug)
  • adjacency list format (also possible)
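
To make the difference between the three layouts concrete, here is a small sketch that turns each of them into the same list of edges. The exact conventions are assumptions for illustration: the adjacency list puts a node first and its neighbors after it on the same row, and the adjacency matrix treats any nonzero cell (i, j) as an edge from row i to column j. None of these helper functions exist in rustworkx.

```python
import csv
import io


def edges_from_edge_list(text):
    """One edge per row: source,target."""
    return [(row[0], row[1]) for row in csv.reader(io.StringIO(text)) if row]


def edges_from_adjacency_list(text):
    """One node per row, followed by its neighbors: node,neighbor1,neighbor2,..."""
    edges = []
    for row in csv.reader(io.StringIO(text)):
        if row:
            source, *neighbors = row
            edges.extend((source, neighbor) for neighbor in neighbors)
    return edges


def edges_from_adjacency_matrix(text):
    """Square matrix of numbers; a nonzero cell (i, j) is an edge i -> j."""
    edges = []
    for i, row in enumerate(csv.reader(io.StringIO(text))):
        for j, cell in enumerate(row):
            if float(cell) != 0:
                edges.append((i, j))
    return edges


# The same three-edge graph (a->b, a->c, b->c) in each layout.
from_edges = edges_from_edge_list("a,b\na,c\nb,c\n")
from_adj_list = edges_from_adjacency_list("a,b,c\nb,c\nc\n")
from_matrix = edges_from_adjacency_matrix("0,1,1\n0,0,1\n0,0,0\n")
```

The matrix variant yields integer positions rather than labels, which is one reason a single "load from CSV" entry point needs to know which layout it is given.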

Overall, if we could have the equivalent of https://networkx.org/documentation/stable/reference/generated/networkx.convert_matrix.from_pandas_edgelist.html#from-pandas-edgelist in rustworkx but with Polars that would be our best bet.

Polars would read the input fast, and we'd load it into PyGraph/PyDiGraph fast as well. We could support more standardized text formats, but for general I/O it might be more reasonable to leverage other libraries.

IvanIsCoding, Apr 12 '25 14:04

If you are using CSV or any other separator, you need to pass deliminator, which in this case should be ",". Since your data contains node labels, you also need to pass labels=True. Example: graph = rx.PyGraph.read_edge_list(path=file, deliminator=",", labels=True)

rahaman-quantum, May 08 '25 11:05

It seems like the CSV file cannot have a header (field names)? For example, it would be nice to be able to specify the columns for the source, target, and weight.
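
A sketch of what selecting columns by name could look like, using only the standard library's csv.DictReader to consume the header row. The function read_edge_columns and its parameters source_col, target_col and weight_col are hypothetical and only illustrate the idea; they are not an existing rustworkx interface.

```python
import csv
import io


def read_edge_columns(lines, source_col, target_col, weight_col=None, delimiter=","):
    """Read (source, target, weight) tuples from a CSV with a header row.

    weight is None when no weight column is requested.
    """
    reader = csv.DictReader(lines, delimiter=delimiter)  # first row = field names
    edges = []
    for row in reader:
        weight = float(row[weight_col]) if weight_col is not None else None
        edges.append((row[source_col], row[target_col], weight))
    return edges


# Toy file with made-up column names and values.
sample = io.StringIO("from,to,cost\nA1,A2,1.5\nA2,A3,2\n")
edges = read_edge_columns(sample, source_col="from", target_col="to", weight_col="cost")
```

Extra columns in the file are simply ignored, so the same function works on wider tables as long as the named columns are present.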

stefancoe, Jul 16 '25 14:07

Being able to generate a graph from a DataFrame would be a nice addition (and perhaps a separate issue?). How about making the call agnostic to the input DataFrame type by using Narwhals? It provides a Polars-like compatibility layer over DataFrames from various libraries. https://narwhals-dev.github.io/narwhals/

stefancoe, Aug 02 '25 07:08