Add option to load graph from CSV edgelist
What is the expected enhancement?
Hi everybody, I was wondering (and was not sure whether to ask here or on the petgraph repo) whether there is any plan to add the option to load a graph from a CSV adjacency list.
While this might not seem useful, once the graph gets big, Python is not fast enough to load the data in an acceptable time.
For example, I have a weighted adjacency list of ~4M edges, which takes half an hour to load using Python. This is far from the biggest network I need to analyze (which is in the neighborhood of ~470M edges), and having the option to load it directly from within a Rust-wrapped method would improve performance.
A method for reading an edge list file already exists in rustworkx: https://www.rustworkx.org/apiref/rustworkx.PyDiGraph.read_edge_list.html#rustworkx.PyDiGraph.read_edge_list (with an identical method on PyGraph). Are you looking for additional functionality beyond what this method offers?
Thanks, I think I missed it while looking through the documentation... That being said, is there a way to also pass a callback to the function, so that it can be integrated with libraries such as alive-bar to show progress while loading the graph into memory?
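As a rough sketch of what such a hook could look like on the Python side, assuming no callback is available on the Rust reader: the helper below (`iter_edges_with_progress`, a hypothetical name, not part of rustworkx) parses the CSV with the standard library and calls a user-supplied function every few rows. It runs entirely in Python, so it gives up the speed of the Rust reader in exchange for progress reporting.

```python
import csv
import io
from typing import Callable, Iterator, TextIO, Tuple


def iter_edges_with_progress(
    handle: TextIO,
    on_progress: Callable[[int], None],
    every: int = 2,
) -> Iterator[Tuple[str, str, float]]:
    """Yield (source, target, weight) rows and report progress periodically.

    Hypothetical helper for illustration; `every` controls how often
    `on_progress` is called with the number of rows read so far.
    """
    count = 0
    for row in csv.reader(handle):
        if not row:
            continue  # skip blank lines
        source, target, weight = row[0], row[1], float(row[2])
        count += 1
        if count % every == 0:
            on_progress(count)  # e.g. advance a progress bar here
        yield source, target, weight
    on_progress(count)  # final report with the true total


# Three rows taken from the sample posted in this thread.
sample = io.StringIO(
    "A5095008984,A5114004474,1\n"
    "A5027623414,A5091452978,1\n"
    "A5094817169,A5094700000,1\n"
)
reports = []
edges = list(iter_edges_with_progress(sample, reports.append))
```

Here `reports.append` stands in for whatever updates the progress bar; a larger `every` keeps the callback overhead small on big files.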
See #1033 and #1066 for some related issues.
Ideally, we could use Polars' very fast CSV parsing, cast it to Arrow data and read that.
Also, for reference, a benchmark we ran a while ago loaded a graph of all of the USA roads (~23.9M nodes, ~58.3M edges) in around a minute: https://www.rustworkx.org/benchmarks.html. So once you get the data into Python, we should be able to process it fast.
Thank you very much! Just one last thing that I cannot understand: the edge list I have is weighted (the graph itself is undirected), but when I try to import it, it fails because it expects an integer. Is there a way to import a weighted edge list?
Edit: I looked at the Rust source code for read_edge_list and saw that it can support weighted edge lists. However, when I try to import mine, which is in this form:
A5095008984,A5114004474,1
A5027623414,A5091452978,1
A5094817169,A5094700000,1
A5112796227,A5095008930,1
A5088113121,A5032942346,1
A5094388832,A5094785058,1
A5113839106,A5095008938,1
A5111637342,A5104056478,1
A5086587888,A5112796220,1
A5110885190,A5112796217,1
I get a TypeError, saying that it found an unexpected number in str...
Also, for reference, a benchmark we ran a while ago loaded a graph of all of the USA roads (~23.9M nodes, ~58.3M edges) in around a minute: https://www.rustworkx.org/benchmarks.html. So once you get the data into Python, we should be able to process it fast.
This benchmark was written with all of the file processing done in Python: a for loop over the file called add_node and add_edge on the library under test as it iterated. This was partly to have a consistent baseline and partly because the DIMACS format used in the file isn't a commonly used standard. The methods implemented in rustworkx that do the file I/O and iteration in Rust will be a lot faster (as a rule of thumb, at least 10x faster).
I agree. I think the catch is that "CSV" can mean several different layouts:
- edge list format (in example)
- adjacency matrix format (requested in a bug)
- adjacency list format (also possible)
Overall, if we could have the equivalent of https://networkx.org/documentation/stable/reference/generated/networkx.convert_matrix.from_pandas_edgelist.html#from-pandas-edgelist in rustworkx, but with Polars, that would be our best bet.
Polars would read the input fast, and we'd load it into PyGraph/PyDiGraph fast as well. We could support more standardized text formats, but for general I/O it might be more reasonable to leverage other libraries.
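To make the difference between the three layouts listed above concrete, here is a small standard-library sketch that normalizes each one into the same list of (source, target, weight) tuples. The function names and the tiny inputs are made up for illustration, and the adjacency-matrix reader assumes a header row and a leading label column, which real files may not have.

```python
import csv
import io


def edges_from_edge_list(text):
    # Each row: source,target,weight
    return [(r[0], r[1], float(r[2])) for r in csv.reader(io.StringIO(text)) if r]


def edges_from_adjacency_list(text):
    # Each row: node,neighbor1,neighbor2,... (unweighted, so weight defaults to 1.0)
    edges = []
    for row in csv.reader(io.StringIO(text)):
        if row:
            edges.extend((row[0], neighbor, 1.0) for neighbor in row[1:])
    return edges


def edges_from_adjacency_matrix(text):
    # Assumed layout: header row with a blank corner cell, then one row per node.
    rows = [r for r in csv.reader(io.StringIO(text)) if r]
    names = rows[0][1:]
    edges = []
    for row in rows[1:]:
        for name, cell in zip(names, row[1:]):
            weight = float(cell)
            if weight != 0:  # zero means "no edge" in this sketch
                edges.append((row[0], name, weight))
    return edges


edge_list = edges_from_edge_list("a,b,2\nb,c,1\n")
adjacency_list = edges_from_adjacency_list("a,b,c\nb,c\n")
adjacency_matrix = edges_from_adjacency_matrix(",a,b,c\na,0,2,0\nb,0,0,1\nc,0,0,0\n")
```

The matrix reader treats the file as directed; for a symmetric matrix of an undirected graph it would emit each edge twice unless only one triangle is read.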
If you are using CSV or any other separator, you need to pass deliminator, which in this case should be ",". As your data contains node labels, you also need to pass labels=True. Example: graph = rx.PyGraph.read_edge_list(path=file, deliminator=",", labels=True)
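For anyone wondering what treating the first two fields as labels amounts to, here is a pure-Python sketch of the idea: each distinct label gets an integer index in first-seen order, and the edges are expressed in terms of those indices. This only illustrates the concept with a hypothetical helper (`index_labeled_edges`); it is not how rustworkx implements it, and the index order rustworkx assigns may differ.

```python
import csv
import io
from typing import Dict, List, Optional, Tuple


def index_labeled_edges(
    text: str, delimiter: str = ","
) -> Tuple[List[str], List[Tuple[int, int, Optional[str]]]]:
    """Map string node labels to integer indices in first-seen order."""
    node_index: Dict[str, int] = {}
    edges: List[Tuple[int, int, Optional[str]]] = []
    for row in csv.reader(io.StringIO(text), delimiter=delimiter):
        if not row:
            continue  # skip blank lines
        endpoints = []
        for label in row[:2]:
            if label not in node_index:
                node_index[label] = len(node_index)
            endpoints.append(node_index[label])
        weight = row[2] if len(row) > 2 else None  # weight kept as a string
        edges.append((endpoints[0], endpoints[1], weight))
    return list(node_index), edges  # dicts preserve insertion order


# First two rows come from the sample in this thread; the third is made up
# to show that a repeated label maps to the same index.
data = (
    "A5095008984,A5114004474,1\n"
    "A5027623414,A5091452978,1\n"
    "A5095008984,A5027623414,1\n"
)
labels, edges = index_labeled_edges(data)
```

Without such a mapping, a reader that expects integer node indices has no way to interpret a field like A5095008984, which is consistent with the TypeError reported above.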
It seems like the CSV file cannot have a header (field names)? For example, it would be nice to be able to specify the columns for the source, target, and weight.
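A sketch of what column selection could look like, done in plain Python with csv.DictReader before handing the edges to a graph library. The function name and the column names (author_a, author_b, papers) are invented for the example and do not correspond to any existing rustworkx API.

```python
import csv
import io


def read_edges_by_column(handle, source="source", target="target", weight=None):
    """Read a CSV with a header row, picking the edge columns by name."""
    edges = []
    for row in csv.DictReader(handle):
        # Weight is optional; None means the file is treated as unweighted.
        w = float(row[weight]) if weight is not None else None
        edges.append((row[source], row[target], w))
    return edges


# Hypothetical header; the data rows come from the sample in this thread.
sample = io.StringIO(
    "author_a,author_b,papers\n"
    "A5095008984,A5114004474,1\n"
    "A5027623414,A5091452978,1\n"
)
edges = read_edges_by_column(
    sample, source="author_a", target="author_b", weight="papers"
)
```

Any extra columns in the file are simply ignored, which is the main convenience of selecting by name rather than by position.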
Being able to generate a graph from a DataFrame would be a nice addition (and perhaps a separate issue?). How about making the call agnostic to the input DataFrame type by using Narwhals? It basically exposes a Polars-like API over DataFrames from various libraries. https://narwhals-dev.github.io/narwhals/