cugraph icon indicating copy to clipboard operation
cugraph copied to clipboard

[QST]: MST cuGraph RuntimeError: RAFT failure, how to fix?

Open stephwon opened this issue 1 year ago • 7 comments

What is your question?

I'm trying to conduct a Steiner Tree approximation with several terminal nodes (and subgraphs I need to analyze). However, whenever I try to iterate through, it fails at MST despite the graph being Undirected: RuntimeError: RAFT failure at file=/scratch/user/anaconda3/envs/rapids-24.04/include/raft/sparse/solver/detail/mst_solver_inl.cuh line=152:

The code works if I run through terminal nodes for each subgraph (one at a time). I already checked for self-loops, isolated nodes, or disconnected components in the big graph data, and they don't exist. The data types are to the Dask cugraph standard, so there is no issue there. Even if I change `DiG = cugraph.Graph(directed=False)' problem persists with the same error message. I'm not sure how to address this error... or am I missing something here? Graph data (df_graph) has 46.6M edges, directed and unweighted. My code below:

#Set working directory
import os
os.chdir('/scratch') 

# Install required packages
import pandas as pd
import matplotlib.pyplot as plt
import yaml
import sys 
import csv

# Install Networka
import networkx as nx

import cudf
cudf.__version__

import cugraph
cugraph.__version__

# Import Dask to utilize Muti-GPU
import dask_cudf
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

import cugraph.dask as dcg
import cugraph.dask.comms.comms as Comms
from cugraph.generators.rmat import rmat

# Initialize Dask CUDA cluster and client
cluster = LocalCUDACluster()
client = Client(cluster)

# Initialize cuGraph communication
Comms.initialize(p2p=True)


columns_to_read = ['subject', 'object', 'predicate']

# Read the TSV file using Dask cuDF with specific columns
df_graph = dask_cudf.read_csv('big_graph_edge.tsv', sep='\t', 
                        usecols=columns_to_read)

# Renaming columns to match cuGraph requirements
df_graph = df_graph.rename(columns={'subject': 'source', 'object': 'destination'})

print(df_graph.head())

# Create graph from input data
DiG = cugraph.Graph(directed=True)
DiG.from_dask_cudf_edgelist(df_graph, source = 'source', destination = 'destination' )

# Verify the graph has edges
print("Number of edges in the graph:", DiG.number_of_edges())

df = pd.read_csv('/scratch/subgraph_nodes.csv', sep = ",")
df

# Define a function to clean each c_id list
def get_terminals(c_ids):
    # Remove the leading and trailing square brackets
    c_ids = c_ids.strip("[]")
    # Split by comma and clean each element
    return [c_id.strip().strip("'") for c_id in c_ids.split(',')]

# Apply the cleaning function to the 'c_ids' column and convert to list of lists
terminals = df['c_ids'].head().apply(lambda x: get_terminals(x) if pd.notnull(x) else []).tolist()
terminals # output is list of lists

# Filter out terminals that are not in the graph
filt_subgraphs = [[t for t in terminal if DiG.has_node(t)] for terminal in terminals]

# Check if there are valid terminals
if not filt_subgraphs:
    raise ValueError("None of the terminal vertices are present in the graph")

 def bfs_with_cache(graph, start, cache):
    if start in cache:
        return cache[start]
    distances = dcg.bfs(graph, start=start)
    distances_df = distances.compute().to_pandas()
    cache[start] = distances_df
    return distances_df

 def get_edges(lst,DiG,cache):
    edges = []

    # Find the shortest paths between all pairs of terminal vertices using BFS
    for l in lst:
        distances_df = bfs_with_cache(DiG,l,cache)
        filtered_df = distances_df[distances_df['vertex'].isin(lst)]
        
        for _, row in filtered_df.iterrows():
            if row['vertex'] != l:
                edges.append((l, row['vertex'], row['distance']))
                
    # Create a new DataFrame to hold the edges of the shortest path graph
    edges_df = pd.DataFrame(edges, columns=['source', 'destination', 'weight'])
    return edges_df

def get_mst_results(edges_df):
    # Create a new graph for MST purpose
    mst_graph = cugraph.Graph(directed=False)
    mst_graph.from_pandas_edgelist(edges_df, source='source', destination='destination', edge_attr='weight')

    # Compute the MST
    mst = cugraph.minimum_spanning_tree(mst_graph)

    # Extract the edge list of the MST
    mst_edges = mst.view_edge_list()
    mst_df = mst_edges.to_pandas()

    # Get number of edges and nodes
    num_edges = mst_df.shape[0]
    num_nodes = len(pd.concat([mst_df['src'], mst_df['dst']]).unique())
    
    return num_edges, num_nodes


cache = {}

# need access to ids, associate edges to graph ids

#print("Graph ID:")

### Main code execution
for subgraph in filt_subgraphs:
    graph_edges = get_edges(subgraph,DiG,cache)
    n_edges,n_nodes = get_mst_results(graph_edges)

print(f"Number of edges in the MST: {n_edges}")
print(f"Number of nodes in the MST: {n_nodes}")
print("\n")


---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[21], line 10
      8 for subgraph in filt_subgraphs:
      9     graph_edges = get_edges(subgraph,DiG,cache)
---> 10     n_edges,n_nodes = get_mst_results(graph_edges)
     12 print(f"Number of edges in the MST: {n_edges}")
     13 print(f"Number of nodes in the MST: {n_nodes}")

Cell In[16], line 7, in get_mst_results(edges_df)
      4 mst_graph.from_pandas_edgelist(edges_df, source='source', destination='destination', edge_attr='weight')
      6 # Compute the MST
----> 7 mst = cugraph.minimum_spanning_tree(mst_graph)
      9 # Extract the edge list of the MST
     10 mst_edges = mst.view_edge_list()

File /scratch/user/anaconda3/envs/rapids-24.04/lib/python3.11/site-packages/cugraph/tree/minimum_spanning_tree.py:105, in minimum_spanning_tree(G, weight, algorithm, ignore_nan)
    103     return cugraph_to_nx(mst)
    104 else:
--> 105     return _minimum_spanning_tree_subgraph(G)

File /scratch/user/anaconda3/envs/rapids-24.04/lib/python3.11/site-packages/cugraph/tree/minimum_spanning_tree.py:26, in _minimum_spanning_tree_subgraph(G)
     24 if G.is_directed():
     25     raise ValueError("input graph must be undirected")
---> 26 mst_df = minimum_spanning_tree_wrapper.minimum_spanning_tree(G)
     27 if G.renumbered:
     28     mst_df = G.unrenumber(mst_df, "src")

File minimum_spanning_tree_wrapper.pyx:71, in cugraph.tree.minimum_spanning_tree_wrapper.minimum_spanning_tree()

File minimum_spanning_tree_wrapper.pyx:52, in cugraph.tree.minimum_spanning_tree_wrapper.mst_double()

RuntimeError: RAFT failure at file=/scratch/user/anaconda3/envs/rapids-24.04/include/raft/sparse/solver/detail/mst_solver_inl.cuh line=152: 

Code of Conduct

  • [X] I agree to follow cuGraph's Code of Conduct
  • [X] I have searched the open issues and have found no duplicates for this question

stephwon avatar Jun 03 '24 00:06 stephwon

Is this a dataset you can share? This is not an error I've seen with some of our existing data sets. The call is somehow getting more answers than should be possible. It's probably a bug that only occurs with a specific dataset.

ChuckHastings avatar Jun 28 '24 15:06 ChuckHastings

@ChuckHastings Thanks for the earlier help with running cugraph leiden. I am observing this issue with cugraph MST as well. It runs on the com-Orkut graph, but not on the following graphs:

- web-Stanford
- indochina-2004
- com-LiveJournal
- asia_osm
- europe_osm
- kmer_A2a
- kmer_V1r

The reported error is as follows:

Converting /home/graphwork/Data/web-Stanford.mtx to /home/graphwork/Data/web-Stanford.mtx.csv ...
Running cuGraph MST on /home/graphwork/Data/web-Stanford.mtx.csv ...
Initializing RMM pool...
Reading graph from file: /home/graphwork/Data/web-Stanford.mtx.csv
Creating cuGraph graph...
Running minimum_spanning_tree (first)...
Traceback (most recent call last):
  File "/home/graphwork/Documents/MST/test-cugraph-mst/main.py", line 31, in <module>
    mst = cugraph.minimum_spanning_tree(G)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/graphwork/miniconda3/envs/cugraph-env/lib/python3.11/site-packages/cugraph/tree/minimum_spanning_tree.py", line 105, in minimum_spanning_tree
    return _minimum_spanning_tree_subgraph(G)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/graphwork/miniconda3/envs/cugraph-env/lib/python3.11/site-packages/cugraph/tree/minimum_spanning_tree.py", line 26, in _minimum_spanning_tree_subgraph
    mst_df = minimum_spanning_tree_wrapper.minimum_spanning_tree(G)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "minimum_spanning_tree_wrapper.pyx", line 71, in cugraph.tree.minimum_spanning_tree_wrapper.minimum_spanning_tree
  File "minimum_spanning_tree_wrapper.pyx", line 52, in cugraph.tree.minimum_spanning_tree_wrapper.mst_double
RuntimeError: RAFT failure at file=/home/graphwork/miniconda3/envs/cugraph-env/include/raft/sparse/solver/detail/mst_solver_inl.cuh line=152:

The output on com-Orkut (using A100) is as follows:

Converting /home/graphwork/Data/com-Orkut.mtx to /home/graphwork/Data/com-Orkut.mtx.csv ...
Running cuGraph MST on /home/graphwork/Data/com-Orkut.mtx.csv ...
Initializing RMM pool...
Reading graph from file: /home/graphwork/Data/com-Orkut.mtx.csv
Creating cuGraph graph...
Running minimum_spanning_tree (first)...
Running minimum_spanning_tree...
minimum_spanning_tree weight: 3072440.000000
minimum_spanning_tree took: 0.066725 s
Running minimum_spanning_tree...
minimum_spanning_tree weight: 3072440.000000
minimum_spanning_tree took: 0.066817 s
Running minimum_spanning_tree...
minimum_spanning_tree weight: 3072440.000000
minimum_spanning_tree took: 0.066751 s
Running minimum_spanning_tree...
minimum_spanning_tree weight: 3072440.000000
minimum_spanning_tree took: 0.066882 s

wolfram77 avatar Jan 08 '25 02:01 wolfram77

I can recreate your error with our latest code. I will be digging into this and let you know how things progress.

ChuckHastings avatar Feb 06 '25 18:02 ChuckHastings

@wolfram77 I was able to reproduce the error you are seeing with a much smaller datasets [karate]. You can track the progress of this work in this issue where I provide explanations pertaining to this issue. In fact, you are not providing edge weights hence the implied values are 1.0 making them non distinct and a quick solution while this issue is addressed in the raft API could be to provide distinct weights as below. However, this may lead to non deterministic but valid result.

df = cudf.read_csv(datasets, sep=' ', usecols=columns_to_read)
distinct_wgt = np.random.choice(np.arange(0, 1, 0.001), size=len(df), replace=False)
df['wgt'] = distinct_wgt
G.from_cudf_edgelist(df, source='src', destination='dst', edge_attr='wgt' renumber=False)

jnke2016 avatar Jun 17 '25 11:06 jnke2016

@jnke2016 Thanks for an update on this. So, assigning random weights fixes this issue. Cool. My friend was working on this problem for massive graphs, and so I was trying this out.

wolfram77 avatar Jun 17 '25 14:06 wolfram77

https://github.com/rapidsai/raft/pull/2707

ChuckHastings avatar Jun 17 '25 16:06 ChuckHastings

I went through the issue again. It seems you are adding random weights in the range [0, 1). This would make the new weights [1, 2). This would work, since the weight of no edge can be greater than or equal to the weight of two or more edges.

But, this assumption will break on weighted graphs, with say, all edge weights being 0.1. In such a case multiple edge connecting another vertex could be considered shorter than a single edge.

Perpahs, the easiest approach would be to consider the vertex IDs for tie breaking, instead of relying purely of edge weights. What do you think?

wolfram77 avatar Jun 18 '25 15:06 wolfram77

I went through the issue again. It seems you are adding random weights in the range [0, 1)

That's correct.

But, this assumption will break on weighted graphs, with say, all edge weights being 0.1. In such a case multiple edge connecting another vertex could be considered shorter than a single edge.

Keep in mind that the random weights are added in a way that preserves the relative order the edges hence this will work even if all edge weights are 0.1. In fact, there is a test covering this case were all weights are either 0.1 or 0.2.

Perpahs, the easiest approach would be to consider the vertex IDs for tie breaking, instead of relying purely of edge weights. What do you think

I do agree that the random weights approach exposes the implementation to several edge cases hence, our new implementation of MST(not currently on our roadmap) will likely break ties with vertex IDs.

jnke2016 avatar Jun 27 '25 16:06 jnke2016

@jnke2016 I have added a comment in https://github.com/rapidsai/raft/pull/2707/files

wolfram77 avatar Jun 30 '25 15:06 wolfram77

Thanks @wolfram77 . I replied

jnke2016 avatar Jun 30 '25 16:06 jnke2016