What is the expected repeatability of cugraph.leiden?
What is your question?
I know that one cannot expect exact repeatability over multiple runs of `cugraph.leiden` (even setting `random_state` to a constant). But how similar can we rely on the results being?
I've repeated 10 runs on a 15,000-vertex graph on 13 different GPUs, and got an average adjusted Rand index of 0.82 across the runs on a given GPU (min: 0.69, max: 0.95). That corresponds to a difference of ±1 cluster detected on average (and up to a 6-cluster difference in the worst case) for this graph. Is that as consistent as we can hope for?
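For readers less familiar with the metric: the adjusted Rand index compares two cluster assignments independently of label names, with 1.0 meaning identical partitions. A minimal illustration with made-up labels:

```python
from sklearn.metrics import adjusted_rand_score

run_a = [0, 0, 1, 1, 2, 2]
run_b = [1, 1, 0, 0, 2, 2]  # same grouping as run_a, different label names
run_c = [0, 0, 1, 2, 2, 2]  # one vertex moved to another cluster

adjusted_rand_score(run_a, run_b)  # 1.0: ARI is invariant to label permutation
adjusted_rand_score(run_a, run_c)  # < 1.0: the partitions genuinely differ
```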
Code of Conduct
- [X] I agree to follow cuGraph's Code of Conduct
- [X] I have searched the open issues and have found no duplicates for this question
@naimnv - can you check this out? I haven't looked at how we are using the random numbers in Leiden and how deterministic we should expect things to be. I think it's a reasonable expectation (in the abstract) that if you provide the same random number state to begin an algorithm and run it on the same data that you would get the same result. If there is something that we're doing that makes the algorithm non-deterministic (even when including the state of the random number generator as one of the controlled inputs), we should probably understand and document this.
@jpintar - Is there a reproducer you could share to jump start our analysis? Ideally something that loads a graph, runs your 10 runs and demonstrates the variations you are seeing.
Here is a zipped `npz` file containing the adjacency matrix of a 100,000-vertex graph. The quick-and-dirty code below will illustrate the issue:
```python
import pandas as pd
import cudf as cd
import numpy as np
import cupy as cp
import scipy as sp
import cugraph as cg
import itertools
import random
from sklearn.metrics import adjusted_rand_score


def set_seeds(seed):
    random.seed(seed)
    np.random.seed(seed)
    cp.random.seed(seed)


# Function to repeatedly perform cugraph.leiden and return
# data frames with stats and partition assignments for each run
def repeat_leiden(graph, iterations, random_state, max_iter):
    leiden_stats = []
    leiden_partitions = []
    for i in range(iterations):
        # Set all seeds out of precaution
        set_seeds(random_state)
        # Run cugraph.leiden with fixed random_state
        parts, mod = cg.leiden(graph, random_state=random_state, max_iter=max_iter)
        leiden_stats.append(
            {
                'modularity': mod,
                'partition_count': parts['partition'].nunique()
            }
        )
        leiden_partitions.append(
            parts.sort_values('vertex').to_pandas()['partition']
        )
    leiden_stats = pd.DataFrame(leiden_stats)
    leiden_stats.index.name = 'run'
    leiden_partitions = pd.concat(
        leiden_partitions,
        axis=1,
        keys=[f"partition_{i}" for i in range(iterations)]
    ).reset_index(drop=True)
    leiden_partitions.index.name = 'vertex'
    return leiden_stats, leiden_partitions


# Function to compute adjusted Rand index between pairs of runs
def compute_rand_scores(partitions_df):
    rand_scores = np.full((partitions_df.shape[1] - 1, partitions_df.shape[1]), np.nan)
    for i, j in itertools.combinations(range(partitions_df.shape[1]), 2):
        rand_scores[i, j] = adjusted_rand_score(
            partitions_df[f"partition_{i}"],
            partitions_df[f"partition_{j}"]
        )
    rand_scores = np.delete(rand_scores, 0, axis=1)
    return rand_scores


# Load adjacency matrix and generate graph
adjacency = sp.sparse.load_npz('./adjacency.npz')
offsets = cd.Series(adjacency.indptr)
indices = cd.Series(adjacency.indices)
weights = cd.Series(adjacency.data)
g = cg.Graph()
g.from_cudf_adjlist(offsets, indices, weights)

# Perform clustering and compute Rand scores
leiden_stats, leiden_partitions = repeat_leiden(g, iterations=10, random_state=1234, max_iter=10000)
rand_scores = compute_rand_scores(leiden_partitions)

# Convert adjacency matrix to float64 and repeat
adjacency_64 = adjacency.astype('float64')
offsets_64 = cd.Series(adjacency_64.indptr)
indices_64 = cd.Series(adjacency_64.indices)
weights_64 = cd.Series(adjacency_64.data)
g_64 = cg.Graph()
g_64.from_cudf_adjlist(offsets_64, indices_64, weights_64)
leiden_stats_64, leiden_partitions_64 = repeat_leiden(g_64, iterations=10, random_state=1234, max_iter=10000)
rand_scores_64 = compute_rand_scores(leiden_partitions_64)

# Print results
np.set_printoptions(linewidth=150)
print(f"Min. modularity (float32): {leiden_stats['modularity'].min()}")
print(f"Mean modularity (float32): {leiden_stats['modularity'].mean()}")
print(f"Max. modularity (float32): {leiden_stats['modularity'].max()}\n")
print(f"Min. partition count (float32): {leiden_stats['partition_count'].min()}")
print(f"Max. partition count (float32): {leiden_stats['partition_count'].max()}\n")
print('Adjusted Rand index matrix (float32):')
print(rand_scores)
print(f"\nMin. adjusted Rand index (float32): {np.nanmin(rand_scores)}")
print(f"Mean adjusted Rand index (float32): {np.nanmean(rand_scores)}")
print(f"Max. adjusted Rand index (float32): {np.nanmax(rand_scores)}\n\n")
print(f"Min. modularity (float64): {leiden_stats_64['modularity'].min()}")
print(f"Mean modularity (float64): {leiden_stats_64['modularity'].mean()}")
print(f"Max. modularity (float64): {leiden_stats_64['modularity'].max()}\n")
print(f"Min. partition count (float64): {leiden_stats_64['partition_count'].min()}")
print(f"Max. partition count (float64): {leiden_stats_64['partition_count'].max()}\n")
print('Adjusted Rand index matrix (float64):')
print(rand_scores_64)
print(f"\nMin. adjusted Rand index (float64): {np.nanmin(rand_scores_64)}")
print(f"Mean adjusted Rand index (float64): {np.nanmean(rand_scores_64)}")
print(f"Max. adjusted Rand index (float64): {np.nanmax(rand_scores_64)}")
```
When I run this on an NVIDIA A100, a typical output looks like this:
Min. modularity (float32): 0.8951270580291748
Mean modularity (float32): 0.8961624324321746
Max. modularity (float32): 0.8971302509307861
Min. partition count (float32): 40
Max. partition count (float32): 44
Adjusted Rand index matrix (float32):
[[0.75088924 0.73472888 0.70018278 0.75595923 0.71810147 0.70763418 0.72208604 0.69235962 0.71466045]
[ nan 0.72100122 0.65890278 0.71082352 0.70139466 0.67741928 0.7483544 0.6896195 0.70494497]
[ nan nan 0.69125541 0.75972938 0.75688551 0.72175652 0.72950286 0.73229129 0.7107452 ]
[ nan nan nan 0.7182471 0.71213015 0.73381263 0.61834689 0.75539244 0.71371697]
[ nan nan nan nan 0.75071688 0.6976863 0.72713385 0.70082697 0.72347295]
[ nan nan nan nan nan 0.74508279 0.68753365 0.75040769 0.70460173]
[ nan nan nan nan nan nan 0.63538147 0.79620864 0.72844381]
[ nan nan nan nan nan nan nan 0.63022071 0.72055443]
[ nan nan nan nan nan nan nan nan 0.73431252]]
Min. adjusted Rand index (float32): 0.6183468875962401
Mean adjusted Rand index (float32): 0.7154546432734051
Max. adjusted Rand index (float32): 0.7962086419196115
Min. modularity (float64): 0.8953406848341683
Mean modularity (float64): 0.8962776434572689
Max. modularity (float64): 0.8976905676365944
Min. partition count (float64): 41
Max. partition count (float64): 44
Adjusted Rand index matrix (float64):
[[0.75088924 0.73472888 0.70018278 0.75595923 0.71810147 0.70763418 0.72208604 0.69235962 0.71466045]
[ nan 0.72100122 0.65890278 0.71082352 0.70139466 0.67741928 0.7483544 0.6896195 0.70494497]
[ nan nan 0.69125541 0.75972938 0.75688551 0.72175652 0.72950286 0.73229129 0.7107452 ]
[ nan nan nan 0.7182471 0.71213015 0.73381263 0.61834689 0.75539244 0.71371697]
[ nan nan nan nan 0.75071688 0.6976863 0.72713385 0.70082697 0.72347295]
[ nan nan nan nan nan 0.74508279 0.68753365 0.75040769 0.70460173]
[ nan nan nan nan nan nan 0.63538147 0.79620864 0.72844381]
[ nan nan nan nan nan nan nan 0.63022071 0.72055443]
[ nan nan nan nan nan nan nan nan 0.73431252]]
Min. adjusted Rand index (float64): 0.6581237251290206
Mean adjusted Rand index (float64): 0.7217165835712174
Max. adjusted Rand index (float64): 0.8139315511881964
So even after setting `random_state` to a constant (and other random seeds, too, just in case), there is considerable variation between runs. As can be seen, using `float64` doesn't improve things, either. All this is with cuGraph 23.12.00.
I will take a look at it tomorrow.
@jpintar Could you please provide us with the smallest adjacency matrix (perhaps sub-sampled) for which the cluster assignments differ, i.e. the adjusted Rand index is less than 1.0?
For the adjacency matrix that you provided, I locally get similar modularity from run to run (it differs by about 1e-03).
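One way to generate such sub-sampled test cases is to slice the induced subgraph out of the CSR adjacency matrix. A sketch, where a random symmetric matrix stands in for the real `adjacency.npz`:

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)
# Stand-in for the issue's adjacency matrix; any symmetric CSR matrix works the same way
adjacency = sp.random(5000, 5000, density=0.002, format='csr', random_state=0)
adjacency = adjacency + adjacency.T  # symmetrize

# Take the subgraph induced by a random subset of vertices
n_sub = 1100
keep = np.sort(rng.choice(adjacency.shape[0], size=n_sub, replace=False))
sub = adjacency[keep][:, keep].tocsr()  # adjacency of the induced subgraph
# sp.save_npz('adjacency_1100.npz', sub) would persist it for sharing
```

Note that for a KNN graph, slicing like this drops edges to removed vertices rather than recomputing neighbors, so it is only an approximation of re-running KNN on a subsample.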
I've been testing adjacency matrices of different sizes and I'm getting some Rand index values less than 1.0 at 1100 vertices (but not at 1000 vertices). A representative run:
Min. modularity (float32): 0.8092981576919556
Mean modularity (float32): 0.809576016664505
Max. modularity (float32): 0.8100026249885559
Min. partition count (float32): 15
Max. partition count (float32): 16
Adjusted Rand index matrix (float32):
[[1. 0.97365048 0.92694001 1. 1. 0.92694001 1. 0.98764321 0.98764321]
[ nan 0.97365048 0.92694001 1. 1. 0.92694001 1. 0.98764321 0.98764321]
[ nan nan 0.90061508 0.97365048 0.97365048 0.90061508 0.97365048 0.98588849 0.98588849]
[ nan nan nan 0.92694001 0.92694001 1. 0.92694001 0.91428736 0.91428736]
[ nan nan nan nan 1. 0.92694001 1. 0.98764321 0.98764321]
[ nan nan nan nan nan 0.92694001 1. 0.98764321 0.98764321]
[ nan nan nan nan nan nan 0.92694001 0.91428736 0.91428736]
[ nan nan nan nan nan nan nan 0.98764321 0.98764321]
[ nan nan nan nan nan nan nan nan 1. ]]
Min. adjusted Rand index (float32): 0.9006150827613733
Mean adjusted Rand index (float32): 0.9654275826626596
Max. adjusted Rand index (float32): 1.0
Min. modularity (float64): 0.8092981003204439
Mean modularity (float64): 0.8093171918721854
Max. modularity (float64): 0.8093935580791511
Min. partition count (float64): 15
Max. partition count (float64): 16
Adjusted Rand index matrix (float64):
[[0.98764321 1. 0.98764321 1. 1. 1. 1. 1. 1. ]
[ nan 0.98764321 1. 0.98764321 0.98764321 0.98764321 0.98764321 0.98764321 0.98764321]
[ nan nan 0.98764321 1. 1. 1. 1. 1. 1. ]
[ nan nan nan 0.98764321 0.98764321 0.98764321 0.98764321 0.98764321 0.98764321]
[ nan nan nan nan 1. 1. 1. 1. 1. ]
[ nan nan nan nan nan 1. 1. 1. 1. ]
[ nan nan nan nan nan nan 1. 1. 1. ]
[ nan nan nan nan nan nan nan 1. 1. ]
[ nan nan nan nan nan nan nan nan 1. ]]
Min. adjusted Rand index (float64): 0.987643214839121
Mean adjusted Rand index (float64): 0.995606476387243
Max. adjusted Rand index (float64): 1.0
By the way, these are all KNN graphs, generated from subsamples of real scRNA-seq data. The zipped `npz` file is attached.
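For context, a graph of this type can be built with scikit-learn's `kneighbors_graph`; the synthetic data and `n_neighbors=15` below are illustrative stand-ins for the actual scRNA-seq embedding and parameters:

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

rng = np.random.default_rng(0)
X = rng.normal(size=(1100, 50))  # stand-in for, e.g., a PCA embedding of scRNA-seq data

# Binary connectivity KNN graph as a CSR matrix
knn = kneighbors_graph(X, n_neighbors=15, mode='connectivity', include_self=False)
knn = knn.maximum(knn.T)  # symmetrize so the graph is undirected
```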
Have you been able to make any progress on this?
I am running into the same problem while running Leiden clustering on a KNN graph constructed with exact neighbors.
@jpintar @mbruhns
Thank you for reporting the issue. We have a fix that has been merged into branch-24.04, and one can use the branch-24.04 nightly builds to run the latest version of the Leiden algorithm.
With the latest code, we don't see any randomness for 1100 nodes any more. However, we still see some randomness (significantly less than before) for the input graph with 100,000 nodes that @jpintar shared with us. We are still working on it to figure out the possible sources of randomness.
Meanwhile, could you please check whether the adjusted Rand index is in an acceptable range for your use cases?
@jpintar Could you please find the smallest graph where the adjusted Rand index is less than 1.0?
Thank you.
Thank you for looking into this more! With version 24.04.00a49, the smallest graph where I see randomness is 3200 nodes (zipped `npz` file attached). Here is a typical set of runs for that graph:
Min. modularity (float32): 0.8323758244514465
Mean modularity (float32): 0.8323758363723754
Max. modularity (float32): 0.8323759436607361
Min. partition count (float32): 21
Max. partition count (float32): 21
[[0.9803510505 0.9898394315 0.9803510505 0.9803510505 0.9803510505 0.9803510505 0.9803510505 0.9803510505 0.9803510505]
[ nan 0.9904819326 1.0000000000 1.0000000000 1.0000000000 1.0000000000 1.0000000000 1.0000000000 1.0000000000]
[ nan nan 0.9904819326 0.9904819326 0.9904819326 0.9904819326 0.9904819326 0.9904819326 0.9904819326]
[ nan nan nan 1.0000000000 1.0000000000 1.0000000000 1.0000000000 1.0000000000 1.0000000000]
[ nan nan nan nan 1.0000000000 1.0000000000 1.0000000000 1.0000000000 1.0000000000]
[ nan nan nan nan nan 1.0000000000 1.0000000000 1.0000000000 1.0000000000]
[ nan nan nan nan nan nan 1.0000000000 1.0000000000 1.0000000000]
[ nan nan nan nan nan nan nan 1.0000000000 1.0000000000]
[ nan nan nan nan nan nan nan nan 1.0000000000]]
Min. adjusted Rand index (float32): 0.9803510505416672
Mean adjusted Rand index (float32): 0.9945889621392852
Max. adjusted Rand index (float32): 1.0
Min. modularity (float64): 0.8323758339478304
Mean modularity (float64): 0.8323758339478304
Max. modularity (float64): 0.8323758339478304
Min. partition count (float64): 21
Max. partition count (float64): 21
[[ 1. 1. 1. 1. 1. 1. 1. 1. 1.]
[nan 1. 1. 1. 1. 1. 1. 1. 1.]
[nan nan 1. 1. 1. 1. 1. 1. 1.]
[nan nan nan 1. 1. 1. 1. 1. 1.]
[nan nan nan nan 1. 1. 1. 1. 1.]
[nan nan nan nan nan 1. 1. 1. 1.]
[nan nan nan nan nan nan 1. 1. 1.]
[nan nan nan nan nan nan nan 1. 1.]
[nan nan nan nan nan nan nan nan 1.]]
Min. adjusted Rand index (float64): 1.0
Mean adjusted Rand index (float64): 1.0
Max. adjusted Rand index (float64): 1.0
Like you, I see improvement for the 100,000-node graph, especially for `float64`:
Min. modularity (float32): 0.8949086666107178
Mean modularity (float32): 0.896155321598053
Max. modularity (float32): 0.8966082334518433
Min. partition count (float32): 42
Max. partition count (float32): 43
[[0.6723109469 0.9709272313 0.6737672732 0.9620348230 0.8459497711 0.9620285832 0.7872396727 0.9631560364 0.8458706805]
[ nan 0.6747463127 0.9738234400 0.6784998752 0.6399577699 0.6729823392 0.7119539592 0.6746788762 0.6365549299]
[ nan nan 0.6733504544 0.9786646319 0.8622556460 0.9776790525 0.7935470305 0.9825837654 0.8562050057]
[ nan nan nan 0.6790135984 0.6397949990 0.6740246784 0.7162443494 0.6736307983 0.6351064191]
[ nan nan nan nan 0.8578959644 0.9802567737 0.7912772208 0.9728043958 0.8475066575]
[ nan nan nan nan nan 0.8554273786 0.7674377436 0.8518021370 0.9817595718]
[ nan nan nan nan nan nan 0.7939334290 0.9803393581 0.8487181300]
[ nan nan nan nan nan nan nan 0.7918065473 0.7654208202]
[ nan nan nan nan nan nan nan nan 0.8470709990]]
Min. adjusted Rand index (float32): 0.635106419080863
Mean adjusted Rand index (float32): 0.8093786683616847
Max. adjusted Rand index (float32): 0.9825837653527872
Min. modularity (float64): 0.8965980895995945
Mean modularity (float64): 0.8965980895995951
Max. modularity (float64): 0.8965980895995959
Min. partition count (float64): 42
Max. partition count (float64): 42
[[0.9519281421 0.9421814386 0.9544243928 0.9513851152 0.9505608662 0.9509022129 0.9458599876 0.9556839480 0.9462716129]
[ nan 0.9566963678 0.9751989656 0.9695654552 0.9698496606 0.9639583974 0.9701558107 0.9658491024 0.9681307955]
[ nan nan 0.9592236242 0.9541186566 0.9647041779 0.9648726289 0.9588673504 0.9519356888 0.9538477944]
[ nan nan nan 0.9788656461 0.9820126071 0.9732194225 0.9715958099 0.9801581983 0.9723216578]
[ nan nan nan nan 0.9728891017 0.9736846287 0.9690093318 0.9835770019 0.9685012892]
[ nan nan nan nan nan 0.9737014577 0.9797423804 0.9745889712 0.9718622543]
[ nan nan nan nan nan nan 0.9688776565 0.9706534031 0.9655177203]
[ nan nan nan nan nan nan nan 0.9709590313 0.9851334883]
[ nan nan nan nan nan nan nan nan 0.9725410395]]
Min. adjusted Rand index (float64): 0.9421814386490993
Mean adjusted Rand index (float64): 0.9656796509007762
Max. adjusted Rand index (float64): 0.9851334883268393
For my purposes, the new `float64` performance (worst-case adjusted Rand index of ~0.95 and no variation in partition count) is just about tolerable, though not ideal. But the `float32` performance is still not acceptable (and, in general, any variation in the number of partitions will be unacceptable).
Thank you for checking out the latest version. We're still investigating it to find out the remaining cause of the variation. We'll get back to you as soon as we know more.
Is there any progress on this? I find myself having to run a lot of Leiden clustering these days, and it would be great to be able to do this reliably on GPU. But as I said, without runs being consistent to an adjusted Rand index of at least ~0.95, preferably higher, that's just not an option...
Hi, we'll review the code again, aiming to eliminate any additional sources of randomness beyond numerical precision. I hope to get back to you soon.
We were able to identify and eliminate the variability. It turns out the Leiden implementation itself was fine, but the collapsing of the dendrogram into the final result was non-deterministic: the dendrogram was missing some information, known during algorithm execution, that would have allowed it to be flattened deterministically.
We modified Leiden to store within the dendrogram results that included sufficient detail to be able to return a deterministic flattening, thus giving us consistent results.
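As an illustration of the flattening step being described (a toy sketch, not cuGraph's actual data structure): each dendrogram level maps a vertex of the coarsened graph to its community at the next level, and composing the levels yields each original vertex's final community.

```python
def flatten_dendrogram(levels):
    """levels[k] maps a level-k vertex id to its community id at level k+1."""
    result = dict(levels[0])
    for level in levels[1:]:
        # Follow each original vertex one level further up the hierarchy
        for v, c in result.items():
            result[v] = level[c]
    return result

levels = [
    {0: 0, 1: 0, 2: 1, 3: 1, 4: 2},  # 5 vertices -> 3 communities
    {0: 0, 1: 0, 2: 1},              # 3 communities -> 2 communities
]
flatten_dendrogram(levels)  # {0: 0, 1: 0, 2: 0, 3: 0, 4: 1}
```

If the per-level mappings are fully recorded, this composition is deterministic; the bug described above amounted to the stored levels being ambiguous.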
Please retest with the latest 24.06 code (either from source or the nightly builds). If you have any further issues, open a new issue and let us know.
@ChuckHastings, thank you for working on this! I've just tested with version 24.06.00a43. The issue is resolved completely when the adjacency matrix is `float64`, but there is still some variability, although much less, when it's `float32` and the graph is large (75,000 nodes). I'm attaching the zipped `npz` file of the adjacency matrix. Here are the results I'm getting from 10 runs, as before:
Min. modularity (float32): 0.8989160060882568
Mean modularity (float32): 0.8994227409362793
Max. modularity (float32): 0.9014488458633423
Min. partition count (float32): 67
Max. partition count (float32): 68
[[1.0000000000 1.0000000000 0.8200713715 1.0000000000 1.0000000000 1.0000000000 0.8200713715 1.0000000000 1.0000000000]
[ nan 1.0000000000 0.8200713715 1.0000000000 1.0000000000 1.0000000000 0.8200713715 1.0000000000 1.0000000000]
[ nan nan 0.8200713715 1.0000000000 1.0000000000 1.0000000000 0.8200713715 1.0000000000 1.0000000000]
[ nan nan nan 0.8200713715 0.8200713715 0.8200713715 1.0000000000 0.8200713715 0.8200713715]
[ nan nan nan nan 1.0000000000 1.0000000000 0.8200713715 1.0000000000 1.0000000000]
[ nan nan nan nan nan 1.0000000000 0.8200713715 1.0000000000 1.0000000000]
[ nan nan nan nan nan nan 0.8200713715 1.0000000000 1.0000000000]
[ nan nan nan nan nan nan nan 0.8200713715 0.8200713715]
[ nan nan nan nan nan nan nan nan 1.0000000000]]
Min. adjusted Rand index (float32): 0.8200713715385887
Mean adjusted Rand index (float32): 0.9360253765470538
Max. adjusted Rand index (float32): 1.0
Since this is probably a different problem than the dendrogram flattening that you fixed, should I open a new issue? Or can you reopen this one?
I created a new issue for you with your comment above. We will investigate there since, as you suggest, it's likely a different problem with similar symptoms.