cugraph [FEA]: Markov Clustering for large protein similarity networks

Is this a new feature, an improvement, or a change to existing functionality?

New Feature

How would you describe the priority of this feature request

Critical (currently preventing usage)

Please provide a clear description of problem this feature solves

I would like cuGraph to implement the Markov clustering algorithm, because I would like to perform such a clustering on my graphs where, nodes are proteins, edges denote similarity between the nodes and edge weights are similarities. MCL is a commonly used algorithm in analyzing protein similarity networks. A recent publication performed such a clustering on a graph that consisted of 570,198,677 nodes and 5,196,499,560 edges using a distributed version of the algorithm, HipMCL. They required 2,500 compute nodes (170,000 compute cores) on the NERSC super computer for 3h20m. I estimate my graph to be around the same order of nodes and edges (if not more), but I don't have access to a supercomputer. I do have access to AWS and GPU instances. I should also add that I am completely new to the RAPIDS universe of tools, but quite excited and eager to learn.

Describe your ideal solution

The new function takes an existing graph as described above. The function would perform the following steps (borrowed from this repo)

import markov_clustering as mc
import networkx as nx
import random
from itertools import pairwise

def get_edge_list(nodes: list) -> list:
    numnodes = len(nodes)
    assert numnodes >= 2, f"Nodes list isn't long enough: {nodes}"
    return [(u, v, random.uniform(0, 1)) for u, v in pairwise(nodes) if u != v]

G = nx.DiGraph()
nodes = [i for i in range(1,101)]
edges = get_edge_list(nodes)
G.add_weighted_edges_from(edges)

# then get the adjacency matrix (in sparse form)
matrix = nx.to_scipy_sparse_matrix(G)

# cluster
result = mc.run_mcl(matrix, inflation=2)
clusters = mc.get_clusters(result)

# optionally tie into the visualization
mc.draw_graph(matrix, clusters, pos=positions, node_size=50, with_labels=False, edge_color="silver")

Describe any alternatives you have considered

I have considered:

https://micans.org/mcl/
https://github.com/guyallard/markov_clustering
https://pubmed.ncbi.nlm.nih.gov/29315405/

but none of them have GPU acceleration and I don't have the resources to rent the NERSC supercomputer for 4 hours.

Additional context

https://github.com/guyallard/markov_clustering
https://pubmed.ncbi.nlm.nih.gov/29315405/

Code of Conduct

[X] I agree to follow cuGraph's Code of Conduct
[X] I have searched the open feature requests and have found no duplicates for this feature request

Oct 12 '23 19:10 sunitj

Thanks for your interest!

We will review the linked papers and consider how this could be done in cuGraph and update this issue with information on where it might fit in our road map.

Oct 14 '23 16:10 ChuckHastings

I am also very much interested in an easily accessible GPU-accelerated MCL implementation!

Oct 24 '23 17:10 vincentvandeuren

cugraph cugraph copied to clipboard

[FEA]: Markov Clustering for large protein similarity networks

Is this a new feature, an improvement, or a change to existing functionality?

How would you describe the priority of this feature request

Please provide a clear description of problem this feature solves

Describe your ideal solution

Describe any alternatives you have considered

Additional context

Code of Conduct

cugraph
cugraph copied to clipboard