Can GraphRole be used on large networks?

Open rjurney opened this issue 3 years ago • 8 comments

We are interested in using this on a billion-node network. How well does it scale to large graphs? We can partition our network if required, but we don't know whether this is a multi-core implementation via networkx or something unlikely to scale beyond small networks.

rjurney avatar Aug 31 '21 04:08 rjurney

Hi @rjurney, thanks for your interest in using GraphRole. This package hasn't been tested at the scale you mention, and part of the implementation uses pandas, which might have problems at this scale.
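To give a rough sense of why an in-memory pandas table could be a bottleneck here, the following is a back-of-the-envelope sketch. The function name, the choice of 10 features per node, and the 8-byte value size are illustrative assumptions, not figures from GraphRole; the real feature count depends on the graph and the extraction settings.

```python
def dense_feature_table_bytes(num_nodes: int, num_features: int,
                              bytes_per_value: int = 8) -> int:
    """Estimate raw storage for a dense node-by-feature table.

    Hypothetical helper for illustration only. It ignores index
    overhead and any temporary copies made during computation,
    so real memory use would be higher.
    """
    return num_nodes * num_features * bytes_per_value


# Assumed inputs: one billion nodes, 10 float64 features per node.
estimate = dense_feature_table_bytes(1_000_000_000, 10)
print(f"{estimate / 1e9:.0f} GB")  # prints "80 GB"
```

Even under these modest assumptions the raw values alone exceed the memory of a typical single machine, which is why a distributed dataframe or graph backend is worth considering.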

One thing to note though is that GraphRole is not dependent on any particular graph library, so it can be integrated with any scalable graph library of your choice. All that needs to be done is to satisfy the required interface and make it discoverable. The steps are:

  1. Subclass the BaseGraphInterface class in graphrole.graph.interface.base.py and implement the required methods
  2. Update the INTERFACES dict in graphrole.graph.interface.__init__.py to make the new subclass discoverable

See full instructions in the README for setting up tests if so desired.
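As a rough illustration of the two steps above, here is a self-contained sketch of the pattern. It deliberately does not import GraphRole: `GraphInterfaceSketch` is a hypothetical stand-in for BaseGraphInterface, the method names are assumptions rather than the package's actual required methods, and `INTERFACES_SKETCH` with its string key only mimics the idea of the INTERFACES dict. Check base.py for the real abstract methods before adapting this.

```python
from abc import ABC, abstractmethod
from typing import Dict, Hashable, Iterable, List, Set, Type


class GraphInterfaceSketch(ABC):
    """Hypothetical stand-in for BaseGraphInterface (method names assumed)."""

    @abstractmethod
    def get_nodes(self) -> Iterable[Hashable]:
        """Return all node identifiers."""

    @abstractmethod
    def get_neighbors(self, node: Hashable) -> Iterable[Hashable]:
        """Return the neighbors of a node."""

    @abstractmethod
    def get_num_edges(self) -> int:
        """Return the number of undirected edges."""


class AdjacencyDictInterface(GraphInterfaceSketch):
    """Step 1: subclass the base and implement every abstract method.

    Backed by a plain dict of node -> set of neighbors; a scalable
    backend would wrap its own graph object here instead.
    """

    def __init__(self, adjacency: Dict[Hashable, Set[Hashable]]) -> None:
        self._adj = adjacency

    def get_nodes(self) -> List[Hashable]:
        return list(self._adj)

    def get_neighbors(self, node: Hashable) -> List[Hashable]:
        return sorted(self._adj[node])

    def get_num_edges(self) -> int:
        # Each undirected edge appears in two adjacency sets.
        return sum(len(nbrs) for nbrs in self._adj.values()) // 2


# Step 2: register the subclass so it can be looked up by name.
# The key scheme is an assumption made for this sketch.
INTERFACES_SKETCH: Dict[str, Type[GraphInterfaceSketch]] = {
    "adjacency_dict": AdjacencyDictInterface,
}

graph = INTERFACES_SKETCH["adjacency_dict"]({"a": {"b", "c"}, "b": {"a"}, "c": {"a"}})
print(graph.get_num_edges())  # prints 2
```

The point of the pattern is that the rest of the package only talks to the abstract methods, so swapping in a distributed backend should not require changes elsewhere, provided the subclass honors the same contract.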

I'd be very interested to know how it works out if you go down this route, please keep me posted!

dkaslovsky avatar Aug 31 '21 12:08 dkaslovsky

@dkaslovsky thanks, this is really helpful. What you've done here is really cool. I am encouraging the Deep Discovery team to implement this using PySpark and GraphFrames, and if we do, we will contribute it back. Setting up testing and so on may take some time, so we'll do an intermediate PR to get things started. cc @ajs-dd

rjurney avatar Aug 31 '21 13:08 rjurney

That's really exciting to hear. I've thought about adding a more scalable dataframe library in the past, so I'm really excited that you and your team might look into implementing one, and I'd be grateful for any contribution back to GraphRole. Please let me know if there's any help I can provide along the way!

dkaslovsky avatar Aug 31 '21 13:08 dkaslovsky

Oh, one other thought I forgot to mention is that Dask might also be a good option to explore for distributed dataframe functionality.

dkaslovsky avatar Aug 31 '21 13:08 dkaslovsky

@dkaslovsky yeah, but we have a 1.5 billion node business graph, so we need it to work across multiple machines and to have graph abstractions rather than just DataFrame ones. This is why GraphFrames is really nice. It runs on Spark and uses DataFrames but has graph operations.

https://graphframes.github.io/graphframes/docs/_site/index.html

rjurney avatar Aug 31 '21 14:08 rjurney

Ah, I see. A graphframes-based implementation sounds very appealing!

dkaslovsky avatar Aug 31 '21 14:08 dkaslovsky

@dkaslovsky in which PR? How?

rjurney avatar Mar 26 '23 21:03 rjurney

@rjurney Apologies, this was closed in error. Reopening.

dkaslovsky avatar Mar 26 '23 21:03 dkaslovsky