pygraphistry

[FEA] anonymize graph

Open lmeyerov opened this issue 1 year ago • 3 comments

Is your feature request related to a problem? Please describe.

When sharing graphs with others, especially when going from a private server / private account to the public hub (e.g., for publicizing or debugging), it'd help to have a way to quickly anonymize a graph.

Sample use cases to make fast:

  • show topology-only
  • with and without renaming topology identifiers
  • with and without renaming all cols
  • including/dropping specific columns
  • with/without preserving topology (prevent decloaking)
  • with/without preserving value distributions
  • as needed, opt in/out for particular columns
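The topology-only and include/drop cases above can be sketched on plain pandas dataframes. This is only an illustration: `select_columns` and its parameters are hypothetical names, not an existing PyGraphistry API, and the example edge table is made up.

```python
import pandas as pd

def select_columns(df, required, include=None, drop=None):
    """Keep `required` columns plus an include-list, or everything minus a drop-list.

    Hypothetical helper for illustration; `required` holds the topology columns
    (e.g. source/destination) that must survive either policy.
    """
    if include is not None and drop is not None:
        raise ValueError("pass either include or drop, not both")
    if include is not None:
        allowed = set(required) | set(include)
        keep = [c for c in df.columns if c in allowed]
    elif drop is not None:
        keep = [c for c in df.columns if c in required or c not in drop]
    else:
        keep = list(df.columns)
    return df[keep].copy()

# Made-up example data
edges = pd.DataFrame({
    "src": ["a", "b", "c"],
    "dst": ["b", "c", "a"],
    "amount": [10, 20, 30],
    "email": ["x@e.com", "y@e.com", "z@e.com"],
})

# Topology-only: an empty include-list keeps just the required columns
topology_only = select_columns(edges, required=["src", "dst"], include=[])

# Opt-out: keep everything except the listed columns
without_email = select_columns(edges, required=["src", "dst"], drop=["email"])
```

An include-list (safelist) is the safer default for sharing, since newly added columns stay hidden unless someone opts them in.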

Perf:

  • fast for graphs < 10M nodes, edges
  • path to bigger graphs: if pandas, stick to vector ops, ...

Describe the solution you'd like

Something declarative and configurable like:

g2 = g.anonymize(
    node_policy={
        'include': ['col1', ...],   # safelist of columns to include
        'preserve': ['col1', ...],  # opt-in columns not to anonymize
        'rename': True,             # or a list of columns: ['col1', ...]
        'sample_drop': 0.2,         # fraction of nodes to drop; 0 (default) means preserve all
        'sample_add': 0.2           # fraction of nodes to add; 0 (default) means add none
    },
    edge_policy={
        'drop': ['col2', ...]       # switch to opt-out via columns to exclude
    },
    sample_keep=...,
    sample_add=...
)

g2.plot()

g_orig = g2.deanonymize(g2._anon_remapping)
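One way the rename/deanonymize round trip could work, sketched on plain pandas dataframes rather than a graph object. The function names, the `remapping` Series, and the example data are all assumptions for illustration; nothing here is an existing PyGraphistry method.

```python
import numpy as np
import pandas as pd

def anonymize_ids(nodes, edges, node_col, src, dst, seed=None):
    """Replace node identifiers with shuffled integers; return the remapping too."""
    ids = (
        pd.concat([nodes[node_col], edges[src], edges[dst]], ignore_index=True)
        .drop_duplicates()
        .to_numpy()
    )
    rng = np.random.default_rng(seed)
    # Shuffle so the new integer ids do not reveal the original ordering
    codes = rng.permutation(len(ids))
    remapping = pd.Series(codes, index=ids)  # original id -> anonymous id
    nodes2 = nodes.assign(**{node_col: nodes[node_col].map(remapping)})
    edges2 = edges.assign(**{
        src: edges[src].map(remapping),
        dst: edges[dst].map(remapping),
    })
    return nodes2, edges2, remapping

def deanonymize_ids(nodes, edges, remapping, node_col, src, dst):
    """Invert anonymize_ids using the saved remapping."""
    inverse = pd.Series(remapping.index, index=remapping.to_numpy())
    nodes2 = nodes.assign(**{node_col: nodes[node_col].map(inverse)})
    edges2 = edges.assign(**{
        src: edges[src].map(inverse),
        dst: edges[dst].map(inverse),
    })
    return nodes2, edges2

# Made-up example data
nodes = pd.DataFrame({"id": ["a", "b", "c"], "kind": ["user", "user", "host"]})
edges = pd.DataFrame({"src": ["a", "b", "c"], "dst": ["b", "c", "a"]})

anon_nodes, anon_edges, remapping = anonymize_ids(nodes, edges, "id", "src", "dst", seed=0)
nodes_back, edges_back = deanonymize_ids(anon_nodes, anon_edges, remapping, "id", "src", "dst")
```

Since the remapping is exactly what reverses the anonymization, a real implementation would need to make sure it does not travel with the shared graph.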

Sample transforms:

  • rename columns
  • remap categoricals, including both range values & distribution, but preserve type
  • resample edges, both removing/adding
  • ... and shift topology distributions & supernode locations
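A rough pandas sketch of three of the transforms listed above. All function names are hypothetical. `remap_categorical` only replaces the values and keeps their frequencies, so it is the "preserve distribution" variant, and `resample_edges` adds uniformly random edges, which is just one simple way to perturb topology.

```python
import numpy as np
import pandas as pd

def rename_columns(df, keep):
    """Rename every column not in `keep` to col_0, col_1, ...; also return the mapping."""
    targets = [c for c in df.columns if c not in keep]
    mapping = {c: f"col_{i}" for i, c in enumerate(targets)}
    return df.rename(columns=mapping), mapping

def remap_categorical(s):
    """Replace each distinct value with an opaque label, keeping a categorical dtype."""
    codes, uniques = pd.factorize(s)  # missing values get code -1
    labels = [f"v{i}" for i in range(len(uniques))]
    # from_codes treats -1 as missing, so NaNs stay NaN
    cat = pd.Categorical.from_codes(codes, categories=labels)
    return pd.Series(cat, index=s.index, name=s.name)

def resample_edges(edges, src, dst, drop_frac=0.0, add_frac=0.0, seed=None):
    """Drop a random fraction of edges and add random edges between existing nodes."""
    rng = np.random.default_rng(seed)
    n = len(edges)
    kept = edges[rng.random(n) >= drop_frac]
    node_ids = (
        pd.concat([edges[src], edges[dst]], ignore_index=True)
        .drop_duplicates()
        .to_numpy()
    )
    n_add = int(round(add_frac * n))
    added = pd.DataFrame({
        src: rng.choice(node_ids, size=n_add),
        dst: rng.choice(node_ids, size=n_add),
    })
    # Added edges have no attribute values, so other columns are NaN for them
    return pd.concat([kept, added], ignore_index=True)

# Made-up example data
example_edges = pd.DataFrame({
    "src": np.arange(100) % 10,
    "dst": (np.arange(100) * 7) % 10,
    "color": ["red", "blue"] * 50,
})

renamed, col_mapping = rename_columns(example_edges, keep=["src", "dst"])
masked_color = remap_categorical(example_edges["color"])
noisy_edges = resample_edges(example_edges, "src", "dst", add_frac=0.2, seed=0)
```

Note that `pd.factorize` numbers values in order of first appearance, so a stricter version would probably shuffle the labels as well.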

If there is a popular, well-maintained tabular or graph-centric library for this, we should consider using it ... but not if it looks like a maintenance or security risk

Additional context

Ultimately it'd be good to push this to the UI via some sort of safe mode: role-specific masking, ...

lmeyerov · Jul 29 '22 21:07

Hello @lmeyerov 😇, I am interested in contributing to this. Can you assign this issue to me? Any tips on where to start?

sky-2002 · Sep 09 '22 18:09

awesome!

  • Where it fits: it'd probably live in https://github.com/graphistry/pygraphistry/tree/master/graphistry/compute ; you can see the pattern there of functional methods that take the graph (def some_method(self, ...): self._edges ...)

  • Testing: run this script for any testing: https://github.com/graphistry/pygraphistry/blob/master/docker/test-cpu-local-minimal.sh

  • Implementation: Starting with a pandas-based impl is probably best, and later we can scale via dask/cudf/etc. The trick is to stick to vectorized pandas operations: https://pythonspeed.com/articles/pandas-vectorization/

  • Design: Maybe start with something minimal that we can land, and then we can grow the interface from there?
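To make the "functional method plus vectorized pandas" idea concrete, here is a minimal sketch. `TinyGraph` is a hypothetical stand-in for the graph object (it is not the PyGraphistry class), and `shuffle_edge_columns` is an assumed name. Shuffling a column keeps its value distribution exactly while breaking the link between values and rows.

```python
from dataclasses import dataclass, replace

import numpy as np
import pandas as pd

@dataclass(frozen=True)
class TinyGraph:
    """Hypothetical stand-in for a graph object; not the PyGraphistry class."""
    _edges: pd.DataFrame
    _source: str
    _destination: str

def shuffle_edge_columns(self: TinyGraph, cols, seed=None) -> TinyGraph:
    """Return a new graph whose listed edge columns are independently shuffled."""
    rng = np.random.default_rng(seed)
    # Vectorized: one permutation per column, no per-row Python calls.
    # A row-wise df.apply(..., axis=1) would run Python code once per row.
    shuffled = {c: rng.permutation(self._edges[c].to_numpy()) for c in cols}
    # assign() builds a new dataframe, so the input graph is left untouched
    return replace(self, _edges=self._edges.assign(**shuffled))

# Made-up example data
tiny_edges = pd.DataFrame({
    "src": ["a", "b", "c", "d"],
    "dst": ["b", "c", "d", "a"],
    "amount": [10, 20, 30, 40],
})
g = TinyGraph(tiny_edges, "src", "dst")
g2 = shuffle_edge_columns(g, ["amount"], seed=1)
```

Returning a new object instead of mutating in place keeps the method chainable, along the lines of `g.some_method(...).plot()`.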

lmeyerov · Sep 09 '22 20:09

(happy to review PRs as they happen!)

lmeyerov · Sep 09 '22 20:09