pygraphistry [FEA] anonymize graph

Open lmeyerov opened this issue 1 year ago • 3 comments

Is your feature request related to a problem? Please describe.

When sharing graphs with others, especially when going from a private server / private account to the public hub (for example, to publicize or debug), it'd help to have a way to quickly anonymize a graph.

Sample use cases to make fast:

show topology-only
with and without renaming topology identifiers
with and without renaming all cols
including/dropping specific columns
with/without preserving topology (prevent decloaking)
with/without preserving value distributions
as needed, opt in/out for particular columns

Perf:

fast for graphs < 10M nodes, edges
path to bigger graphs: if pandas, stick to vector ops, ...
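To make the "stick to vector ops" point concrete, here is a minimal sketch of renaming topology identifiers without any per-row Python loop. The function name `remap_ids` and the `src`/`dst` column names are hypothetical, and it works on a plain pandas edge table rather than a graphistry object. Note that `pd.factorize` hands out codes in first-appearance order, so a real implementation would probably want to shuffle them as well.

```python
import pandas as pd

def remap_ids(edges, src="src", dst="dst"):
    """Replace node identifiers with integer codes using vectorized ops only.

    Hypothetical helper for illustration; not part of pygraphistry.
    """
    # Factorize over both endpoint columns together so a node gets
    # the same code whether it appears as a source or a destination.
    all_ids = pd.concat([edges[src], edges[dst]], ignore_index=True)
    codes, uniques = pd.factorize(all_ids)

    n = len(edges)
    out = edges.copy()
    out[src] = codes[:n]
    out[dst] = codes[n:]

    # Position in this Series is the code; value is the original identifier.
    mapping = pd.Series(uniques, name="original")
    return out, mapping

edges = pd.DataFrame({
    "src": ["alice", "bob", "alice"],
    "dst": ["bob", "carol", "carol"],
})
anon_edges, mapping = remap_ids(edges)
```

The same pattern should carry over to dask/cudf fairly directly, since it only relies on concat, factorize, and column assignment.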

Describe the solution you'd like

Something declarative and configurable like:

g2 = g.anonymize(
    node_policy={
        'include': ['col1', ...],  # safelist of columns to include
        'preserve': ['col1', ...],  # opt-in columns not to anonymize
        'rename': True,  # or a list of columns to rename: ['col1', ...]
        'sample_drop': 0.2,  # fraction of nodes to drop; 0 (default) means preserve all
        'sample_add': 0.2  # fraction of nodes to add; 0 (default) means add none
    },
    edge_policy={
        'drop': ['col2', ...]  # switch to opt-out via columns to exclude
    },
    sample_keep=...,
    sample_add=...
)

g2.plot()

g_orig = g2.deanonymize(g2._anon_remapping)
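As a rough idea of what the anonymize/deanonymize round trip could look like underneath, here is a small sketch on a plain pandas edge table. Everything here is an assumption for illustration: `anonymize_edges`, `deanonymize_edges`, the `src`/`dst` column names, and the shape of the remapping (a Series from code to original id). None of it exists in pygraphistry today.

```python
import numpy as np
import pandas as pd

def anonymize_edges(edges, src="src", dst="dst", drop=(), seed=None):
    """Drop the listed columns and replace node ids with shuffled integer codes.

    Hypothetical sketch; returns the anonymized edges and a code -> original map.
    """
    rng = np.random.default_rng(seed)
    kept = edges.drop(columns=list(drop))

    ids = pd.unique(pd.concat([kept[src], kept[dst]], ignore_index=True))
    # Shuffle the codes so first-appearance order is not leaked.
    codes = rng.permutation(len(ids))
    forward = pd.Series(codes, index=ids)  # original -> code

    out = kept.copy()
    out[src] = kept[src].map(forward)
    out[dst] = kept[dst].map(forward)

    remapping = pd.Series(ids, index=codes)  # code -> original
    return out, remapping

def deanonymize_edges(anon, remapping, src="src", dst="dst"):
    """Restore original node ids; dropped columns cannot be recovered."""
    out = anon.copy()
    out[src] = anon[src].map(remapping)
    out[dst] = anon[dst].map(remapping)
    return out

edges = pd.DataFrame({
    "src": ["a", "b", "a"],
    "dst": ["b", "c", "c"],
    "secret": [1, 2, 3],
})
anon, remapping = anonymize_edges(edges, drop=["secret"], seed=0)
restored = deanonymize_edges(anon, remapping)
```

Keeping the remapping as a separate object (rather than only on the graph) would make it easier to share the anonymized graph without accidentally shipping the key along with it.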

Sample transforms:

rename columns
remap categoricals, including both range values & distribution, but preserve type
resample edges, both removing/adding
... and shift topology distributions & supernode locations
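Two of the transforms above can be sketched with vectorized pandas/numpy. This is only an illustration under assumptions: `remap_categorical` and `resample_edges` are hypothetical names, the `src`/`dst` columns are assumed, tokens are strings (so "preserve type" only holds for string columns here), and added edges are drawn uniformly between existing nodes, which is just one possible way to perturb topology.

```python
import numpy as np
import pandas as pd

def remap_categorical(col, prefix="v", seed=None):
    """Swap each distinct value for an opaque token; frequencies are unchanged.

    Hypothetical sketch; missing values stay missing.
    """
    rng = np.random.default_rng(seed)
    uniques = pd.unique(col.dropna())
    # Permute token numbers so they do not follow first-appearance order.
    tokens = [f"{prefix}{i}" for i in rng.permutation(len(uniques))]
    return col.map(pd.Series(tokens, index=uniques))

def resample_edges(edges, src="src", dst="dst",
                   drop_frac=0.0, add_frac=0.0, seed=None):
    """Randomly drop a fraction of edges and add random edges between known nodes."""
    rng = np.random.default_rng(seed)
    n = len(edges)

    # Each edge is kept independently with probability 1 - drop_frac.
    kept = edges[rng.random(n) >= drop_frac]

    nodes = np.asarray(
        pd.unique(pd.concat([edges[src], edges[dst]], ignore_index=True)),
        dtype=object,
    )
    n_add = int(round(add_frac * n))
    fake = pd.DataFrame({
        src: rng.choice(nodes, size=n_add),
        dst: rng.choice(nodes, size=n_add),
    })
    # Non-endpoint columns of the added edges come out as missing values.
    return pd.concat([kept, fake], ignore_index=True)

col = pd.Series(["x", "y", "x", "x", None])
masked = remap_categorical(col, seed=1)

edges = pd.DataFrame({"src": ["a", "b", "c", "a"], "dst": ["b", "c", "a", "c"]})
noisy = resample_edges(edges, drop_frac=0.0, add_frac=0.5, seed=1)
```

Shifting value distributions or supernode locations would need something beyond this, for example non-uniform sampling of the added endpoints.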

If there is a popular, well-maintained tabular or graph-centric library here, we should consider using it, but not if it looks like a maintenance or security risk.

Additional context

Ultimately it'd be good to push this to the UI via some sort of safe mode: role-specific masking, ...

Jul 29 '22 21:07 lmeyerov

Hello @lmeyerov 😇, I am interested in contributing to this. Can you assign this issue to me? Any tips on where to start?

Sep 09 '22 18:09 sky-2002

awesome!

Where it fits: it'd probably live in https://github.com/graphistry/pygraphistry/tree/master/graphistry/compute , where you can see the pattern of functional methods that take the graph (def some_method(self, ...): self._edges ...)
Testing: run this script for testing: https://github.com/graphistry/pygraphistry/blob/master/docker/test-cpu-local-minimal.sh
Implementation: Starting with a pandas-based impl is probably best, and later we can scale via dask/cudf/etc. The trick is to stick to vectorized pandas operations: https://pythonspeed.com/articles/pandas-vectorization/
Design: Maybe start with something minimal that we can land, and then we can grow the interface from there?

Sep 09 '22 20:09 lmeyerov

(happy to review PRs as they happen!)

Sep 09 '22 20:09 lmeyerov
