pygraphistry
[FEA] anonymize graph
Is your feature request related to a problem? Please describe.
When sharing graphs with others, especially when going from a private server / private account to the public Hub (e.g., for publicizing or debugging), it'd help to have a way to quickly anonymize a graph.
Sample use cases to make fast:
- show topology-only
- with and without renaming topology identifiers
- with and without renaming all cols
- including/dropping specific columns
- with/without preserving topology (prevent decloaking)
- with/without preserving value distributions
- as needed, opt in/out for particular columns
Perf:
- fast for graphs < 10M nodes, edges
- path to bigger graphs: if pandas, stick to vector ops, ...
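To make the first two use cases concrete, here is a minimal pandas sketch of a topology-only view with optional identifier renaming. The helper name `topology_only` and the example column names are hypothetical and not part of pygraphistry; it assumes the edge table is a DataFrame without missing endpoint values, and it sticks to vectorized operations (`pd.factorize`).

```python
import pandas as pd

def topology_only(edges: pd.DataFrame, src: str, dst: str, rename: bool = True):
    """Hypothetical helper: keep only the endpoint columns, optionally renaming IDs.

    Returns (anonymized_edges, mapping), where mapping is a Series indexed by
    the new integer code and holding the original identifier (None if rename=False).
    """
    out = edges[[src, dst]].copy()  # drop every attribute column
    if not rename:
        return out, None
    # Factorize both endpoint columns together so an ID gets the same code
    # whether it appears as a source or as a destination.
    stacked = pd.concat([out[src], out[dst]], ignore_index=True)
    codes, uniques = pd.factorize(stacked)
    n = len(out)
    out[src] = codes[:n]
    out[dst] = codes[n:]
    mapping = pd.Series(uniques, name="original_id")  # code -> original ID
    return out, mapping

edges = pd.DataFrame({
    "s": ["alice", "bob", "alice"],
    "d": ["bob", "carol", "carol"],
    "amount": [10, 20, 30],
})
anon, mapping = topology_only(edges, "s", "d")

# The mapping is what a deanonymize step would need to restore the original IDs.
restored = anon.assign(s=anon["s"].map(mapping), d=anon["d"].map(mapping))
```

Codes are assigned in order of first appearance, which can itself leak information; a real implementation would likely shuffle them and keep the mapping private.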
Describe the solution you'd like
Something declarative and configurable like:
g2 = g.anonymize(
    node_policy={
        'include': ['col1', ...],  # safelist of columns to include
        'preserve': ['col1', ...],  # opt-in columns not to anonymize
        'rename': True,  # or a list of columns: ['col1', ...]
        'sample_drop': 0.2,  # fraction of nodes to drop; 0 (default) means preserve all
        'sample_add': 0.2  # fraction of nodes to add; 0 (default) means add none
    },
    edge_policy={
        'drop': ['col2', ...]  # switch to opt-out via columns to exclude
    },
    sample_keep=...,
    sample_add=...
)
g2.plot()
g_orig = g2.deanonymize(g2._anon_remapping)
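As a rough illustration of how the column part of such a policy dict could be interpreted, here is a sketch over a plain DataFrame. Everything in it is an assumption for discussion: the helper `apply_column_policy`, the `id_cols` parameter, the `col_N` naming scheme, and the choice to treat `preserve` columns as exempt from renaming. It is not an existing pygraphistry API.

```python
import pandas as pd

def apply_column_policy(df: pd.DataFrame, policy: dict, id_cols=()):
    """Hypothetical helper: select and rename columns according to a policy dict.

    Returns (new_df, reverse), where reverse maps new column names back to the
    original ones so the renaming can be undone later.
    """
    cols = list(df.columns)
    if "include" in policy:  # safelist: keep only these, plus the ID columns
        keep = set(policy["include"]) | set(id_cols)
        cols = [c for c in cols if c in keep]
    if "drop" in policy:  # opt-out: remove these
        dropped = set(policy["drop"])
        cols = [c for c in cols if c not in dropped]
    out = df[cols].copy()

    # Assumption: preserved columns and ID columns are never renamed.
    exempt = set(policy.get("preserve", [])) | set(id_cols)
    rename = policy.get("rename", False)
    if rename is True:
        targets = [c for c in cols if c not in exempt]
    elif rename:
        targets = [c for c in rename if c in cols and c not in exempt]
    else:
        targets = []
    # Note: col_N could collide with an existing column name; a real
    # implementation would need to check for that.
    forward = {c: f"col_{i}" for i, c in enumerate(targets)}
    reverse = {new: old for old, new in forward.items()}
    return out.rename(columns=forward), reverse

nodes = pd.DataFrame({
    "id": [1, 2],
    "email": ["a@example.com", "b@example.com"],
    "age": [30, 40],
    "ssn": ["111", "222"],
})
policy = {"include": ["email", "age"], "preserve": ["age"], "rename": True}
anon_nodes, reverse = apply_column_policy(nodes, policy, id_cols=["id"])

# Undo the renaming with the saved mapping.
restored_nodes = anon_nodes.rename(columns=reverse)
```

Returning the reverse mapping alongside the result is one way to support a later deanonymize step without storing the original data on the anonymized graph.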
Sample transforms:
- rename columns
- remap categoricals, including both range values & distribution, but preserve type
- resample edges, both removing/adding
- ... and shift topology distributions & supernode locations
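Two of these transforms can be sketched with vectorized pandas/NumPy calls. The function names, parameters, and token format below are hypothetical. The categorical remap replaces each distinct value with an opaque string token, so a string column stays a string column and value frequencies are unchanged. The edge resampler drops a random fraction of edges and adds random edges between existing nodes, which deliberately perturbs the topology.

```python
import numpy as np
import pandas as pd

def remap_categorical(s: pd.Series, seed=None, prefix: str = "v"):
    """Hypothetical helper: swap each distinct value for an opaque token.

    Frequencies are preserved; missing values stay missing.
    Returns (remapped_series, mapping) with mapping: original value -> token.
    """
    uniques = s.dropna().unique()
    rng = np.random.default_rng(seed)
    # Shuffle token numbers so they do not reveal order of first appearance.
    order = rng.permutation(len(uniques))
    mapping = {u: f"{prefix}{i}" for u, i in zip(uniques, order)}
    return s.map(mapping), mapping

def resample_edges(edges: pd.DataFrame, src: str, dst: str,
                   drop_frac: float = 0.0, add_frac: float = 0.0, seed=None):
    """Hypothetical helper: randomly drop edges and add random ones.

    Only the endpoint columns are kept; added edges connect existing nodes.
    """
    rng = np.random.default_rng(seed)
    n = len(edges)
    keep_mask = rng.random(n) >= drop_frac  # each edge dropped with prob. drop_frac
    kept = edges.loc[keep_mask, [src, dst]]
    node_ids = pd.unique(pd.concat([edges[src], edges[dst]], ignore_index=True))
    n_add = int(round(add_frac * n))
    added = pd.DataFrame({
        src: rng.choice(node_ids, size=n_add),
        dst: rng.choice(node_ids, size=n_add),
    })
    return pd.concat([kept, added], ignore_index=True)

labels = pd.Series(["admin", "user", "admin", None])
anon_labels, label_map = remap_categorical(labels, seed=0)

edge_df = pd.DataFrame({"s": [1, 2, 3, 4], "d": [2, 3, 4, 1]})
noisy = resample_edges(edge_df, "s", "d", drop_frac=0.0, add_frac=0.5, seed=0)
```

Random added edges can include self-loops and duplicates; whether that is acceptable depends on how strongly the shared graph needs to resist decloaking.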
If there is a popular, well-maintained tabular or graph-centric library for this, we should consider using it ... but not if it looks like a maintenance or security risk
Additional context
Ultimately it'd be good to push this to the UI via some sort of safe mode: role-specific masking, ...
Hello @lmeyerov 😇, I am interested in contributing to this. Can you assign this issue to me? Any tips on where to start?
awesome!
- Where it fits: it'd probably live in https://github.com/graphistry/pygraphistry/tree/master/graphistry/compute , you can see the pattern there of functional methods that take the graph (def some_method(self, ...): self._edges ...)
- Testing: run this script for whatever testing: https://github.com/graphistry/pygraphistry/blob/master/docker/test-cpu-local-minimal.sh
- Implementation: Starting with a pandas-based impl is probably best, and later we can scale via dask/cudf/etc. The trick is to stick within vectorized pandas operations: https://pythonspeed.com/articles/pandas-vectorization/
- Design: Maybe start with something minimal that we can land, and then we can grow the interface from there?
(happy to review PRs as they happen!)
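For a sense of what a minimal first step in that style might look like, here is a sketch of one method that reads `self._nodes` and returns a new graph instead of mutating the old one. `MiniGraph` is a stand-in written for this example, not the real pygraphistry class, and `shuffle_node_columns` is a hypothetical name. Shuffling a column keeps its value distribution but breaks the link between values and nodes, and each column is handled with one vectorized call rather than a per-row loop.

```python
from dataclasses import dataclass, replace

import numpy as np
import pandas as pd

@dataclass(frozen=True)
class MiniGraph:
    """Stand-in for a graph object (assumption: not the pygraphistry class)."""
    _nodes: pd.DataFrame
    _edges: pd.DataFrame

    def shuffle_node_columns(self, cols, seed=None):
        """Return a new graph whose listed node columns are randomly permuted."""
        rng = np.random.default_rng(seed)
        nodes = self._nodes.copy()  # leave the original graph untouched
        for col in cols:
            # One vectorized permutation per column, no per-row Python loop.
            nodes[col] = rng.permutation(nodes[col].to_numpy())
        return replace(self, _nodes=nodes)

node_df = pd.DataFrame({"id": [0, 1, 2, 3], "age": [21, 35, 48, 62]})
edge_df = pd.DataFrame({"s": [0, 1, 2], "d": [1, 2, 3]})
g = MiniGraph(node_df, edge_df)
g2 = g.shuffle_node_columns(["age"], seed=1)
```

Landing one small transform like this first would leave room to grow the policy-style interface on top of it later.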