pygraphistry
[FEA] secondary index for node size
**Is your feature request related to a problem? Please describe.**
In categorical settings, default node sizes should often be normalized per category rather than globally: different categories may be sampled from different distributions, so relative comparison across categories doesn't make sense.
**Describe the solution you'd like**
```python
import numpy as np
import pandas as pd

def with_degree_type_sizing(base_g):
    # log-scale 'g2._nodes.degree'
    g2 = base_g.nodes(base_g._nodes[[base_g._node, 'type']].reset_index(drop=True)).get_degrees()
    g2 = g2.nodes(g2._nodes.assign(degree=pd.Series(np.log(g2._nodes['degree'].values))))
    degs = g2._nodes.degree

    # per-type min/max, so each type's largest node reaches full size
    min_scale = 0.5
    type_to_min_degree = g2._nodes.groupby('type').agg({'degree': 'min'}).to_dict()['degree']
    type_to_max_degree = g2._nodes.groupby('type').agg({'degree': 'max'}).to_dict()['degree']
    mns = g2._nodes['type'].map(type_to_min_degree)
    mxs = g2._nodes['type'].map(type_to_max_degree)
    multiplier = (degs - mns + 1.0) / (mxs - mns + 1.0)
    multiplier = min_scale + (1 - min_scale) * multiplier
    sizes = pd.Series(degs * multiplier)

    return (base_g
        .nodes(base_g._nodes.reset_index().assign(sz=sizes))
        .bind(point_size='sz'))
```
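For intuition, the per-type min/max arithmetic above can be replayed on a small hypothetical frame standing in for `g2._nodes` (all values here are invented, and the log step is assumed already applied):

```python
import pandas as pd

# Hypothetical stand-in for g2._nodes: two types with very different ranges
df = pd.DataFrame({
    'type':   ['a', 'a', 'b', 'b'],
    'degree': [1.0, 3.0, 10.0, 30.0],
})

min_scale = 0.5
mns = df['type'].map(df.groupby('type')['degree'].min())
mxs = df['type'].map(df.groupby('type')['degree'].max())
# within each type, the largest node gets multiplier 1.0, the smallest ~min_scale
multiplier = (df['degree'] - mns + 1.0) / (mxs - mns + 1.0)
multiplier = min_scale + (1 - min_scale) * multiplier
sizes = df['degree'] * multiplier
```

Note that the max-degree node of each type ends up with multiplier 1.0, which is what keeps a small-range type's biggest node visibly big.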
**Describe alternatives you've considered**
additional opts:
- specify primary dimension (vs degree) and secondary (vs type)
- specify min
- specify preconditioner (log, ...)

server support:
- in general, for multi-client
- in legend
- in UI

pygraphistry is a good start, as the server can ideally reuse it
Generalized a bit:
```python
import numpy as np
import pandas as pd

def with_type_sizing(base_g, scale_col="degree", partition_col='type', scaler=None, ignore_existing_scale_col=False):
    """
    Add a point_size encoding based on scale_col, where elements sharing the same partition_col value are normalized relative to one another.
    scale_col is either a column in the original nodes table, or one of "degree", "degree_in", "degree_out", which will be auto-generated.
    For example, if there are nodes of a few different partition_col="type" values and size is based on scale_col="degree",
    and the max type="a" degree is 10M while the max type="b" degree is only 20, this ensures a max-degree type="b" node is still big,
    even though its degree is tiny relative to type="a" nodes.

    :param scale_col: name of numeric column in g._nodes, or one of "degree", "degree_in", "degree_out"
    :param partition_col: name of column to use for splitting nodes into comparison groups
    :param scaler: a transform over scale_col before normalizing -- None, 'log' (natural), 'log2', 'log10'
    :param ignore_existing_scale_col: if scale_col is an existing column like degree, ignore it and recompute
    """
    print({'scale_col': scale_col, 'partition_col': partition_col, 'scaler': scaler, 'ignore_existing_scale_col': ignore_existing_scale_col})

    if partition_col not in base_g._nodes:
        raise ValueError(f'partition_col must be in node columns, could not find: {partition_col}')

    # special case scale_col=degrees
    if not ignore_existing_scale_col and scale_col in base_g._nodes:
        g2 = base_g.nodes(base_g._nodes[[base_g._node, partition_col, scale_col]].reset_index(drop=True))
    elif scale_col in ["degree", "degree_in", "degree_out"]:
        g2 = base_g.nodes(base_g._nodes[[base_g._node, partition_col]].reset_index(drop=True))
        if scale_col == 'degree':
            g2 = g2.get_degrees()
        elif scale_col == 'degree_in':
            g2 = g2.get_indegrees()
        else:
            g2 = g2.get_outdegrees()
    else:
        raise ValueError(f'Unexpected parameter scale_col={scale_col}; should be in g._nodes or one of "degree", "degree_in", "degree_out"')

    # precondition scale_col
    scaled = g2._nodes[scale_col].fillna(0) + 1
    print('scaled pre', scaled)
    if scaler is not None:
        ops = {'log': np.log, 'log2': np.log2, 'log10': np.log10}
        if scaler in ops:
            scaled = pd.Series((ops[scaler])(scaled.values))
        else:
            op_names = ', '.join(ops.keys())
            raise ValueError(f'scaler must be None, {op_names}; received: {scaler}')
    g3 = g2.nodes(g2._nodes.assign(**{scale_col: scaled}))
    print('scaled post', scaled)

    # per-partition min/max normalization
    min_scale = 0.5
    type_to_min = g3._nodes.groupby(partition_col).agg({scale_col: 'min'}).to_dict()[scale_col]
    type_to_max = g3._nodes.groupby(partition_col).agg({scale_col: 'max'}).to_dict()[scale_col]
    mns = g3._nodes[partition_col].map(type_to_min)
    mxs = g3._nodes[partition_col].map(type_to_max)
    multiplier = (scaled - mns + 1.0) / (mxs - mns + 1.0)
    multiplier = min_scale + (1 - min_scale) * multiplier
    sizes = pd.Series(scaled * multiplier)

    return (base_g
        .nodes(base_g._nodes.reset_index().assign(sz=sizes))
        .bind(point_size='sz'))
```
See also new impute_and_scale_matrix (ai branch) for additional auto-normalization options
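For experimenting outside of graphistry, the same preconditioning and per-partition normalization can be sketched with pandas alone; `partition_scaled_sizes` and its arguments are hypothetical names for this sketch, not part of pygraphistry:

```python
import numpy as np
import pandas as pd

def partition_scaled_sizes(df, scale_col, partition_col, scaler=None, min_scale=0.5):
    # Same preconditioning as with_type_sizing: fillna + 1, optional log transform
    scaled = df[scale_col].fillna(0) + 1
    if scaler is not None:
        ops = {'log': np.log, 'log2': np.log2, 'log10': np.log10}
        scaled = pd.Series(ops[scaler](scaled.values), index=df.index)
    # Per-partition min/max normalization into [min_scale, 1]
    mns = df[partition_col].map(scaled.groupby(df[partition_col]).min())
    mxs = df[partition_col].map(scaled.groupby(df[partition_col]).max())
    multiplier = (scaled - mns + 1.0) / (mxs - mns + 1.0)
    multiplier = min_scale + (1 - min_scale) * multiplier
    return scaled * multiplier

df = pd.DataFrame({'type': ['a', 'a', 'b'], 'degree': [0, 7, 3]})
sizes = partition_scaled_sizes(df, 'degree', 'type')
log_sizes = partition_scaled_sizes(df, 'degree', 'type', scaler='log2')
```

A singleton partition (like type `b` here) degenerates to multiplier 1.0, so lone nodes keep their full scaled size.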
This is awesome!
Can `elif scale_col in ["degree", "degree_in", "degree_out"]:` be changed to `if scale_col in ["degree", "degree_in", "degree_out"]:` to remove the need for `ignore_existing_scale_col`?
I also think it would be cool if you could derive sizes based on a weighted combination of degree AND some node property
Something like `scale_cols = [{'degree_in': 3}, {'degree_out': 2}, {'node_property': 1}]` where the numbers here are weights
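A minimal sketch of what that weighted combination might look like, simplified to a single dict of column → weight; all names, and the choice to min/max-normalize each column before weighting, are hypothetical:

```python
import pandas as pd

def weighted_scale(df, scale_cols):
    """scale_cols: {column_name: weight}; returns the weighted sum, each
    column first normalized to [0, 1] so weights compare like-for-like."""
    total = sum(scale_cols.values())
    out = pd.Series(0.0, index=df.index)
    for col, w in scale_cols.items():
        vals = df[col].astype(float)
        rng = vals.max() - vals.min()
        # constant columns normalize to 1.0 rather than dividing by zero
        normed = (vals - vals.min()) / rng if rng > 0 else pd.Series(1.0, index=df.index)
        out = out + (w / total) * normed
    return out

df = pd.DataFrame({'degree_in': [0, 10], 'degree_out': [5, 5], 'score': [1, 0]})
combined = weighted_scale(df, {'degree_in': 3, 'degree_out': 2, 'score': 1})
```

The combined score could then feed into the per-partition normalization above in place of a single `scale_col`.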
multiple attributes, such as one as a relative importance factor, is interesting... good idea! will want to play with some scenarios, like degree+score, pagerank+risk, ... alex was thinking before of just straight up custom formulas. i like your relative weight idea too. something else might be type-based, like nested `scale_cols = {'type': {'user': 'degree', 'url': 'score'}}`...
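A tiny pandas sketch of that nested, type-based selection idea, where each node's scale value comes from the column named for its type (all names and values hypothetical):

```python
import pandas as pd

# Hypothetical nested mapping: per type, which column drives the size
nested = {'user': 'degree', 'url': 'score'}

df = pd.DataFrame({
    'type':   ['user', 'url'],
    'degree': [12, 1],
    'score':  [0.0, 0.9],
})

# For each row, look up the value from the column named by its type
picked = df.apply(lambda r: r[nested[r['type']]], axis=1)
```

The picked values would then still need per-type normalization, since 'degree' and 'score' live on different scales.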
elif.. i'm not quite following. The current goal in that block is supporting a few scenarios:
- `with_type_sizing(g)`: degree sizing, and b/c not there yet, get them computed for you
- `with_type_sizing(g, ignore_existing_scale_col=True)`: if you already have `degree` precomputed, as an opt-in optimization, skip the recalc
- `with_type_sizing(g, 'mycol')`: use some other existing col
- ... what other scenario are you thinking / trying to avoid?
Ah I see - I was only thinking about the first scenario
It could maybe be checked somehow within the function like `scale_col in ["degree", "degree_in", "degree_out"] and scale_col not in base_g._nodes`? Doesn't change much though of course 🙂