
[FEA] secondary index for node size

lmeyerov opened this issue · 6 comments

Is your feature request related to a problem? Please describe.

In categorical settings, default node sizes should often be normalized per category rather than globally: categories may be sampled from different distributions, so a global relative comparison doesn't make sense.

Describe the solution you'd like

import numpy as np
import pandas as pd

def with_degree_type_sizing(base_g):

    # log-scale 'g2._nodes.degree' (note: isolated nodes would hit log(0);
    # the generalized version below guards with a +1)
    g2 = base_g.nodes(base_g._nodes[[base_g._node, 'type']].reset_index(drop=True)).get_degrees()
    g2 = g2.nodes(g2._nodes.assign(degree=pd.Series(np.log(g2._nodes['degree'].values))))
    degs = g2._nodes.degree

    # per-type min-max normalization: each type's max degree maps to multiplier 1.0,
    # and min_scale keeps the smallest nodes visible
    min_scale = 0.5
    type_to_min_degree = g2._nodes.groupby('type').agg({'degree': 'min'}).to_dict()['degree']
    type_to_max_degree = g2._nodes.groupby('type').agg({'degree': 'max'}).to_dict()['degree']
    mns = g2._nodes['type'].map(type_to_min_degree)
    mxs = g2._nodes['type'].map(type_to_max_degree)
    multiplier = (degs - mns + 1.0) / (mxs - mns + 1.0)
    multiplier = min_scale + (1 - min_scale) * multiplier

    sizes = pd.Series(degs * multiplier)

    # drop=True keeps the index aligned with sizes without adding an 'index' column
    return (base_g
            .nodes(base_g._nodes.reset_index(drop=True).assign(sz=sizes))
            .bind(point_size='sz'))
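The per-type normalization at the heart of this can be demonstrated on a toy nodes table (column names and values here are hypothetical, and the sketch skips the log step): within each type, the largest degree maps to a multiplier of 1.0, while min_scale floors the smallest so it stays visible.

```python
import pandas as pd

# Hypothetical toy table: type "a" spans degrees 1-10, type "b" only 1-2
nodes = pd.DataFrame({
    'type':   ['a', 'a', 'a', 'b', 'b'],
    'degree': [1.0, 5.0, 10.0, 1.0, 2.0],
})

min_scale = 0.5
degs = nodes['degree']
mns = nodes['type'].map(nodes.groupby('type')['degree'].min())
mxs = nodes['type'].map(nodes.groupby('type')['degree'].max())
multiplier = (degs - mns + 1.0) / (mxs - mns + 1.0)
multiplier = min_scale + (1 - min_scale) * multiplier

# each type's max degree gets multiplier 1.0, regardless of absolute scale
sizes = degs * multiplier
```

Note both the type="a" max (degree 10) and the type="b" max (degree 2) end up with the full 1.0 multiplier, which is exactly the cross-category fairness the feature request asks for.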

Describe alternatives you've considered

additional opts

  • specify primary dimension (vs degree) and secondary (vs type)
  • specify min
  • specify preconditioner (log, ..)

server support

  • in general, for multi-client
  • in legend
  • in ui

pygraphistry is a good place to start, as the server can ideally reuse the implementation

— lmeyerov, Apr 14 '22

Generalized a bit:

import numpy as np
import pandas as pd

def with_type_sizing(base_g, scale_col="degree", partition_col='type', scaler=None, ignore_existing_scale_col=False):
    """
        Add a point_size encoding based on scale_col, where elements sharing a partition_col value are normalized relative to one another.

        scale_col is either a column in the original nodes table, or one of "degree", "degree_in", "degree_out", which will be auto-generated.

        For example, if there are nodes of a few different partition_col="type" values, and size is based on scale_col="degree",
        and the max type="a" degree is 10M while the max type="b" degree is only 20, this ensures a max-degree type="b" node is still big,
        even though its degree is tiny relative to type="a" nodes.

        :param scale_col: name of a numeric column in g._nodes, or one of "degree", "degree_in", "degree_out"
        :param partition_col: name of the column used to split nodes into comparison groups
        :param scaler: a transform applied to scale_col before normalizing -- None, 'log' (natural), 'log2', 'log10'
        :param ignore_existing_scale_col: if True, recompute a degree-style scale_col even when a same-named column already exists
    """

    if partition_col not in base_g._nodes:
        raise ValueError(f'partition_col must be in node columns, could not find: {partition_col}')

    # special-case scale_col=degree variants
    if not ignore_existing_scale_col and scale_col in base_g._nodes:
        g2 = base_g.nodes(base_g._nodes[[base_g._node, partition_col, scale_col]].reset_index(drop=True))
    elif scale_col in ["degree", "degree_in", "degree_out"]:
        g2 = base_g.nodes(base_g._nodes[[base_g._node, partition_col]].reset_index(drop=True))
        if scale_col == 'degree':
            g2 = g2.get_degrees()
        elif scale_col == 'degree_in':
            g2 = g2.get_indegrees()
        else:
            g2 = g2.get_outdegrees()
    else:
        raise ValueError(f'Unexpected parameter scale_col={scale_col}; should be in g._nodes or one of "degree", "degree_in", "degree_out"')

    # precondition scale_col; the +1 guards log(0) on isolated nodes
    scaled = g2._nodes[scale_col].fillna(0) + 1
    if scaler is not None:
        ops = {'log': np.log, 'log2': np.log2, 'log10': np.log10}
        if scaler in ops:
            scaled = pd.Series((ops[scaler])(scaled.values))
        else:
            op_names = ', '.join(ops.keys())
            raise ValueError(f'scaler must be None, {op_names}; received: {scaler}')
    g3 = g2.nodes(g2._nodes.assign(**{scale_col: scaled}))

    # per-partition min-max normalization: each partition's max maps to 1.0,
    # and min_scale keeps the smallest nodes visible
    min_scale = 0.5
    type_to_min = g3._nodes.groupby(partition_col).agg({scale_col: 'min'}).to_dict()[scale_col]
    type_to_max = g3._nodes.groupby(partition_col).agg({scale_col: 'max'}).to_dict()[scale_col]
    mns = g3._nodes[partition_col].map(type_to_min)
    mxs = g3._nodes[partition_col].map(type_to_max)
    multiplier = (scaled - mns + 1.0) / (mxs - mns + 1.0)
    multiplier = min_scale + (1 - min_scale) * multiplier

    sizes = pd.Series(scaled * multiplier)

    # drop=True keeps the index aligned with sizes without adding an 'index' column
    return (base_g
            .nodes(base_g._nodes.reset_index(drop=True).assign(sz=sizes))
            .bind(point_size='sz'))
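The scaler preconditioning step can be exercised standalone; a minimal sketch with a made-up, heavy-tailed degree column, showing both the dispatch-table pattern and why the +1 guard matters for isolated (degree-0) nodes:

```python
import numpy as np
import pandas as pd

# Hypothetical heavy-tailed degrees; the +1 guards log(0) for isolated nodes
degrees = pd.Series([0, 3, 9, 9999])
ops = {'log': np.log, 'log2': np.log2, 'log10': np.log10}
scaled = pd.Series(ops['log10'](degrees.fillna(0).values + 1))
# four orders of magnitude compress into roughly a 0-4 range
```

After this compression, the per-partition min-max step above operates on comparable magnitudes instead of letting a single hub node dwarf everything else.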

— lmeyerov, Apr 14 '22

See also new impute_and_scale_matrix (ai branch) for additional auto-normalization options

— lmeyerov, Apr 14 '22

This is awesome!

Can elif scale_col in ["degree", "degree_in", "degree_out"]: be changed to if scale_col in ["degree", "degree_in", "degree_out"]: to remove the need for ignore_existing_scale_col?

I also think it would be cool if you could derive sizes based on a weighted combination of degree AND some node property

— matthewatbb, Jun 08 '22

Something like scale_cols = [ {'degree_in': 3}, {'degree_out': 2}, {'node_property': 1} ] where the numbers here are weights

— matthewatbb, Jun 08 '22

  • multiple attributes, such as one acting as a relative importance factor, is interesting... good idea! Will want to play with some scenarios, like degree + score, pagerank + risk, ... Alex was thinking before of just straight-up custom formulas. I like your relative-weight idea too. Something else might be type-based, like nested scale_cols = {'type': {'user': 'degree', 'url': 'score'}}...

  • elif... I'm not quite following. The current goal in that block is supporting a few scenarios:

    • with_type_sizing(g): degree sizing, and because it's not there yet, get it computed for you
    • with_type_sizing(g, ignore_existing_scale_col=False) (the default): if you already have degree precomputed, reuse it and skip the recalc as an optimization
    • with_type_sizing(g, 'mycol'): use some other existing col
    • ... what other scenario are you thinking of / trying to avoid?
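The weighted-combination idea above could be sketched roughly like this (column names and weights are hypothetical; each column is min-max normalized before weighting so differing scales don't dominate):

```python
import pandas as pd

# Hypothetical nodes table and scale_cols-style weights
nodes = pd.DataFrame({
    'degree_in':  [1.0, 5.0, 10.0],
    'degree_out': [2.0, 2.0, 4.0],
    'score':      [0.1, 0.9, 0.5],
})
weights = {'degree_in': 3, 'degree_out': 2, 'score': 1}

# normalize each column to [0, 1], then take the weighted average
total = sum(weights.values())
combined = sum(
    (w / total) * (nodes[col] - nodes[col].min()) / (nodes[col].max() - nodes[col].min())
    for col, w in weights.items()
)
```

The resulting combined score is itself in [0, 1], so it could slot into the existing per-partition normalization as just another scale_col.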

— lmeyerov, Jun 08 '22

Ah I see - I was only thinking about the first scenario

It could maybe be checked within the function, like scale_col in ["degree", "degree_in", "degree_out"] and scale_col not in base_g._nodes? Doesn't change much though, of course 🙂

— matthewatbb, Jun 08 '22