xgi Differentiate between numeric and categorical statistics

Currently, the stats package treats all NodeStat and EdgeStat objects the same. But some of them, e.g. node attributes, tend to be string-valued and thus categorical. In particular, the .mean(), .max(), .std() and other similar methods shouldn't be defined for categorical statistics. For example:

H = xgi.Hypergraph([[1, 2, 3], [2, 3, 4, 5], [3, 4, 5]])
H.add_nodes_from(
        [
            (1, {"color": "red", "name": "horse"}),
            (2, {"color": "blue", "name": "pony"}),
            (3, {"color": "yellow", "name": "zebra"}),
            (4, {"color": "red", "name": "orangutan", "age": 20}),
            (5, {"color": "blue", "name": "fish", "age": 2}),
        ]
    )

# Attributes can be accessed via the stats package
H.nodes.attrs('color').asdict()
# -> {1: 'red', 2: 'blue', 3: 'yellow', 4: 'red', 5: 'blue'}

# .mean() is defined, just like with any other NodeStat...
H.nodes.attrs('color').mean
# -> <bound method IDStat.mean of NodeStat('attrs', args=('color',))>

# but of course it raises an exception.
H.nodes.attrs('color').mean()
# -> TypeError: cannot perform reduce with flexible type

Jul 22 '22 07:07 leotrs

My main question is: what should we do? Should we subclass NodeStat and define a CategoricalNodeStat that doesn't implement .mean()? Or should we add a check to NodeStat.mean to make sure it's not trying to compute the mean of string-valued stats? Or something else?

Jul 22 '22 07:07 leotrs

I would add a check inside NodeStat.mean, raising an error. We already have quite a few classes, and in this case the fix seems easy enough to not make me want to add one more.

Aug 04 '22 13:08 maximelucas

The problem is that there are quite a few such methods: NodeStat.mean, NodeStat.min, NodeStat.max, NodeStat.std, and one might imagine implementing other similar ones in the future. Should each of these make the check every time? :thinking:

I do agree that we already have quite a few classes tho.

Aug 04 '22 13:08 leotrs

Personally I would make the check in each of these. The check can probably hold in 2-5 lines I guess, and we have 4 methods We might add a few more but I don't expect an explosion of that number any time soon? If we realise it becomes a problem, because we do have an explosion of methods, we can always change our strategy then. That would be my strategy at least :)

Aug 04 '22 13:08 maximelucas

@maximelucas thank you, you have convinced me :) I will come back to implement this next week.

Aug 11 '22 12:08 leotrs

After today's meeting, this was the consensus:

user-provided stats go into H.nodes.attrs
structural (i.e. NOT user-provided) stats go into their own stat e.g. H.nodes.degree
either can be numerical or categorical
all attrs are separate: For now, to access two different attributes we must do H.nodes.attrs("color") and H.nodes.attrs("name"), that is both color and name are being accessed via the same NodeStat object attrs. This is bad. Change this.
If a user tries to compute the mean of a categorical attribute, perhaps an easier path is to issue a warning and return NaN instead of refusing to compute it by raising an exception.
implement H.add_attribute (instead of adding attributes via add_nodes_from). This should allow the user to specify the type of each attribute only once, instead of once per node.

The current open PR #157 does not address this issues so I'm closing it.

Oct 14 '22 14:10 leotrs

xgi xgi copied to clipboard

Differentiate between numeric and categorical statistics

xgi
xgi copied to clipboard