xgi icon indicating copy to clipboard operation
xgi copied to clipboard

Differentiate between numeric and categorical statistics

Open leotrs opened this issue 2 years ago • 6 comments

Currently, the stats package treats all NodeStat and EdgeStat objects the same. But some of them, e.g. node attributes, tend to be string-valued and thus categorical. In particular, the .mean(), .max(), .std() and other similar methods shouldn't be defined for categorical statistics. For example:

H = xgi.Hypergraph([[1, 2, 3], [2, 3, 4, 5], [3, 4, 5]])
H.add_nodes_from(
        [
            (1, {"color": "red", "name": "horse"}),
            (2, {"color": "blue", "name": "pony"}),
            (3, {"color": "yellow", "name": "zebra"}),
            (4, {"color": "red", "name": "orangutan", "age": 20}),
            (5, {"color": "blue", "name": "fish", "age": 2}),
        ]
    )

# Attributes can be accessed via the stats package
H.nodes.attrs('color').asdict()
# -> {1: 'red', 2: 'blue', 3: 'yellow', 4: 'red', 5: 'blue'}

# .mean() is defined, just like with any other NodeStat...
H.nodes.attrs('color').mean
# -> <bound method IDStat.mean of NodeStat('attrs', args=('color',))>

# but of course it raises an exception.
H.nodes.attrs('color').mean()
# -> TypeError: cannot perform reduce with flexible type

leotrs avatar Jul 22 '22 07:07 leotrs

My main question is: what should we do? Should we subclass NodeStat and define a CategoricalNodeStat that doesn't implement .mean()? Or should we add a check to NodeStat.mean to make sure it's not trying to compute the mean of string-valued stats? Or something else?

leotrs avatar Jul 22 '22 07:07 leotrs

I would add a check inside NodeStat.mean, raising an error. We already have quite a few classes, and in this case the fix seems easy enough to not make me want to add one more.

maximelucas avatar Aug 04 '22 13:08 maximelucas

The problem is that there are quite a few such methods: NodeStat.mean, NodeStat.min, NodeStat.max, NodeStat.std, and one might imagine implementing other similar ones in the future. Should each of these make the check every time? :thinking:

I do agree that we already have quite a few classes tho.

leotrs avatar Aug 04 '22 13:08 leotrs

Personally I would make the check in each of these. The check can probably hold in 2-5 lines I guess, and we have 4 methods We might add a few more but I don't expect an explosion of that number any time soon? If we realise it becomes a problem, because we do have an explosion of methods, we can always change our strategy then. That would be my strategy at least :)

maximelucas avatar Aug 04 '22 13:08 maximelucas

@maximelucas thank you, you have convinced me :) I will come back to implement this next week.

leotrs avatar Aug 11 '22 12:08 leotrs

After today's meeting, this was the consensus:

  • user-provided stats go into H.nodes.attrs
  • structural (i.e. NOT user-provided) stats go into their own stat e.g. H.nodes.degree
  • either can be numerical or categorical
  • all attrs are separate: For now, to access two different attributes we must do H.nodes.attrs("color") and H.nodes.attrs("name"), that is both color and name are being accessed via the same NodeStat object attrs. This is bad. Change this.
  • If a user tries to compute the mean of a categorical attribute, perhaps an easier path is to issue a warning and return NaN instead of refusing to compute it by raising an exception.
  • implement H.add_attribute (instead of adding attributes via add_nodes_from). This should allow the user to specify the type of each attribute only once, instead of once per node.

The current open PR #157 does not address this issues so I'm closing it.

leotrs avatar Oct 14 '22 14:10 leotrs