xgi
xgi copied to clipboard
Differentiate between numeric and categorical statistics
Currently, the stats package treats all NodeStat and EdgeStat objects the same. But some of them, e.g. node attributes, tend to be string-valued and thus categorical. In particular, the .mean(), .max(), .std()
and other similar methods shouldn't be defined for categorical statistics. For example:
H = xgi.Hypergraph([[1, 2, 3], [2, 3, 4, 5], [3, 4, 5]])
H.add_nodes_from(
[
(1, {"color": "red", "name": "horse"}),
(2, {"color": "blue", "name": "pony"}),
(3, {"color": "yellow", "name": "zebra"}),
(4, {"color": "red", "name": "orangutan", "age": 20}),
(5, {"color": "blue", "name": "fish", "age": 2}),
]
)
# Attributes can be accessed via the stats package
H.nodes.attrs('color').asdict()
# -> {1: 'red', 2: 'blue', 3: 'yellow', 4: 'red', 5: 'blue'}
# .mean() is defined, just like with any other NodeStat...
H.nodes.attrs('color').mean
# -> <bound method IDStat.mean of NodeStat('attrs', args=('color',))>
# but of course it raises an exception.
H.nodes.attrs('color').mean()
# -> TypeError: cannot perform reduce with flexible type
My main question is: what should we do? Should we subclass NodeStat
and define a CategoricalNodeStat
that doesn't implement .mean()
? Or should we add a check to NodeStat.mean
to make sure it's not trying to compute the mean of string-valued stats? Or something else?
I would add a check inside NodeStat.mean
, raising an error. We already have quite a few classes, and in this case the fix seems easy enough to not make me want to add one more.
The problem is that there are quite a few such methods: NodeStat.mean
, NodeStat.min
, NodeStat.max
, NodeStat.std
, and one might imagine implementing other similar ones in the future. Should each of these make the check every time? :thinking:
I do agree that we already have quite a few classes tho.
Personally I would make the check in each of these. The check can probably hold in 2-5 lines I guess, and we have 4 methods We might add a few more but I don't expect an explosion of that number any time soon? If we realise it becomes a problem, because we do have an explosion of methods, we can always change our strategy then. That would be my strategy at least :)
@maximelucas thank you, you have convinced me :) I will come back to implement this next week.
After today's meeting, this was the consensus:
- user-provided stats go into
H.nodes.attrs
- structural (i.e. NOT user-provided) stats go into their own stat e.g.
H.nodes.degree
- either can be numerical or categorical
- all attrs are separate: For now, to access two different attributes we must do
H.nodes.attrs("color")
andH.nodes.attrs("name")
, that is both color and name are being accessed via the same NodeStat objectattrs
. This is bad. Change this. - If a user tries to compute the mean of a categorical attribute, perhaps an easier path is to issue a warning and return NaN instead of refusing to compute it by raising an exception.
- implement
H.add_attribute
(instead of adding attributes viaadd_nodes_from
). This should allow the user to specify the type of each attribute only once, instead of once per node.
The current open PR #157 does not address this issues so I'm closing it.