xgi icon indicating copy to clipboard operation
xgi copied to clipboard

Add features to `degree_histogram`

Open nwlandry opened this issue 2 years ago • 6 comments

I wanted to easily visualize the degree/edge size distribution and the way that I found easiest to get both was the following:

degrees = dict(H.degree())
edge_sizes = dict(H.edge_size())

d, count_d = np.unique(list(degrees.values()), return_counts=True)
p_d = count_d/len(degrees)
e, count_e = np.unique(list(edge_sizes.values()), return_counts=True)
p_e = count_e/len(edge_sizes)

Currently, there is a way to do some of this with degree_histogram, but I would like to be able to a) specify either edge sizes or degrees and b) specify whether it is normalized. And perhaps we should return numpy arrays instead of lists.

nwlandry avatar Jun 03 '22 19:06 nwlandry

I was just thinking about this. The NodeStat functionality is going to be merged soon, and something I don't have implemented yet (but should) is precisely the capability to compute the distribution of a particular stat:

H.nodes.degree.asnumpy()
# -> array([d1, d2, ...])

H.nodes.degree.asdist()
# -> same output as np.histogram(array([d1, d2, ...]), density=True)

(I believe np.histogram(..., density=True) is essentially the same as what you are computing here (except maybe it assumes a continuous distribution?))

Implementing this once would allow us to also have an easy way to compute the distribution of edge sizes,

H.edges.size.asdist()

.. or of any other NodeStat or EdgeStat..

leotrs avatar Jun 03 '22 20:06 leotrs

I like this idea! I would prefer that we use np.unique because unless I'm mistaken, np.histogram uses bin edges instead of centers (although maybe there's a workaround?)

nwlandry avatar Jun 03 '22 20:06 nwlandry

Oh good point. Well I think in that case we can have two methods: H.nodes.degree.ashist(), returning bin centers and absolute/relative counts, and then H.nodes.degree.asdist(), returning bin edges and normalized counts. The first one would always assumes a discrete distribution and the latter one assumes a continuous distribution and also implements more complicated binning strategies like log-binning and whatnot.

leotrs avatar Jun 03 '22 20:06 leotrs

Coming back to this, I think I would like to implement the following, with the default output type being a dataframe with one row per bin:

H.nodes.degree.asdist()
# bin_center count
# 10         1243
# 20         343
# 30         12
# 40         1

This would allow us to do H.nodes.degree.asdist().plot() to plot the degree distribution in a single line.

Also, I'd like a parameter "bins" that behaves similarly to that of the same name of np.hist:

H.nodes.degree.asdist(bins=[list_of_bin_edges])
H.nodes.degree.asdist(bins=number_of_bins)
H.nodes.degree.asdist(bins="log")    # not supported by np.hist, but we may want to implement log-binning

Finally, I'd like another parameter to also output the bin edges in the dataframe:

H.nodes.degree.asdist(bin_edges=True)
# bin_center count  bin_lo  bin_hi
# 10         1243   1       15
# 20         343    15      25
# 30         12     25      45
# 40         1      35      45

Note that this functionality would be immediately available for each NodeStat/EdgeSTat. If this sounds good to yall then I'll go ahead and implement this.

Note to self: if this gets implemented, we should delete degree_histogram.

leotrs avatar Jul 28 '22 14:07 leotrs

Looks powerful. Would certainly be useful.

A few small thoughts:

  • I'm still unsure about the call syntax that is long with so many dots (not just for this one probably). Any thoughts?
  • why return a DataFrame rather than numpy arrays like np.hist? It does look pretty in notebooks. For consistency, do we have any "rule" as to when returning one or the other in XGI? I don't remember where else we return df right now.
  • dist I guess is for distribution but makes me think first of distance :) still wanted to use hist for something else?

maximelucas avatar Oct 27 '22 07:10 maximelucas

  • I'm still unsure about the call syntax that is long with so many dots (not just for this one probably). Any thoughts?

I agree I don't like it terribly. The only alternative I can come up with is to define a helper function in xgi/classes/function that literally just makes the long call:

def degree_dist(H, **kwargs):
    return H.nodes.degree.asdist(**kwargs)

It would make sense to do this for e.g. degree but I would be against doing this for any other NodeStat, so we would still have to do e.g. H.nodes.clustering.asdist() manually.

  • why return a DataFrame rather than numpy arrays like np.hist? It does look pretty in notebooks. For consistency, do we have any "rule" as to when returning one or the other in XGI? I don't remember where else we return df right now.

There is not rule right now but it would be nice to have one in the future. I don't think we return a dataframe by default at any other point in the library.

My original reason for returning a df here is that I hate the output returned by np.histogram. It's a tuple of two arrays: the first array contains the bin heights and the second one contains the bin edges. The reason it's a tuple of two arrays instead of a single 2D array is because these arrays have different lengths (there is one more bin edge than there are bin heights). It just seems awkward to me and AFAICT this is one of the (very?) few instances where numpy returns two arrays at the same time. It forces me to do something like heights, bin_edges = np.histogram(...) or even worse heights, bin_edges = np.histogram(bins=list_of_bin_edges) i.e. sometimes it returns exactly what you give it as input :vomiting_face: ... Finally, you can never do something like plt.plot(np.histogram(...)) exactly because it returns two arrays of different lengths. (Yes I know about plt.hist but if np.histogram was nicer, there would probably be no need for plt.hist in the first place.....)

If we return a df, there are none of these problems and we can cleanly do asdist().plot() :heart_eyes:

  • dist I guess is for distribution but makes me think first of distance :) still wanted to use hist for something else?

I'm not partial to asdist and ashist might be a better name indeed!

leotrs avatar Oct 27 '22 12:10 leotrs

The only alternative I can come up with is to define a helper function in xgi/classes/function that literally just makes the long call:

Agreed, we could do that for a few selected ones that are very common to use, but not all.

If we return a df, there are none of these problems and we can cleanly do asdist().plot() 😍

Agreed with the advantages.

There is not rule right now but it would be nice to have one in the future.

But exactly, we probably want to think about it more consistently if we start using it more in other cases. Probably fine for now.

maximelucas avatar Nov 07 '22 12:11 maximelucas

There is not rule right now but it would be nice to have one in the future.

But exactly, we probably want to think about it more consistently if we start using it more in other cases. Probably fine for now.

One quick thought is that I find returning two arrays of different lengths ugly and undesirable and a big NOPENOPENOPE.

leotrs avatar Nov 07 '22 19:11 leotrs