xgi
xgi copied to clipboard
Add features to `degree_histogram`
I wanted to easily visualize the degree/edge size distribution and the way that I found easiest to get both was the following:
degrees = dict(H.degree())
edge_sizes = dict(H.edge_size())
d, count_d = np.unique(list(degrees.values()), return_counts=True)
p_d = count_d/len(degrees)
e, count_e = np.unique(list(edge_sizes.values()), return_counts=True)
p_e = count_e/len(edge_sizes)
Currently, there is a way to do some of this with degree_histogram
, but I would like to be able to a) specify either edge sizes or degrees and b) specify whether it is normalized. And perhaps we should return numpy arrays instead of lists.
I was just thinking about this. The NodeStat functionality is going to be merged soon, and something I don't have implemented yet (but should) is precisely the capability to compute the distribution of a particular stat:
H.nodes.degree.asnumpy()
# -> array([d1, d2, ...])
H.nodes.degree.asdist()
# -> same output as np.histogram(array([d1, d2, ...]), density=True)
(I believe np.histogram(..., density=True)
is essentially the same as what you are computing here (except maybe it assumes a continuous distribution?))
Implementing this once would allow us to also have an easy way to compute the distribution of edge sizes,
H.edges.size.asdist()
.. or of any other NodeStat or EdgeStat..
I like this idea! I would prefer that we use np.unique because unless I'm mistaken, np.histogram uses bin edges instead of centers (although maybe there's a workaround?)
Oh good point. Well I think in that case we can have two methods: H.nodes.degree.ashist()
, returning bin centers and absolute/relative counts, and then H.nodes.degree.asdist()
, returning bin edges and normalized counts. The first one would always assumes a discrete distribution and the latter one assumes a continuous distribution and also implements more complicated binning strategies like log-binning and whatnot.
Coming back to this, I think I would like to implement the following, with the default output type being a dataframe with one row per bin:
H.nodes.degree.asdist()
# bin_center count
# 10 1243
# 20 343
# 30 12
# 40 1
This would allow us to do H.nodes.degree.asdist().plot()
to plot the degree distribution in a single line.
Also, I'd like a parameter "bins" that behaves similarly to that of the same name of np.hist:
H.nodes.degree.asdist(bins=[list_of_bin_edges])
H.nodes.degree.asdist(bins=number_of_bins)
H.nodes.degree.asdist(bins="log") # not supported by np.hist, but we may want to implement log-binning
Finally, I'd like another parameter to also output the bin edges in the dataframe:
H.nodes.degree.asdist(bin_edges=True)
# bin_center count bin_lo bin_hi
# 10 1243 1 15
# 20 343 15 25
# 30 12 25 45
# 40 1 35 45
Note that this functionality would be immediately available for each NodeStat/EdgeSTat. If this sounds good to yall then I'll go ahead and implement this.
Note to self: if this gets implemented, we should delete degree_histogram
.
Looks powerful. Would certainly be useful.
A few small thoughts:
- I'm still unsure about the call syntax that is long with so many dots (not just for this one probably). Any thoughts?
- why return a DataFrame rather than numpy arrays like np.hist? It does look pretty in notebooks. For consistency, do we have any "rule" as to when returning one or the other in XGI? I don't remember where else we return df right now.
-
dist
I guess is for distribution but makes me think first of distance :) still wanted to usehist
for something else?
- I'm still unsure about the call syntax that is long with so many dots (not just for this one probably). Any thoughts?
I agree I don't like it terribly. The only alternative I can come up with is to define a helper function in xgi/classes/function
that literally just makes the long call:
def degree_dist(H, **kwargs):
return H.nodes.degree.asdist(**kwargs)
It would make sense to do this for e.g. degree but I would be against doing this for any other NodeStat, so we would still have to do e.g. H.nodes.clustering.asdist()
manually.
- why return a DataFrame rather than numpy arrays like np.hist? It does look pretty in notebooks. For consistency, do we have any "rule" as to when returning one or the other in XGI? I don't remember where else we return df right now.
There is not rule right now but it would be nice to have one in the future. I don't think we return a dataframe by default at any other point in the library.
My original reason for returning a df here is that I hate the output returned by np.histogram
. It's a tuple of two arrays: the first array contains the bin heights and the second one contains the bin edges. The reason it's a tuple of two arrays instead of a single 2D array is because these arrays have different lengths (there is one more bin edge than there are bin heights). It just seems awkward to me and AFAICT this is one of the (very?) few instances where numpy returns two arrays at the same time. It forces me to do something like heights, bin_edges = np.histogram(...)
or even worse heights, bin_edges = np.histogram(bins=list_of_bin_edges)
i.e. sometimes it returns exactly what you give it as input :vomiting_face: ... Finally, you can never do something like plt.plot(np.histogram(...))
exactly because it returns two arrays of different lengths. (Yes I know about plt.hist
but if np.histogram
was nicer, there would probably be no need for plt.hist
in the first place.....)
If we return a df, there are none of these problems and we can cleanly do asdist().plot()
:heart_eyes:
dist
I guess is for distribution but makes me think first of distance :) still wanted to usehist
for something else?
I'm not partial to asdist
and ashist
might be a better name indeed!
The only alternative I can come up with is to define a helper function in
xgi/classes/function
that literally just makes the long call:
Agreed, we could do that for a few selected ones that are very common to use, but not all.
If we return a df, there are none of these problems and we can cleanly do asdist().plot() 😍
Agreed with the advantages.
There is not rule right now but it would be nice to have one in the future.
But exactly, we probably want to think about it more consistently if we start using it more in other cases. Probably fine for now.
There is not rule right now but it would be nice to have one in the future.
But exactly, we probably want to think about it more consistently if we start using it more in other cases. Probably fine for now.
One quick thought is that I find returning two arrays of different lengths ugly and undesirable and a big NOPENOPENOPE.