xgi
xgi copied to clipboard
add density function
We could add that with the group size as argument.
This would go in xgi/classes/function.py
?
Yes for now I'd rather have it work as xgi.density(H)
.
I was about to implement this an realize I don't know how the density is defined :sweat_smile:
@maximelucas is the density at group size d
equal to the number of hyperedges of size d
divided by nCd
where n
is the number of nodes? (nCd
is n choose k).
Is there a global density? Is it defined as e / 2^n
where e
is the total number of edges?
I've never looked for a definition but that's what I was thinking to do for size d
, yes.
For the global density, I would've just summed over d
. 2^n
is not the total number of potential hyperedges, is it?
In HyperNetX they do it as (https://pnnl.github.io/HyperNetX/build/_modules/reports/descriptive_stats.html#dist_stats):
- nrows = number of nodes (rows in the incidence matrix)
- ncols = number of edges (columns in the incidence matrix)
- ncells = number of filled cells in incidence matrix
- density = ncells/(nrows*ncols)
For the global density, I would've just summed over d. 2^n is not the total number of potential hyperedges, is it?
Yes, it is the total number of possible subsets of a set of size n
.
density = ncells/(nrows*ncols)
This is the density of the incidence matrix, interesting. I guess this is more an indicator of how big the hyperedges are rather than an indicator of how many hyperedges there are. For example, this density would equal 1
if the hypergraph has a single hyperedge that contains all nodes.
Ok yea my bad. Although 2^n
counts the empty set which I assume we don't want. So I'd say 2^n -1
?
(It also counts all singletons edges which I always feel a bit weird about but oh well)
For some reason Wolfram gave me a more complicated-looking answer when I asked to sum (n choose k)'s.
I agree about the other density. Also, when having multiple edges, it tells you how much their they intersect in terms of nodes, I think. Maybe we can implement it with another name, incidence_density
or something.
https://www.wolframalpha.com/input?i=sum+for+k+%3D+0+to+n+of+%28n+choose+k%29
I did it wrong 👍🏻
So I'd say 2^n -1?
That sounds like a good idea.
Also, when having multiple edges, it tells you how much their they intersect in terms of nodes, I think.
I don't think this is true. If you have exactly two edges of size a
and b
, on n
nodes, then this density is always equal to (a+b)/2n
, regardless of whether the edges are overlapping or not.
Maybe we can implement it with another name, incidence_density or something.
Yes I agree we should also implement it!
I guess in a way this density tells you how many stubs there are in the bipartite representation of the hypergraph divided by the total possible number of stubs without changing the number of nodes or edges.
The other density, the one that divides by 2^n
assumes the total possible number of edges.
That makes sense!
I would add also an option to not count singleton edges, like count_singletons=False
, in which case, I would divide by 2^n-1 - n
. Otherwise the density of a complete graph which has no singleton edges would not be 1.
We could also have a max_order
argument for the global density. If I have a hypergraph with 100 nodes, I will rarely want to know its global density up to order 99.
So this is what we're looking at now:
xgi.density(H)
# -> number of edges / (2^n - 1)
xgi.density(H, max_order=k)
# -> number of edges up to order k / (sum(nCd for d in range(1, k+1)) - 1)
xgi.density(H, max_order=k, count_singletons=False)
# -> number of edges up to order k / (sum(nCd for d in range(1, k+1)) - 1 - n)
xgi.density(H, order=k)
# -> number of edges of order k / nCk
xgi.density(H, incidence=True)
# -> number of nnz entries of the incidence matrix / m*n
xgi.density(H, max_order=k, incidence=True)
# -> same as above but only considering the columns of edges with order up to k
I don't think this is true. If you have exactly two edges of size
a
andb
, onn
nodes, then this density is always equal to(a+b)/2n
, regardless of whether the edges are overlapping or not.
Are you sure? Say you have two triangles.
- if not overlapping, say
[1,2,3]
,[4,5,6]
:- nrows * ncols = 6*2
- ncells = 6
- density = 1/2 (= your (a+b)/2)
- if overlapping, say
[1,2,3]
,[2,3,4]
:- nrows * ncols = 4*2
- ncells = 6
- density = 3/4 (not equal to (a+b)/2)
Great summary! I'm just unsure between having incidence
as an argument, and having a separate function density_incidence()
. In the first case it might get complicated to explain all the cases in the docs.
That's a good point. I guess I was thinking of a case where the two hypergraphs have the same number of nodes (in which case one would have isolated nodes). If the second hypergraph of your example also has six nodes, then the densities are the same.
(And I guess this also assumes that the incidence matrix has all-zero rows for isolated nodes, which I think it does now in our current implementation.)
Partially solved by #204. density_incidence
not implemented yet.