xgi icon indicating copy to clipboard operation
xgi copied to clipboard

add density function

Open maximelucas opened this issue 2 years ago • 15 comments

We could add that with the group size as argument. This would go in xgi/classes/function.py?

maximelucas avatar Oct 03 '22 08:10 maximelucas

Yes for now I'd rather have it work as xgi.density(H).

leotrs avatar Oct 03 '22 13:10 leotrs

I was about to implement this an realize I don't know how the density is defined :sweat_smile:

@maximelucas is the density at group size d equal to the number of hyperedges of size d divided by nCd where n is the number of nodes? (nCd is n choose k).

Is there a global density? Is it defined as e / 2^n where e is the total number of edges?

leotrs avatar Oct 27 '22 15:10 leotrs

I've never looked for a definition but that's what I was thinking to do for size d, yes.

For the global density, I would've just summed over d. 2^n is not the total number of potential hyperedges, is it?

In HyperNetX they do it as (https://pnnl.github.io/HyperNetX/build/_modules/reports/descriptive_stats.html#dist_stats):

  • nrows = number of nodes (rows in the incidence matrix)
  • ncols = number of edges (columns in the incidence matrix)
  • ncells = number of filled cells in incidence matrix
  • density = ncells/(nrows*ncols)

maximelucas avatar Oct 28 '22 07:10 maximelucas

For the global density, I would've just summed over d. 2^n is not the total number of potential hyperedges, is it?

Yes, it is the total number of possible subsets of a set of size n.

leotrs avatar Oct 28 '22 07:10 leotrs

density = ncells/(nrows*ncols)

This is the density of the incidence matrix, interesting. I guess this is more an indicator of how big the hyperedges are rather than an indicator of how many hyperedges there are. For example, this density would equal 1 if the hypergraph has a single hyperedge that contains all nodes.

leotrs avatar Oct 28 '22 07:10 leotrs

Ok yea my bad. Although 2^n counts the empty set which I assume we don't want. So I'd say 2^n -1? (It also counts all singletons edges which I always feel a bit weird about but oh well) For some reason Wolfram gave me a more complicated-looking answer when I asked to sum (n choose k)'s.

I agree about the other density. Also, when having multiple edges, it tells you how much their they intersect in terms of nodes, I think. Maybe we can implement it with another name, incidence_density or something.

maximelucas avatar Oct 28 '22 07:10 maximelucas

https://www.wolframalpha.com/input?i=sum+for+k+%3D+0+to+n+of+%28n+choose+k%29

leotrs avatar Oct 28 '22 08:10 leotrs

I did it wrong 👍🏻

maximelucas avatar Oct 28 '22 09:10 maximelucas

So I'd say 2^n -1?

That sounds like a good idea.

Also, when having multiple edges, it tells you how much their they intersect in terms of nodes, I think.

I don't think this is true. If you have exactly two edges of size a and b, on n nodes, then this density is always equal to (a+b)/2n, regardless of whether the edges are overlapping or not.

Maybe we can implement it with another name, incidence_density or something.

Yes I agree we should also implement it!

leotrs avatar Oct 28 '22 10:10 leotrs

I guess in a way this density tells you how many stubs there are in the bipartite representation of the hypergraph divided by the total possible number of stubs without changing the number of nodes or edges.

The other density, the one that divides by 2^n assumes the total possible number of edges.

leotrs avatar Oct 28 '22 10:10 leotrs

That makes sense!

I would add also an option to not count singleton edges, like count_singletons=False, in which case, I would divide by 2^n-1 - n. Otherwise the density of a complete graph which has no singleton edges would not be 1.

We could also have a max_order argument for the global density. If I have a hypergraph with 100 nodes, I will rarely want to know its global density up to order 99.

maximelucas avatar Oct 28 '22 11:10 maximelucas

So this is what we're looking at now:

xgi.density(H)
# -> number of edges / (2^n - 1)

xgi.density(H, max_order=k)
# -> number of edges up to order k / (sum(nCd for d in range(1, k+1)) - 1)

xgi.density(H, max_order=k, count_singletons=False)
# -> number of edges up to order k / (sum(nCd for d in range(1, k+1)) - 1 - n)

xgi.density(H, order=k)
# -> number of edges of order k / nCk

xgi.density(H, incidence=True)
# -> number of nnz entries of the incidence matrix / m*n

xgi.density(H, max_order=k, incidence=True)
# -> same as above but only considering the columns of edges with order up to k

leotrs avatar Oct 28 '22 11:10 leotrs

I don't think this is true. If you have exactly two edges of size a and b, on n nodes, then this density is always equal to (a+b)/2n, regardless of whether the edges are overlapping or not.

Are you sure? Say you have two triangles.

  • if not overlapping, say [1,2,3], [4,5,6]:
    • nrows * ncols = 6*2
    • ncells = 6
    • density = 1/2 (= your (a+b)/2)
  • if overlapping, say [1,2,3], [2,3,4]:
    • nrows * ncols = 4*2
    • ncells = 6
    • density = 3/4 (not equal to (a+b)/2)

maximelucas avatar Oct 28 '22 11:10 maximelucas

Great summary! I'm just unsure between having incidence as an argument, and having a separate function density_incidence(). In the first case it might get complicated to explain all the cases in the docs.

maximelucas avatar Oct 28 '22 11:10 maximelucas

That's a good point. I guess I was thinking of a case where the two hypergraphs have the same number of nodes (in which case one would have isolated nodes). If the second hypergraph of your example also has six nodes, then the densities are the same.

(And I guess this also assumes that the incidence matrix has all-zero rows for isolated nodes, which I think it does now in our current implementation.)

leotrs avatar Oct 28 '22 11:10 leotrs

Partially solved by #204. density_incidence not implemented yet.

maximelucas avatar Oct 31 '22 08:10 maximelucas