libgraph icon indicating copy to clipboard operation
libgraph copied to clipboard

Duplicate vertex ids when adding high number of vertices

Open DavidB59 opened this issue 2 years ago • 12 comments

When adding large number of vertices, some of my vertices weren't added to the graph.

For example

vertices = 0..250000 |> Enum.map(& &1)
Graph.add_vertices(Graph.new, vertices)
#Graph<type: directed, num_vertices: 249997, num_edges: 0>

I fixed it for myself by changing the function Graph.Utils.vertex_id(v) to def vertex_id(v), do: v instead of def vertex_id(v), do: :erlang.phash2(v, @max_phash)

Apparently I had duplicate key in my graph otherwise.

DavidB59 avatar Dec 15 '21 08:12 DavidB59

@DavidB59 Advent of Code 2021, day 15, am I right? :D

I had differnet problem, when adding huge number of edges (250_000) there were some edges added which were not present at the edges list. Probably related to ^^

I fixed it by filtering out the bad edges from the graph and then deleting it:

    graph = Graph.add_edges(Graph.new(), edges(points))

    # WTF? I have no idea why those edges are created. They are not returned by edges/1 function
    # but somehow end up in the final graph. Bug in libgraph?
    fail =
      graph
      |> Graph.edges()
      |> Enum.filter(fn %Graph.Edge{v1: {x1, y1}, v2: {x2, y2}} ->
        abs(x1 - x2) > 1 and abs(y1 - y2) > 1
      end)

    graph
    |> Graph.delete_edges(fail)

But your fix is better and works for my case too.

Sgiath avatar Dec 15 '21 09:12 Sgiath

This PR: https://github.com/bitwalker/libgraph/pull/43 allows user-defined vertex IDs, so (knowing that the x,y values don't need hashing, and you've got a fixed size) you could just use y * max_x + x instead of an actual hash.

rlipscombe avatar Dec 15 '21 11:12 rlipscombe

I just want to say, I had exactly the same issue, I, too, was trying to figure out why Advent of Code 2021, day 15 was breaking for me as well, I'm glad I found this.!

joegiralt avatar Dec 15 '21 19:12 joegiralt

#43 has been merged, and will be released in 0.15.0, which at least provides a solution for situations in which you can provide a better vertex ID than phash2.

I'm not sure why hashing a set of 250k integers produces duplicates with phash2 - the number of unique hashes is 2^32, so I suspect it is some internal detail of phash2 that is causing issues here. That hash was selected due to its flexibility and efficiency, but there are certainly use cases where a better hash is desirable, so #43 is certainly a welcome addition.

bitwalker avatar Dec 15 '21 20:12 bitwalker

#43 seems to be a good way to solve this issue, but I think there is an issue with its implementation:

Graph.Pathfinding.do_bfs/5 is still using Graph.Utils.vertex_id

https://github.com/bitwalker/libgraph/blob/f033439e2433a106e4104ed4540f67fd9381a53c/lib/graph/pathfinding.ex#L130-L136

and thus all pathfinding will fail when using a custom vertex id.

I've opened a new PR (https://github.com/bitwalker/libgraph/pull/47) to fix this.

renaudlenne avatar Dec 15 '21 22:12 renaudlenne

I think that hashing the vertex and not taking collisions into account is an incorrect approach when determining the identity of vertices. ID needs to be unique, so collisions are a no-no, right?

Since Elixir has nice value-based semantics for equality, the suggested work-around

def vertex_id(v), do: v

looks like a correct approach.

Large graphs expose the issue, but all it takes is v1 and v2 that hash to same value to cause the problem.

anttipoi avatar Dec 16 '21 06:12 anttipoi

I'm not sure why hashing a set of 250k integers produces duplicates with phash2 - the number of unique hashes is 2^32, so I suspect it is some internal detail of phash2 that is causing issues here. That hash was selected due to its flexibility and efficiency, but there are certainly use cases where a better hash is desirable, so #43 is certainly a welcome addition.

Birthday paradox raises its ugly head. A set of n elements contains n(n+1)/2 pairs of elements, or in this case 31_250_125_000. All those pairs need to avoid a collision for hashing to work.

anttipoi avatar Dec 16 '21 06:12 anttipoi

iirc, a 32-bit result only allows for 2^16 items before you exceed a 50% chance of collision, even with perfect hashing (which phash2 is decidedly not). 250K > 65K. A pluggable hashing algorithm (#43) is good because it would allow us to use, e.g., 256-bit hashing.

rlipscombe avatar Dec 16 '21 10:12 rlipscombe

@anttipoi The reason to have a notion of a "vertex id" at all is to avoid having many copies of large vertex values in the graph datastructure. By using integer values, we can keep the overall size of the graph as small as possible. Naturally, this benefit doesn't apply when you are using integers for the vertex values to begin with, which is the source of the issue here. So the pluggable hashing available in 0.15.0 provides a solution for this, but if anyone has additional improvements/feedback around this, I'm certainly open to further discussion.

bitwalker avatar Dec 16 '21 14:12 bitwalker

Im struggling with the same advent of code problem with libgraph, I can't tell if its my code that is wonky (or I am getting a similar but different bug)

https://gist.github.com/jkbbwr/eaec9b319909bd98e17feae54aca8856

I know this isn't a place for getting help with advent of code, but I can't tell if im being dumb or this is a legit bug

jkbbwr avatar Dec 17 '21 19:12 jkbbwr

I believe this issue can be closed as fixed.

stevensonmt avatar Mar 11 '24 19:03 stevensonmt

#43 has been merged, and will be released in 0.15.0, which at least provides a solution for situations in which you can provide a better vertex ID than phash2.

I'm not sure why hashing a set of 250k integers produces duplicates with phash2 - the number of unique hashes is 2^32, so I suspect it is some internal detail of phash2 that is causing issues here. That hash was selected due to its flexibility and efficiency, but there are certainly use cases where a better hash is desirable, so #43 is certainly a welcome addition.

selecting 250 000 numbers from 2^32 possibilities is almost certain, probabilistically, to create at least one clash. Using the birthday problem approximate formula:

iex(1)> prob = fn(poss, n) -> 1 - :math.exp(-((n ** 2) / (2 * poss))) end
#Function<43.65746770/2 in :erl_eval.expr/5>
iex(2)> prob.(2 ** 32, 9292)
0.010001099079572806
iex(3)> prob.(2 ** 32, 250000)
0.999308022843814
iex(4)>

99.93% chance.

Even anything above 9k has a 1% chance of a hash collision.

You probably want something bigger eg sha256

Enum.map(1..250000, fn(x) -> :crypto.hash(:sha256, :erlang.term_to_binary(x)) end)

With 256 your probability of a clash becomes tiny, and noticeable only after 10**35 or so which is way bigger than any memory.

prob = fn(poss, n) -> 1 - :math.exp(-((n ** 2) / (2 * poss))) end
Enum.map(0..40, fn(p) -> [ten_to_power: p, num_vertices: 10**p, probability_of_clash: prob.(2**256, 10**p)] end)

vegabook avatar Mar 11 '24 19:03 vegabook