
Graph isomorphism has edge-case hash collisions when using 128-bit hash


Current Behavior

In GH-3671 we discovered an example graph that intermittently fails the "self-isomorphism" test. We have not fully identified the root cause, but we speculate there is some sort of hash collision going on when calculating blank node labels for the iso-canonical mapping. In the specific example case, when it fails, there are exactly two blank nodes that differ between the two models.

A quick fix to deal with the problem has been put in place: we've started using a stronger hash key (256 bits). This mitigates the problem but is less desirable from a performance perspective. We should do a follow-up investigation of the algorithm implementation to establish whether it is indeed hash collisions causing the problem, and if so, how they occur (in theory, a 128-bit hash should be more than safe enough for most practical cases).

Our speculation is that the combination of hashes that happens as part of the algorithm somehow collapses the hash value space, but this needs more detailed investigation. A purely hypothetical illustration of such a collapse is sketched below.
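
For illustration only (this is not taken from the actual implementation): a commutative combiner can collapse the value space when values cancel each other out. XOR is the classic example, since any hash XOR-ed with itself yields zero regardless of the hash width:

// Hypothetical illustration, not the actual hashBag implementation:
// XOR is commutative (convenient for bag semantics) but lets values
// cancel out entirely, because h ^ h == 0 for every h.
public class XorCollapse {
	public static void main(String[] args) {
		long h = 0x9E3779B97F4A7C15L; // arbitrary 64-bit hash value
		System.out.println((h ^ h) == 0L); // always true
	}
}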

Expected Behavior

The isomorphism algorithm should consistently be correct in determining whether two graphs are isomorphic, using a 128-bit hash.

Steps To Reproduce

Adjust the hash algorithm used from sha256 to murmur3_128. The existing test case will start failing intermittently. A sketch of the adjustment follows.
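
For concreteness, the kind of change meant here - assuming the hashing goes through Guava's Hashing utilities; the field and class names below are illustrative, not the actual RDF4J source:

import com.google.common.hash.HashFunction;
import com.google.common.hash.Hashing;

public class HashConfig {
	// to reproduce: switch back from the 256-bit quick fix...
	// static final HashFunction HASH_FUNCTION = Hashing.sha256();
	// ...to the original 128-bit hash:
	static final HashFunction HASH_FUNCTION = Hashing.murmur3_128();
}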

Version

3.7.5

Are you interested in contributing a solution yourself?

No response

Anything else?

See discussion at https://github.com/eclipse/rdf4j/pull/3701 for more details.

abrokenjester avatar Feb 27 '22 21:02 abrokenjester

I was looking into this issue a bit and I think it is not a hash collision that is at fault here. As you mentioned, a hash collision is quite unlikely even with murmur3_128. In addition, a collision would raise the question of why it doesn't fail reliably on the first run: a hash collision should always be the same. And lastly - I wouldn't want to have to prove it - but if there were a hash collision, I would actually expect a false positive rather than a false negative.
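
To put a rough number on "quite unlikely" (my own back-of-the-envelope estimate): by the birthday bound, hashing k blank nodes with a 128-bit hash gives a collision probability of roughly k^2 / 2^129:

// Rough birthday-bound estimate for a 128-bit hash:
// P(any collision among k values) ~ k^2 / 2^129
public class BirthdayBound {
	public static void main(String[] args) {
		double k = 1e6; // even a graph with a million blank nodes...
		double p = (k * k) / Math.pow(2, 129);
		System.out.println(p); // ~1.5e-27, vanishingly small
	}
}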

First, some observations I made along the way that may also help:

  • There's even more randomness: running the test case on my machine with murmur3_128, I could not reproduce the issue, even though I ran 100 iterations a couple of times.
  • When using murmur3_32 instead, I could reproduce the issue with the pattern hinted at in the test: it rarely fails on the first run, quite frequently on the second run, and sometimes only on the third or even fourth run.

The fact that I had to lower the size of the hash almost looks like it supports the hash-collision theory, but a hash collision should still always collide the same way, not just on the second or later runs.

I think in the end the real problem is that we rely on the order of a set, which is not guaranteed. Consider the following code in hashBNodes:

// iterate over the blank nodes; getNodes() is backed by a set,
// so this iteration order is not guaranteed
for (BNode b : partitioning.getNodes()) {
	for (Statement st : m.getStatements(b, null, null)) {
		// combine the object's and predicate's hash codes from the
		// previous round into an order-sensitive tuple hash
		HashCode c = hashTuple(
				partitioning.getPreviousHashCode(st.getObject()),
				partitioning.getPreviousHashCode(st.getPredicate()),
				outgoing);
		// fold the tuple hash into the node's current hash in an
		// order-insensitive (bag) fashion
		partitioning.setCurrentHashCode(b,
				hashBag(c, partitioning.getCurrentHashCode(b)));
	}
[...]

and a BNode that has as object another BNode, e.g.

_:b1 <prop> _:b2
_:b2 <label> "label"

While the order of the statements of b1 (if there were more than one) does not matter thanks to the hashBag, the order in which the BNodes are hashed does matter. The HashCode c in the snippet above is different for b1 depending on whether b2 is hashed first or second. However, since partitioning.getNodes() is a set, this order is not guaranteed.
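
To see this outside of RDF4J, here's a small standalone demo (my own, not RDF4J code): two HashSets with the same elements can iterate in different orders, and any order-sensitive combination of hashes then produces different results:

import java.util.HashSet;
import java.util.Set;

public class SetOrderDemo {

	// an order-sensitive combiner: combine(a, b) != combine(b, a) in general
	static int combine(int acc, int next) {
		return 31 * acc + next;
	}

	public static void main(String[] args) {
		// same elements, but different initial capacities lead to different
		// bucket layouts and therefore, potentially, different iteration orders
		Set<String> s1 = new HashSet<>(16);
		Set<String> s2 = new HashSet<>(1024);
		for (String label : new String[] { "b1", "b2", "b3", "b4" }) {
			s1.add(label);
			s2.add(label);
		}
		int h1 = 0, h2 = 0;
		for (String label : s1) {
			h1 = combine(h1, label.hashCode());
		}
		for (String label : s2) {
			h2 = combine(h2, label.hashCode());
		}
		// s1.equals(s2) is true, but h1 == h2 is not guaranteed
		System.out.println(s1.equals(s2) + " " + (h1 == h2));
	}
}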

But it seems this is still not the issue in this case - at least, changing anything there did not solve it for me. There is one other place where we set the hash code of a BNode, and that is in the distinguish method. partitioning.getLowestNonTrivialPartition() returns a Collection; behind this collection is again a set, and again we iterate through that set. In combination with the fact that hashBNodes is sensitive to order, it makes a difference which of the BNodes in partitioning.getLowestNonTrivialPartition() we pick.

This theory also makes more sense, imo, since while the order of a set is not guaranteed, it probably happens to be the same almost all of the time. That's also where the randomness comes in: this quite likely almost always happens to work - until it doesn't. Why the order of the set changes for the second Model somewhere during iterations 1-4 is anyone's guess, but since it's not guaranteed it can happen - maybe it's just some JIT thing, who knows. In the end it doesn't matter: since it's not guaranteed, we cannot rely on it.

I did some experiments on this and indeed, if I impose an order on the BNode sets (I chose node IDs because it's the only thing that could be easily used), it worked for me even with murmur3_32. The problem with this is that we now rely on the order of the IDs / the order in which the blank nodes are parsed into the Models being somehow stable, which I guess is much the same problem as with the sets: while it happens to be the case, it is not actually guaranteed. So unfortunately, while the patch seems to fix it (although it remains to be proven that it actually works on your machine too), there are still somewhat fundamental flaws in the algorithm.
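
For reference, the core of the ordering idea looks roughly like this (a simplified sketch, not the exact diff - see the branch below; partitioning.getNodes() is from the snippet above, and getID() is part of org.eclipse.rdf4j.model.BNode):

// inside hashBNodes: copy the unordered node set into a list and sort it
// by blank node ID before iterating, so the iteration order is at least
// deterministic for a given parse of the model
List<BNode> ordered = new ArrayList<>(partitioning.getNodes());
ordered.sort(Comparator.comparing(BNode::getID));
for (BNode b : ordered) {
	// ... hash the node's statements as before
}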

Here are my changes: https://github.com/arminfriedl/rdf4j/compare/main...3702

I'd appreciate it if you could try this with murmur3_128 on your machine to see if it actually helps. This is really hard to debug all the way through, so I have no 100% proof that this is the issue.

arminfriedl avatar Mar 06 '22 14:03 arminfriedl

Thank you for the thorough analysis @arminfriedl, really useful findings. I will try and find time to test your fix later this week. I'd also like to go back to the technical paper with these insights and double-check that I haven't missed any assumptions about order.

abrokenjester avatar Mar 06 '22 18:03 abrokenjester

I was reading through the paper - which is pretty amazing, by the way - and if I understood this correctly, hashBNodes by itself is actually perfectly safe with respect to any order.

However, for distinguish - if I got this right - they write:

In the previous example, we saw that all leaves produce the same labelled graph. It is then interesting to consider in what cases such leaves would produce different labelled graphs; in fact, finding such examples is non-trivial. While some such examples have been found [7], the constructions involved are quite intricate. We expect in most such cases that only one distinct labelled graph is found [7].

Hogan (2017), p.24

Could this be the culprit here?

I'll try to understand what the paper mentioned in [7] has to say, although "non-trivial" and "intricate" do not give me much hope I'll be able to.

arminfriedl avatar Mar 07 '22 07:03 arminfriedl

@arminfriedl I have tried your fix with murmur3_128 and can confirm that the test case no longer fails with that in place.

However, rather confusingly, I suddenly can no longer reproduce the original issue on my machine either, with murmur3_128 and no other changes...

Also, like you, I'm not sure why imposing an order on the bnodes in the partition should really make a difference.

abrokenjester avatar Mar 12 '22 03:03 abrokenjester