casper-node icon indicating copy to clipboard operation
casper-node copied to clipboard

Non-det failure of `run_equivocator_network` test.

Open goral09 opened this issue 4 years ago • 22 comments

https://drone-auto-casper-network.casperlabs.io/casper-network/casper-node/1453/3/3 failed with

failures:

---- reactor::participating::tests::run_equivocator_network stdout ----
thread 'reactor::participating::tests::run_equivocator_network' panicked at 'assertion failed: `(left == right)`
  left: `[]`,
 right: `[PublicKey::Ed25519(5bc5542a73a3fd4e53c2f3c84838156cf8ce5de0db828a5644b1c6d883056aa6)]`', node/src/reactor/participating/tests.rs:319:13

source: https://github.com/casper-network/casper-node/blob/dev/node/src/reactor/participating/tests.rs#L319

4.11.23 can we create equivocation through the diagnostic port? Goal: make this deterministic and then address. Use the equivocation mechanism available in Zug

goral09 avatar Aug 05 '21 17:08 goral09

As discussed, I can't reproduce this, but added more asserts to the test in #1862 in case it happens again.

afck avatar Aug 06 '21 15:08 afck

Reopening since it happened again:

---- reactor::participating::tests::run_equivocator_network stdout ----
thread 'reactor::participating::tests::run_equivocator_network' panicked at 'Alice should have been evicted.', node/src/reactor/participating/tests.rs:304:5

==============================================================
Thread: reactor::participating::tests::run_equivocator_network
To reproduce failure, try running with env var:
CL_TEST_SEED=fd02296c9f70cb27da4531fd5207b884
==============================================================

That means the test ran for 20 eras and Alice didn't equivocate. I guess having two nodes with the same key simply doesn't guarantee an equivocation (which is a good thing!). For this test we'll somehow need to ensure that it actually happens.

afck avatar Aug 09 '21 09:08 afck

I think the least invasive way to make Alice equivocate is to delay all messages to and from her clone by one round length: She'd then send a witness unit citing the first proposed block, and the clone should send one that doesn't. In the second round, the clone's messages are delivered and all nodes detect the equivocation. The tests could be simplified to always expect an equivocation in the first era.

One way to implement this is to wrap the reactor in a new FilteringReactor that in dispatch_event applies a configurable filter

Box<dyn Fn(Event) -> Either<Effects<Event>, Event>

and either returns the effects directly or forwards to the inner reactor's dispatch_event.

In run_equivocator_network the filter would wrap all events containing IncomingMessage or NetworkRequest in

self.effect_builder.set_timeout(round_len).event(move |_| event)

to delay them.

afck avatar Aug 12 '21 10:08 afck

I think the least invasive way to make Alice equivocate is to delay all messages to and from her clone by one round length: She'd then send a witness unit citing the first proposed block, and the clone should send one that doesn't. In the second round, the clone's messages are delivered and all nodes detect the equivocation. The tests could be simplified to always expect an equivocation in the first era.

One way to implement this is to wrap the reactor in a new FilteringReactor that in dispatch_event applies a configurable filter

Box<dyn Fn(Event) -> Either<Effects<Event>, Event>

and either returns the effects directly or forwards to the inner reactor's dispatch_event.

In run_equivocator_network the filter would wrap all events containing IncomingMessage or NetworkRequest in

self.effect_builder.set_timeout(round_len).event(move |_| event)

to delay them.

I like the idea. ~~Similar but a different approach would be to expose map/flat_map method on the Network itself that would forward the function to the map/flat_map methods on the Effect type.~~.

EDIT: My approach wouldn't actually work as it would operate on the Effects but we need to access Event.

goral09 avatar Aug 12 '21 12:08 goral09

Yes, Effects are futures, and you can't match on them.

afck avatar Aug 12 '21 14:08 afck

Yes, Effects are futures, and you can't match on them.

There's already a map on Effect type so what I was proposing was adding a map on Network call call each individual Effect::map and pass in the function from the outer Network. That still wouldn't be equivalent to your proposal though.

goral09 avatar Aug 12 '21 14:08 goral09

Ah, right, then I think it could work. It might be a bit more complicated, though.

afck avatar Aug 12 '21 14:08 afck

Reopening, as it happened again in https://github.com/casper-network/casper-node/pull/1935: https://drone-auto-casper-network.casperlabs.io/casper-network/casper-node/1744/3/3

afck avatar Aug 19 '21 07:08 afck

We doubled the test's era height in https://github.com/casper-network/casper-node/pull/1935, so nodes have more time to detect the equivocation.

afck avatar Aug 20 '21 07:08 afck

Looks like that didn't work; it failed again.

afck avatar Aug 23 '21 13:08 afck

Experienced this as well, just fyi: https://drone-auto-casper-network.casperlabs.io/casper-network/casper-node/1988

TomVasile avatar Sep 01 '21 18:09 TomVasile

@goral09 is this fixed?

piotr-dziubecki avatar Sep 16 '21 17:09 piotr-dziubecki

Closing this, until it shows up again.

afck avatar Oct 04 '21 12:10 afck

…and it did: https://drone-auto-casper-network.casperlabs.io/casper-network/casper-node/2692/3/3 So Alice failed to equivocate again: reactor/participating/tests.rs#L361

afck avatar Oct 19 '21 08:10 afck

FYI, Joe is seeing this also fail non-deterministically in the 1.4.0 release branch: https://drone-auto-casper-network.casperlabs.io/casper-network/casper-node/2748/3/3

TomVasile avatar Oct 20 '21 22:10 TomVasile

another failure: https://drone-auto-casper-network.casperlabs.io/casper-network/casper-node/2760/3/3

darthsiroftardis avatar Oct 21 '21 15:10 darthsiroftardis

And another one: https://drone-auto-casper-network.casperlabs.io/casper-network/casper-node/3082/3/3

afck avatar Nov 15 '21 18:11 afck

This test is going to be temporarily disabled. It fails so often, that we tend to just retry the build and don't investigate if it uncovered a real problem.

@Fraser999 @piotr-dziubecki @EdHastingsCasperLabs We think this test is very important, though, so please consider increasing priority of the ticket.

rafal-ch avatar Dec 03 '21 15:12 rafal-ch

As discussed I will remove the #[ignore] and temporarily disable only the flaky assertion for now.

afck avatar Dec 13 '21 16:12 afck

To give some more context about the test in general: This is meant to simulate an equivocation, by having Alice run two nodes, as well as inactivity (on feat-fast-sync-v2), by having Bob run none.

The difficulty is to make Alice's nodes actually double-sign something despite the precautions in the code that defend against just that: The test tries to keep the two nodes (both configured with Alice's secret key) separated until they both actually signed something without knowing about the other. To do that, it delays incoming and outgoing messages in one of her nodes. It also needs to recognize messages that actually can't be part of a pair of conflicting signatures, like Highway's Pings: These still need to be delayed, so that the nodes don't learn about each other, but they mustn't be taken as a sign that the desired equivocation has been achieved.

Equivocation is possible:

  • In Highway by signing two different Units with the same sequence number.
  • In Zug by signing two Echos with a different hash, or both a Vote(true) and Vote(false), in the same round.

afck avatar Dec 14 '22 14:12 afck

Maybe add a bounty for such time-intensive/frustrating/sporadic bugs.

ghost avatar Mar 18 '23 19:03 ghost

This may be addressed now, depending on how #4551 shakes out. If that makes it into dev and no more failures occur, we can probably consider this addressed.

Potentially relevant commits:

marc-casperlabs avatar Feb 23 '24 14:02 marc-casperlabs