casper-node
casper-node copied to clipboard
Non-det failure of `run_equivocator_network` test.
https://drone-auto-casper-network.casperlabs.io/casper-network/casper-node/1453/3/3 failed with
failures:
---- reactor::participating::tests::run_equivocator_network stdout ----
thread 'reactor::participating::tests::run_equivocator_network' panicked at 'assertion failed: `(left == right)`
left: `[]`,
right: `[PublicKey::Ed25519(5bc5542a73a3fd4e53c2f3c84838156cf8ce5de0db828a5644b1c6d883056aa6)]`', node/src/reactor/participating/tests.rs:319:13
source: https://github.com/casper-network/casper-node/blob/dev/node/src/reactor/participating/tests.rs#L319
4.11.23 can we create equivocation through the diagnostic port? Goal: make this deterministic and then address. Use the equivocation mechanism available in Zug
As discussed, I can't reproduce this, but added more asserts to the test in #1862 in case it happens again.
Reopening since it happened again:
---- reactor::participating::tests::run_equivocator_network stdout ----
thread 'reactor::participating::tests::run_equivocator_network' panicked at 'Alice should have been evicted.', node/src/reactor/participating/tests.rs:304:5
==============================================================
Thread: reactor::participating::tests::run_equivocator_network
To reproduce failure, try running with env var:
CL_TEST_SEED=fd02296c9f70cb27da4531fd5207b884
==============================================================
That means the test ran for 20 eras and Alice didn't equivocate. I guess having two nodes with the same key simply doesn't guarantee an equivocation (which is a good thing!). For this test we'll somehow need to ensure that it actually happens.
I think the least invasive way to make Alice equivocate is to delay all messages to and from her clone by one round length: She'd then send a witness unit citing the first proposed block, and the clone should send one that doesn't. In the second round, the clone's messages are delivered and all nodes detect the equivocation. The tests could be simplified to always expect an equivocation in the first era.
One way to implement this is to wrap the reactor in a new FilteringReactor that in dispatch_event applies a configurable filter
Box<dyn Fn(Event) -> Either<Effects<Event>, Event>
and either returns the effects directly or forwards to the inner reactor's dispatch_event.
In run_equivocator_network the filter would wrap all events containing IncomingMessage or NetworkRequest in
self.effect_builder.set_timeout(round_len).event(move |_| event)
to delay them.
I think the least invasive way to make Alice equivocate is to delay all messages to and from her clone by one round length: She'd then send a witness unit citing the first proposed block, and the clone should send one that doesn't. In the second round, the clone's messages are delivered and all nodes detect the equivocation. The tests could be simplified to always expect an equivocation in the first era.
One way to implement this is to wrap the reactor in a new
FilteringReactorthat indispatch_eventapplies a configurable filterBox<dyn Fn(Event) -> Either<Effects<Event>, Event>and either returns the effects directly or forwards to the inner reactor's
dispatch_event.In
run_equivocator_networkthe filter would wrap allevents containingIncomingMessageorNetworkRequestinself.effect_builder.set_timeout(round_len).event(move |_| event)to delay them.
I like the idea. ~~Similar but a different approach would be to expose map/flat_map method on the Network itself that would forward the function to the map/flat_map methods on the Effect type.~~.
EDIT: My approach wouldn't actually work as it would operate on the Effects but we need to access Event.
Yes, Effects are futures, and you can't match on them.
Yes,
Effects are futures, and you can'tmatchon them.
There's already a map on Effect type so what I was proposing was adding a map on Network call call each individual Effect::map and pass in the function from the outer Network. That still wouldn't be equivalent to your proposal though.
Ah, right, then I think it could work. It might be a bit more complicated, though.
Reopening, as it happened again in https://github.com/casper-network/casper-node/pull/1935: https://drone-auto-casper-network.casperlabs.io/casper-network/casper-node/1744/3/3
We doubled the test's era height in https://github.com/casper-network/casper-node/pull/1935, so nodes have more time to detect the equivocation.
Looks like that didn't work; it failed again.
Experienced this as well, just fyi: https://drone-auto-casper-network.casperlabs.io/casper-network/casper-node/1988
@goral09 is this fixed?
Closing this, until it shows up again.
…and it did: https://drone-auto-casper-network.casperlabs.io/casper-network/casper-node/2692/3/3 So Alice failed to equivocate again: reactor/participating/tests.rs#L361
FYI, Joe is seeing this also fail non-deterministically in the 1.4.0 release branch: https://drone-auto-casper-network.casperlabs.io/casper-network/casper-node/2748/3/3
another failure: https://drone-auto-casper-network.casperlabs.io/casper-network/casper-node/2760/3/3
And another one: https://drone-auto-casper-network.casperlabs.io/casper-network/casper-node/3082/3/3
This test is going to be temporarily disabled. It fails so often, that we tend to just retry the build and don't investigate if it uncovered a real problem.
@Fraser999 @piotr-dziubecki @EdHastingsCasperLabs We think this test is very important, though, so please consider increasing priority of the ticket.
As discussed I will remove the #[ignore] and temporarily disable only the flaky assertion for now.
To give some more context about the test in general: This is meant to simulate an equivocation, by having Alice run two nodes, as well as inactivity (on feat-fast-sync-v2), by having Bob run none.
The difficulty is to make Alice's nodes actually double-sign something despite the precautions in the code that defend against just that: The test tries to keep the two nodes (both configured with Alice's secret key) separated until they both actually signed something without knowing about the other. To do that, it delays incoming and outgoing messages in one of her nodes. It also needs to recognize messages that actually can't be part of a pair of conflicting signatures, like Highway's Pings: These still need to be delayed, so that the nodes don't learn about each other, but they mustn't be taken as a sign that the desired equivocation has been achieved.
Equivocation is possible:
- In Highway by signing two different
Units with the same sequence number. - In Zug by signing two
Echos with a different hash, or both aVote(true)andVote(false), in the same round.
Maybe add a bounty for such time-intensive/frustrating/sporadic bugs.
This may be addressed now, depending on how #4551 shakes out. If that makes it into dev and no more failures occur, we can probably consider this addressed.
Potentially relevant commits: