tmkms icon indicating copy to clipboard operation
tmkms copied to clipboard

Validator disconnect on step regression

Open mdyring opened this issue 5 years ago • 7 comments

In below log, 5 validators are connected to single tmkms process on same chain-id. It appears that during a Proposal, some validators (.240 and .108 below) get disconnected due to step regression.

Is this as expected? It does not seem to happen with validators attempting to double sign with different block ids, so I would think not.

Disconnecting seems undesirable when trying to do HA. ;-)

tmkms[21952]: 18:52:22 [info] [gaia-13004@tcp://18.196.17.223:26659] signed Proposal:60E46FCB09 at h/r/s 88801/0/3 (0 ms)
tmkms[21952]: 18:52:22 [error] [gaia-13004:tcp://35.156.81.246:26659] attempted double sign at h/r/s: 88801/0/3 (60E46FCB09 != 85CC5AD4DD)
tmkms[21952]: 18:52:22 [error] [gaia-13004:tcp://54.93.169.244:26659] attempted double sign at h/r/s: 88801/0/3 (60E46FCB09 != 9DFE0E1AE9)
tmkms[21952]: 18:52:22 [info] [gaia-13004@tcp://18.196.17.223:26659] signed PreVote:60E46FCB09 at h/r/s 88801/0/6 (0 ms)
tmkms[21952]: 18:52:22 [error] [gaia-13004@tcp://35.158.126.240:26659] attempted double sign: step regression: round regression at height:88801 round:0 last step:6 new step:3
tmkms[21952]: 18:52:22 [error] [gaia-13004@tcp://18.196.63.108:26659] attempted double sign: step regression: round regression at height:88801 round:0 last step:6 new step:3
tmkms[21952]: 18:52:22 [info] [gaia-13004@tcp://35.156.81.246:26659] signed PreVote:60E46FCB09 at h/r/s 88801/0/6 (0 ms)
tmkms[21952]: 18:52:22 [info] [gaia-13004@tcp://54.93.169.244:26659] signed PreVote:60E46FCB09 at h/r/s 88801/0/6 (0 ms)
tmkms[21952]: 18:52:22 [info] [gaia-13004@tcp://35.156.81.246:26659] signed PreCommit:60E46FCB09 at h/r/s 88801/0/6 (0 ms)
tmkms[21952]: 18:52:22 [info] [gaia-13004@tcp://18.196.17.223:26659] signed PreCommit:60E46FCB09 at h/r/s 88801/0/6 (0 ms)
tmkms[21952]: 18:52:22 [info] [gaia-13004@tcp://54.93.169.244:26659] signed PreCommit:60E46FCB09 at h/r/s 88801/0/6 (0 ms)
tmkms[21952]: 18:52:23 [info] KMS node ID: 20E2B9BE81DD6CBB410B14FB59EBD4DBC9E0228C
tmkms[21952]: 18:52:23 [info] KMS node ID: 20E2B9BE81DD6CBB410B14FB59EBD4DBC9E0228C
tmkms[21952]: 18:52:23 [warn] [gaia-13004] 35.158.126.240:26659: unverified validator peer ID! (ADCA89A67DAA5FFC2D2EFFD8AD33B1DC8DCD9466)
tmkms[21952]: 18:52:23 [info] [gaia-13004@tcp://35.158.126.240:26659] connected to validator successfully
tmkms[21952]: 18:52:23 [warn] [gaia-13004] 18.196.63.108:26659: unverified validator peer ID! (00ACD9BDBE36B2933231545AEBF1D799DD06D83B)
tmkms[21952]: 18:52:23 [info] [gaia-13004@tcp://18.196.63.108:26659] connected to validator successfully

mdyring avatar Jul 28 '19 18:07 mdyring

@mdyring I'd suggest opening a corresponding issue on https://github.com/tendermint/tendermint since I do not believe the KMS is closing the connection here, but rather my best guess is Tendermint is closing the connection after receiving a double-signing error response. It'd be good to confirm these are both the case, though.

tarcieri avatar Jul 29 '19 17:07 tarcieri

@tarcieri done, thanks. From looking at Tendermint log it looks like it sees the connection close, I have added log output to https://github.com/tendermint/tendermint/issues/3844.

mdyring avatar Jul 29 '19 19:07 mdyring

Here's the relevant KMS codepath. It returns an error response to Tendermint, but otherwise considers the request successfully handled and therefore it shouldn't abort the connection.

You can also turn on debug logging on the KMS by passing the -v or --verbose option to tmkms start.

tarcieri avatar Jul 29 '19 19:07 tarcieri

From looking at Tendermint log it looks like it sees the connection close, I have added log output to tendermint/tendermint#3844.

I looked at the attached log output but I don't see what you're talking about re: Tendermint showing the connection being remotely closed by the KMS.

tarcieri avatar Jul 29 '19 19:07 tarcieri

Sorry, should have been clearer: First line has a "EOF" at the end, indicating the connection was closed.

This seems to indicate the connection closure is unexpected for Tendermint, but it could also just be a case of Tendermint closing the connection earlier and then failing on a read later and logging that.

mdyring avatar Jul 29 '19 20:07 mdyring

Here's the relevant KMS codepath.

Isn't this only for the StateErrorKind::DoubleSign ? Martin's error seems to be of StateErrorKind::StepRegression ie. https://github.com/tendermint/kms/blob/3b3aabf8eab64959dad29ee2d5b747dc5fe813ac/src/chain/state.rs#L92-L99 which means sign would return an error, which I think exits the request_loop ?

ebuchman avatar Aug 23 '19 01:08 ebuchman

@ebuchman good catch, that appears to be what's happening.

Perhaps the KMS should send back an error to Tendermint on StateErrorKind::StepRegression as well.

tarcieri avatar Aug 25 '19 18:08 tarcieri