neo-go
consensus: add dBFT watchdog
Requested by @realloc.
Problem
dBFT is not perfect, and sometimes it hangs in a deadlock.
Solution
Add a dBFT watchdog. The watchdog tracks the time spent since the last accepted block and enforces a consensus reset on reaching the threshold.
No documentation update is included. If we decide to include this feature in master, a separate PR will be created.
Depends on #2567 and #2566.
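For context, here is a minimal sketch of how such a watchdog could look in Go. All names (`runWatchdog`, `blockHeight`, `restart`) are illustrative assumptions, not the actual PR code:

```go
// A rough sketch of the watchdog idea, not the actual PR code.
package consensus

import "time"

const restartThreshold = 15 * time.Minute // assumed default threshold

// runWatchdog periodically checks whether the chain height has grown and
// signals a consensus reset via restart when it hasn't for a full threshold.
func runWatchdog(blockHeight func() uint32, restart chan<- struct{}, quit <-chan struct{}) {
	lastHeight := blockHeight()
	t := time.NewTimer(restartThreshold)
	defer t.Stop()
	for {
		select {
		case <-t.C:
			if h := blockHeight(); h == lastHeight {
				restart <- struct{}{} // chain is stuck: ask for a consensus reset
			} else {
				lastHeight = h
			}
			t.Reset(restartThreshold)
		case <-quit:
			return
		}
	}
}
```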
The purpose of the watchdog is to prevent any chain halt situation that we couldn't foresee. That's a requirement for on-prem installs and it's not related to #2552 at all.
In on-prem installs, where neo-go is used as a database, we can't restart a neo-go instance via an external system as it affects the operations of the whole stack. At least 3 times on prod we had a situation with a locked consensus mechanism causing the whole cluster to malfunction, which was fixed by a simultaneous restart of the neo-go nodes. To prevent such situations we need this kind of watchdog doing a simultaneous consensus mechanism restart without affecting the RPC operations. As we discussed with @fyrchik and @AnnaShaleva, the restart could be tied to the nearest round minute or a similar point in time.
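A possible way to implement the "nearest round minute" synchronization point mentioned above (a hedged sketch assuming NTP-synchronized clocks; `waitForRoundMinute` is a hypothetical helper):

```go
package main

import (
	"fmt"
	"time"
)

// waitForRoundMinute blocks until the next round-minute boundary, so that
// nodes with NTP-synchronized clocks reach this point at (almost) the same
// wall-clock moment. Purely illustrative.
func waitForRoundMinute() {
	next := time.Now().Truncate(time.Minute).Add(time.Minute)
	time.Sleep(time.Until(next))
}

func main() {
	waitForRoundMinute()
	fmt.Println("restart consensus here") // hypothetical restart hook
}
```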
As we need this anyway, let's also try to attach some instructions for operations people on how to gather the most useful debug information when they hit the consensus halt situation again.
> we can't restart neo-go instance by external system

> was fixed by simultaneous neo-go nodes restart
Hmm.
My presumption is that all of the services are monitored anyway and if they are, external entities can do whatever they want if needed.
> we had a situation with locked consensus mechanism
dBFT runs over the internet on chains with many millions of blocks. We know it's not perfect, but it works more often than not. If we have any problem with dBFT it needs to be properly investigated and fixed.
The problem with this PR is that in fact we don't know how it will affect the system at all. Each node has a clock and a timer of its own; it's not the same as when you externally restart all of them within a very short time frame.
As it is written in the code comment, we need this option to be toggled in on-prem environments only.
> My presumption is that all of the services are monitored anyway and if they are, external entities can do whatever they want if needed.
For the target environment this presumption is wrong.
> dBFT runs over the internet on chains with many millions of blocks. We know it's not perfect, but it works more often than not. If we have any problem with dBFT it needs to be properly investigated and fixed.
Again, this is irrelevant for the target case.
> The problem with this PR is that in fact we don't know how it will affect the system at all.
It will give a chance for the isolated system to self-recover after an unforeseen problem. We don't plan to use it in a public CN setup.
> Each node has a clock and a timer of its own; it's not the same as when you externally restart all of them within a very short time frame.
Local time is NTP-controlled and there is no external system to restart the service. Again, we don't want a full restart because other services actively read the chain, and with this watchdog we can avoid an unnecessary service interruption.
Ready for review, testing is in progress.
Does this work for more than 15m? I think you should `resetWatchdogTimer()` after every block, and I don't see any code doing this.
That's a nice optimisation, and the case is that we shouldn't reset the timer on each block. Imagine that we start the timer from accepted block `N` with timestamp `T`. The single thing we're interested in is whether the block height has increased at the `T+15min` timestamp. If no, then block `N+1` can't be accepted -> we should reload the service. If yes, then block `N+1` (and probably blocks `N+2`, `N+3`, ..., `N+k <= CurrentHeight`) are already accepted. If `N+k < CurrentHeight` then we're stuck at block `N+k`, the latest timestamp is the timestamp of the `N+k`-th block, and what we have to do is to reset the timer wrt the `N+k` timestamp and current time to reach the overall 15-minute threshold. This diff reset value will always be > 0.
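The timer arithmetic described above could look roughly like this (a sketch; `onTimerFired`, `startHeight` and `lastBlockTime` are assumed names, not the PR code):

```go
package consensus

import "time"

const threshold = 15 * time.Minute // the 15-minute watchdog window

// onTimerFired decides what to do when the watchdog timer expires:
// startHeight is N (the height when the timer was armed), curHeight is the
// current height and lastBlockTime is the timestamp of the latest accepted
// block N+k. It returns the next timer value and whether consensus should
// be reloaded.
func onTimerFired(startHeight, curHeight uint32, lastBlockTime time.Time) (time.Duration, bool) {
	if curHeight == startHeight {
		// No blocks for the whole threshold: block N+1 can't be accepted.
		return threshold, true
	}
	// Blocks N+1..N+k were accepted meanwhile; re-arm wrt the N+k timestamp
	// so that the overall stall window is still 15 minutes. The diff is > 0
	// because N+k was accepted within the last threshold interval.
	return threshold - time.Since(lastBlockTime), false
}
```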
> reset the timer wrt the `N+k` timestamp and current time to reach the overall 15-minute threshold
I think it'd be clearer if this timer only expired when there is a need to reset. Otherwise it would be restarted with every block. At least that's what I'd expect from a watchdog-like thing, even though I can't call this one a watchdog.
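What's described here would look roughly like the following (illustrative names, not the PR code): the timer is re-armed on every accepted block, so its expiry always means a stall.

```go
package consensus

import "time"

const threshold = 15 * time.Minute // assumed watchdog window

// watchBlocks re-arms the timer on every accepted block, so <-t.C fires
// only when no block has been accepted for a full threshold. blockAccepted
// and restart are assumed hooks for illustration.
func watchBlocks(blockAccepted <-chan struct{}, restart func()) {
	t := time.NewTimer(threshold)
	defer t.Stop()
	for {
		select {
		case <-blockAccepted:
			if !t.Stop() {
				<-t.C // drain a concurrent expiry before Reset
			}
			t.Reset(threshold)
		case <-t.C:
			restart()
			t.Reset(threshold)
		}
	}
}
```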
Rebased onto the current master.
It works somehow; at least, peers are disconnected before the consensus reload and try to connect again after the new consensus instance is created:
```
2022-06-29T12:27:54.737Z INFO request change view {"view": 0, "height": 37, "reason": "Timeout", "new_view": 1, "nc": 0, "nf": 1}
2022-06-29T12:27:54.738Z INFO received ChangeView {"validator": 0, "reason": "Timeout", "new view": 1}
2022-06-29T12:27:54.999Z INFO new peer connected {"addr": "172.200.0.2:51942", "peerCount": 6}
2022-06-29T12:27:55.722Z WARN couldn't accept new block, sending signal to restart consensus service {"latest block index": 36, "latest block timestamp": 1656505660722, "time since latest block": "15s", "time till next restart": "15s"}
2022-06-29T12:27:55.722Z INFO restart signal is received from consensus watchdog service, restarting consensus service
2022-06-29T12:27:56.082Z WARN peer disconnected {"addr": "172.200.0.2:51818", "error": "unexpected empty payload: CMDVersion", "peerCount": 5}
2022-06-29T12:27:56.086Z WARN peer disconnected {"addr": "172.200.0.2:20334", "error": "dogs bite", "peerCount": 4}
2022-06-29T12:27:56.086Z WARN peer disconnected {"addr": "172.200.0.2:20334", "error": "dogs bite", "peerCount": 3}
2022-06-29T12:27:56.086Z WARN peer disconnected {"addr": "172.200.0.2:20334", "error": "dogs bite", "peerCount": 2}
2022-06-29T12:27:56.086Z WARN peer disconnected {"addr": "172.200.0.2:51934", "error": "dogs bite", "peerCount": 1}
2022-06-29T12:27:56.086Z INFO starting consensus service
2022-06-29T12:27:56.086Z WARN peer disconnected {"addr": "172.200.0.2:51942", "error": "dogs bite", "peerCount": 0}
2022-06-29T12:27:56.086Z INFO initializing dbft {"height": 37, "view": 0, "index": 2, "role": "Backup"}
2022-06-29T12:27:58.086Z INFO skip change view {"nc": 0, "nf": 3}
2022-06-29T12:27:59.087Z INFO new peer connected {"addr": "172.200.0.2:20334", "peerCount": 1}
2022-06-29T12:28:02.088Z WARN peer disconnected {"addr": "172.200.0.2:20334", "error": "unexpected empty payload: CMDVersion", "peerCount": 0}
2022-06-29T12:28:02.088Z INFO new peer connected {"addr": "172.200.0.2:20334", "peerCount": 1}
2022-06-29T12:28:02.088Z INFO skip change view {"nc": 0, "nf": 3}
2022-06-29T12:28:02.088Z INFO new peer connected {"addr": "172.200.0.2:51952", "peerCount": 2}
2022-06-29T12:28:02.088Z INFO new peer connected {"addr": "172.200.0.1:20333", "peerCount": 3}
2022-06-29T12:28:02.089Z INFO new peer connected {"addr": "172.200.0.1:47376", "peerCount": 4}
2022-06-29T12:28:02.089Z INFO started protocol {"addr": "172.200.0.2:20334", "userAgent": "/NEO-GO:0.99.1-pre-69-gb21a3c99/", "startHeight": 36, "id": 2730790704}
2022-06-29T12:28:02.089Z INFO new peer connected {"addr": "172.200.0.2:20334", "peerCount": 5}
2022-06-29T12:28:02.090Z WARN peer disconnected {"addr": "172.200.0.1:20333", "error": "identical node id", "peerCount": 4}
2022-06-29T12:28:02.090Z INFO new peer connected {"addr": "172.200.0.2:51966", "peerCount": 5}
2022-06-29T12:28:02.090Z INFO new peer connected {"addr": "172.200.0.2:51968", "peerCount": 6}
2022-06-29T12:28:02.090Z INFO new peer connected {"addr": "172.200.0.2:20334", "peerCount": 7}
2022-06-29T12:28:02.090Z INFO new peer connected {"addr": "172.200.0.1:20333", "peerCount": 8}
2022-06-29T12:28:02.090Z INFO new peer connected {"addr": "172.200.0.1:47394", "peerCount": 9}
```
- This branch is outdated and can't be merged immediately.
- It's been 6 months since this experiment and we've had no issues that would require this.
- This logic breaks the basic dBFT assumption of `f` misbehaving nodes: we suddenly can have all nodes misbehaving, and this can lead to unpredictable results.
- Any other potential dBFT problems are a part of #2552 and will be investigated/fixed there properly.