
consensus: add dBFT watchdog

Open AnnaShaleva opened this issue 2 years ago • 8 comments

Requested by @realloc.

Problem

dBFT is not perfect, and sometimes it hangs in a deadlock.

Solution

Add a dBFT watchdog. The watchdog tracks the time elapsed since the last accepted block and enforces a consensus reset once a threshold is exceeded.
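
A minimal sketch of the idea in Go, assuming hypothetical `lastBlockTime` and `restartConsensus` hooks (these are not actual neo-go APIs): a background loop compares the age of the last accepted block against the threshold and triggers a consensus reload when it is exceeded.

```go
// A hedged sketch only: lastBlockTime and restartConsensus are
// hypothetical hooks, not actual neo-go APIs.
package main

import (
	"log"
	"time"
)

const watchdogThreshold = 15 * time.Minute

// watch periodically checks how long ago the last block was accepted
// and asks the node to reload its consensus service once the
// threshold is exceeded.
func watch(lastBlockTime func() time.Time, restartConsensus func()) {
	ticker := time.NewTicker(time.Minute)
	defer ticker.Stop()
	for range ticker.C {
		if time.Since(lastBlockTime()) > watchdogThreshold {
			log.Println("no blocks accepted within threshold, restarting consensus")
			restartConsensus()
		}
	}
}

func main() {
	last := time.Now() // stub: would be fed by the block-acceptance path
	watch(
		func() time.Time { return last },
		func() { log.Println("consensus service reloaded") },
	)
}
```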

No documentation update is included. If we decide to include this feature in master, a separate PR will be created.

Depends on #2567 and #2566.

AnnaShaleva avatar Jun 27 '22 12:06 AnnaShaleva

The purpose of the watchdog is to prevent any chain-halt situation that we couldn't foresee. That's a requirement for on-prem installs, and it's not related to #2552 at all.

In on-prem installs, where neo-go is used as a database, we can't restart a neo-go instance via an external system, as it affects the operation of the whole stack. At least three times in production we had a situation with a locked consensus mechanism causing the whole cluster to malfunction, which was fixed by a simultaneous restart of the neo-go nodes. To prevent such situations we need this kind of watchdog performing a simultaneous consensus mechanism restart without affecting RPC operations. As we discussed with @fyrchik and @AnnaShaleva, the restart could be tied to the nearest round-number minute or a similar point in time.
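
To illustrate the time-alignment idea (a hedged sketch, not code from this PR; `nextAligned` is a hypothetical helper): with NTP-synced clocks, every node can independently compute the same restart instant by rounding the current time up to the next whole interval.

```go
// A hedged sketch of the alignment idea; nextAligned is a
// hypothetical helper, not code from this PR.
package main

import (
	"fmt"
	"time"
)

// nextAligned returns the next wall-clock instant that is a whole
// multiple of interval, e.g. the next round minute for time.Minute.
// Nodes with NTP-synced clocks compute the same instant independently,
// so they can restart consensus simultaneously without coordination.
func nextAligned(now time.Time, interval time.Duration) time.Time {
	return now.Truncate(interval).Add(interval)
}

func main() {
	now := time.Now()
	fmt.Println("next synchronized restart point:", nextAligned(now, time.Minute))
}
```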

As we need this anyway, let's also try to attach some instructions for operations people on how to gather the most useful debug information when they hit a consensus halt again.

realloc avatar Jun 28 '22 03:06 realloc

we can't restart a neo-go instance via an external system

was fixed by a simultaneous restart of the neo-go nodes

Hmm.

My presumption is that all of the services are monitored anyway and if they are, external entities can do whatever they want if needed.

we had a situation with a locked consensus mechanism

dBFT runs over the internet on chains with many millions of blocks. We know it's not perfect, but it works more often than not. If we have any problem with dBFT it needs to be properly investigated and fixed.

The problem with this PR is that, in fact, we don't know how it will affect the system at all. Each node has a clock and a timer of its own; it's not the same as when you externally restart all of them within a very short time frame.

roman-khimov avatar Jun 28 '22 06:06 roman-khimov

As the code comment says, this option is meant to be enabled in on-prem environments only.

My presumption is that all of the services are monitored anyway and if they are, external entities can do whatever they want if needed.

For the target environment this presumption is wrong.

dBFT runs over the internet on chains with many millions of blocks. We know it's not perfect, but it works more often than not. If we have any problem with dBFT it needs to be properly investigated and fixed.

Again, this is irrelevant for the target case.

The problem with this PR is that, in fact, we don't know how it will affect the system at all.

It will give the isolated system a chance to self-recover after an unforeseen problem. We don't plan to use it in a public CN setup.

Each node has a clock and a timer of its own; it's not the same as when you externally restart all of them within a very short time frame.

Local time is NTP-controlled, and there is no external system to restart the service. Again, we don't want a full restart, because other services actively read the chain, and with this watchdog we can avoid an unnecessary service interruption.

realloc avatar Jun 28 '22 06:06 realloc

Ready for review; testing is in progress.

AnnaShaleva avatar Jun 28 '22 11:06 AnnaShaleva

Does this work for more than 15m? I think resetWatchdogTimer() should be called after every block, and I don't see any code doing this.

That's actually an intentional optimisation: we shouldn't reset the timer on each block. Imagine that we start the timer from accepted block N with timestamp T. The only thing we're interested in is whether the block height has increased by the (T+15min) timestamp. If it hasn't, then block N+1 can't be accepted, so we should reload the service. If it has, then block N+1 (and possibly blocks N+2, N+3, ..., up to the current height N+k) have already been accepted. If we're now stuck at block N+k, the latest timestamp is that of the N+k-th block, and what we have to do is reset the timer with respect to the N+k timestamp and the current time to reach the overall 15-minute threshold. Since block N+k arrived after the timer was armed, this reset value will always be > 0.
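
Roughly, the re-arm logic described above could look like this in Go; the names (`watchdog`, `chainReader` and its methods) are illustrative, not the PR's actual code. The timer fires 15 minutes after the block it was armed against, and on expiry it either triggers the reload or is re-armed against the latest block's timestamp.

```go
// Illustrative names only (watchdog, chainReader); this is not the
// PR's actual code.
package consensus

import "time"

// chainReader is whatever gives the watchdog access to chain state.
type chainReader interface {
	BlockHeight() uint32
	LatestBlockTime() time.Time // timestamp of the latest accepted block
}

type watchdog struct {
	threshold  time.Duration // 15 minutes in the example above
	timer      *time.Timer
	chain      chainReader
	lastHeight uint32
	restart    func() // reloads the consensus service
}

// onTimer runs when the timer fires, i.e. `threshold` after the block
// it was last armed against.
func (w *watchdog) onTimer() {
	h := w.chain.BlockHeight()
	if h == w.lastHeight {
		// No new block within the threshold: consensus is stuck,
		// reload it and start a fresh window.
		w.restart()
		w.timer.Reset(w.threshold)
		return
	}
	// Blocks N+1..N+k were accepted meanwhile; re-arm so the timer
	// expires `threshold` after block N+k's timestamp. That block
	// arrived after the timer was armed, so the remaining duration
	// is always positive.
	w.lastHeight = h
	w.timer.Reset(w.threshold - time.Since(w.chain.LatestBlockTime()))
}
```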

AnnaShaleva avatar Jun 28 '22 12:06 AnnaShaleva

reset the timer with respect to the N+k timestamp and the current time to reach the overall 15-minute threshold

I think it would be clearer if this timer only expired when there is actually a need to reset; to get that, it would be restarted with every block. At least that's what I'd expect from a watchdog-like thing, even though I can't really call this one a watchdog.
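
For contrast, a sketch of the conventional watchdog pattern described here, with hypothetical names: every accepted block kicks the timer, so it only ever fires when the chain is genuinely stuck.

```go
// Hypothetical sketch of the kick-per-block watchdog, not the PR's
// code: the timer is restarted on every accepted block and only ever
// fires when the chain is actually stuck.
package consensus

import "time"

func watchBlocks(blockAccepted <-chan struct{}, threshold time.Duration, restart func()) {
	t := time.NewTimer(threshold)
	defer t.Stop()
	for {
		select {
		case <-blockAccepted:
			// Kick the watchdog: stop and drain before Reset, since
			// the timer may have fired concurrently.
			if !t.Stop() {
				<-t.C
			}
			t.Reset(threshold)
		case <-t.C:
			restart() // no block for `threshold`: reset consensus
			t.Reset(threshold)
		}
	}
}
```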

roman-khimov avatar Jun 28 '22 12:06 roman-khimov

Rebased onto the current master.

AnnaShaleva avatar Jun 28 '22 16:06 AnnaShaleva

It works, more or less: at least peers are disconnected before the consensus reload and try to connect again after the new consensus instance is created.

2022-06-29T12:27:54.737Z	INFO	request change view	{"view": 0, "height": 37, "reason": "Timeout", "new_view": 1, "nc": 0, "nf": 1}
2022-06-29T12:27:54.738Z	INFO	received ChangeView	{"validator": 0, "reason": "Timeout", "new view": 1}
2022-06-29T12:27:54.999Z	INFO	new peer connected	{"addr": "172.200.0.2:51942", "peerCount": 6}
2022-06-29T12:27:55.722Z	WARN	couldn't accept new block, sending signal to restart consensus service	{"latest block index": 36, "latest block timestamp": 1656505660722, "time since latest block": "15s", "time till next restart": "15s"}
2022-06-29T12:27:55.722Z	INFO	restart signal is received from consensus watchdog service, restarting consensus service
2022-06-29T12:27:56.082Z	WARN	peer disconnected	{"addr": "172.200.0.2:51818", "error": "unexpected empty payload: CMDVersion", "peerCount": 5}
2022-06-29T12:27:56.086Z	WARN	peer disconnected	{"addr": "172.200.0.2:20334", "error": "dogs bite", "peerCount": 4}
2022-06-29T12:27:56.086Z	WARN	peer disconnected	{"addr": "172.200.0.2:20334", "error": "dogs bite", "peerCount": 3}
2022-06-29T12:27:56.086Z	WARN	peer disconnected	{"addr": "172.200.0.2:20334", "error": "dogs bite", "peerCount": 2}
2022-06-29T12:27:56.086Z	WARN	peer disconnected	{"addr": "172.200.0.2:51934", "error": "dogs bite", "peerCount": 1}
2022-06-29T12:27:56.086Z	INFO	starting consensus service
2022-06-29T12:27:56.086Z	WARN	peer disconnected	{"addr": "172.200.0.2:51942", "error": "dogs bite", "peerCount": 0}
2022-06-29T12:27:56.086Z	INFO	initializing dbft	{"height": 37, "view": 0, "index": 2, "role": "Backup"}
2022-06-29T12:27:58.086Z	INFO	skip change view	{"nc": 0, "nf": 3}
2022-06-29T12:27:59.087Z	INFO	new peer connected	{"addr": "172.200.0.2:20334", "peerCount": 1}
2022-06-29T12:28:02.088Z	WARN	peer disconnected	{"addr": "172.200.0.2:20334", "error": "unexpected empty payload: CMDVersion", "peerCount": 0}
2022-06-29T12:28:02.088Z	INFO	new peer connected	{"addr": "172.200.0.2:20334", "peerCount": 1}
2022-06-29T12:28:02.088Z	INFO	skip change view	{"nc": 0, "nf": 3}
2022-06-29T12:28:02.088Z	INFO	new peer connected	{"addr": "172.200.0.2:51952", "peerCount": 2}
2022-06-29T12:28:02.088Z	INFO	new peer connected	{"addr": "172.200.0.1:20333", "peerCount": 3}
2022-06-29T12:28:02.089Z	INFO	new peer connected	{"addr": "172.200.0.1:47376", "peerCount": 4}
2022-06-29T12:28:02.089Z	INFO	started protocol	{"addr": "172.200.0.2:20334", "userAgent": "/NEO-GO:0.99.1-pre-69-gb21a3c99/", "startHeight": 36, "id": 2730790704}
2022-06-29T12:28:02.089Z	INFO	new peer connected	{"addr": "172.200.0.2:20334", "peerCount": 5}
2022-06-29T12:28:02.090Z	WARN	peer disconnected	{"addr": "172.200.0.1:20333", "error": "identical node id", "peerCount": 4}
2022-06-29T12:28:02.090Z	INFO	new peer connected	{"addr": "172.200.0.2:51966", "peerCount": 5}
2022-06-29T12:28:02.090Z	INFO	new peer connected	{"addr": "172.200.0.2:51968", "peerCount": 6}
2022-06-29T12:28:02.090Z	INFO	new peer connected	{"addr": "172.200.0.2:20334", "peerCount": 7}
2022-06-29T12:28:02.090Z	INFO	new peer connected	{"addr": "172.200.0.1:20333", "peerCount": 8}
2022-06-29T12:28:02.090Z	INFO	new peer connected	{"addr": "172.200.0.1:47394", "peerCount": 9}

AnnaShaleva avatar Jun 29 '22 12:06 AnnaShaleva

  1. This branch is outdated and can't be merged immediately.
  2. It's been six months since this experiment, and we've had no issues that would require it.
  3. This logic breaks the basic dBFT assumption of at most f misbehaving nodes: suddenly all nodes can misbehave at once, which can lead to unpredictable results.
  4. Any other potential dBFT problems are a part of #2552 and will be investigated/fixed there properly.

roman-khimov avatar Dec 28 '22 07:12 roman-khimov