Leader continues trying to heartbeat removed node
What version are you running?
8.21.1
Are you using Docker or Kubernetes to run your system?
No
Are you running a single node or a cluster?
Single node + read-only node
What did you do?
- Started first node (ID=1) with no data
- Started read-only node (ID=2)
- Exited out of rqlited for node 2
- Ran .remove 2 in the rqlite client (a programmatic equivalent is sketched below)
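For reference, here is a rough Go sketch of doing the same removal programmatically. As far as I understand the HTTP API, the CLI's .remove corresponds to a DELETE against the /remove endpoint; node 1's HTTP address of localhost:4001 is an assumption based on the default port layout:

```go
// Rough sketch: remove node 2 via rqlite's HTTP API instead of the CLI's
// ".remove 2". The HTTP address localhost:4001 is assumed, not taken from
// the report above.
package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
)

func main() {
	// DELETE /remove with the node's Raft ID in the JSON body.
	req, err := http.NewRequest(http.MethodDelete,
		"http://localhost:4001/remove", strings.NewReader(`{"id": "2"}`))
	if err != nil {
		panic(err)
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.Status, string(body))
}
```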
What did you expect to happen?
I expected node 1 to stop heartbeating node 2.
What happened instead?
rqlited on node 1 continued trying to heartbeat node 2:
[raft] 2024/02/23 02:31:16 [ERROR] failed to heartbeat to: peer=localhost:4012 backoff time=500ms error="dial tcp 127.0.0.1:4012: connect: connection refused"
[raft] 2024/02/23 02:31:16 [ERROR] failed to heartbeat to: peer=localhost:4012 backoff time=500ms error="dial tcp 127.0.0.1:4012: connect: connection refused"
[store] 2024/02/23 02:31:17 received request to remove node 2
[raft] 2024/02/23 02:31:17 [INFO] updating configuration: command=RemoveServer server-id=2 server-addr= servers="[{Suffrage:Voter ID:1 Address:localhost:4002}]"
[raft] 2024/02/23 02:31:17 [INFO] removed peer, stopping replication: peer=2 last-index=4
[store] 2024/02/23 02:31:17 node 2 removed successfully
[raft] 2024/02/23 02:31:17 [ERROR] failed to heartbeat to: peer=localhost:4012 backoff time=500ms error="dial tcp 127.0.0.1:4012: connect: connection refused"
[store] 2024/02/23 02:31:17 node 2 is not present in configuration
[raft] 2024/02/23 02:31:18 [ERROR] failed to heartbeat to: peer=localhost:4012 backoff time=500ms error="dial tcp 127.0.0.1:4012: connect: connection refused"
[store] 2024/02/23 02:31:18 node 2 is not present in configuration
...
Please show me the output of /status on node 1.
Well, this is strange; it shouldn't. This is something Raft is doing under the covers, not rqlite itself. I'll look into it; it definitely doesn't look right.
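For context, rqlite is built on hashicorp/raft, and node removal ultimately goes through that library's RemoveServer call. Below is a rough, illustrative sketch of the removal at that layer; it is not rqlite's actual code, and `r` is assumed to be an existing *raft.Raft handle on the leader:

```go
// Illustrative sketch of node removal at the hashicorp/raft layer.
// Not rqlite's actual code; "r" is a placeholder for a live leader handle,
// and the ID "2" matches the read-only node in the report above.
package example

import (
	"fmt"

	"github.com/hashicorp/raft"
)

func removeNode2(r *raft.Raft) error {
	// Ask the leader to remove node 2 from the cluster configuration.
	if err := r.RemoveServer(raft.ServerID("2"), 0, 0).Error(); err != nil {
		return fmt.Errorf("remove failed: %w", err)
	}

	// The committed configuration no longer contains node 2 ...
	future := r.GetConfiguration()
	if err := future.Error(); err != nil {
		return fmt.Errorf("get configuration failed: %w", err)
	}
	for _, srv := range future.Configuration().Servers {
		fmt.Printf("remaining server: %s %s\n", srv.ID, srv.Address)
	}

	// ... but the leader's per-peer replication and heartbeating appears to
	// be torn down asynchronously, which would explain heartbeat errors to
	// the old peer still showing up in the log for a while afterwards.
	return nil
}
```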
After .remove 2, /status correctly only shows the first node:
{
  "id": "1",
  "addr": "localhost:4002",
  "suffrage": "Voter"
}
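A minimal sketch of how I'm checking this programmatically, assuming node 1's HTTP API is on localhost:4001; it just dumps the raw /status JSON rather than decoding a particular field, since the exact layout of the node list varies between versions:

```go
// Rough sketch: fetch /status from node 1 and print the raw JSON so the
// remaining cluster membership can be inspected. The HTTP address
// localhost:4001 is assumed.
package main

import (
	"fmt"
	"io"
	"net/http"
)

func main() {
	resp, err := http.Get("http://localhost:4001/status?pretty")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
}
```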
The heartbeat actually stops after a while:
[store] 2024/02/23 14:01:49 received request to remove node 2
[raft] 2024/02/23 14:01:49 [INFO] updating configuration: command=RemoveServer server-id=2 server-addr= servers="[{Suffrage:Voter ID:1 Address:localhost:4002}]"
[raft] 2024/02/23 14:01:49 [INFO] removed peer, stopping replication: peer=2 last-index=6
[store] 2024/02/23 14:01:49 node 2 removed successfully
[raft] 2024/02/23 14:01:49 [ERROR] failed to heartbeat to: peer=localhost:4012 backoff time=500ms error="dial tcp 10.189.38.226:4012: connect: connection refused"
[store] 2024/02/23 14:01:49 node 2 is not present in configuration
...
[store] 2024/02/23 14:02:01 node 2 is not present in configuration
[raft] 2024/02/23 14:02:02 [ERROR] failed to appendEntries to: peer="{Nonvoter 2 localhost:4012}" error="dial tcp 10.189.38.226:4012: connect: connection refused"
Then it stops. That's about a 30-second gap between removing the node and the heartbeats stopping.
That's kinda long. I'll run some tests to see if this is a regression in stopping the heartbeat.
I'd be curious whether you clearly and consistently see better performance, i.e. heartbeating stopping much sooner, with, say, 8.0.
I ran a quick test with 7.21. With that software it can take many seconds before heartbeating stops. Sometimes it's fast, other times it's not. So this behaviour is not new, and appears to be something in the Raft system itself. I need to write a simple test case to check, however. We could also ask on the Raft GitHub repo.
I believe this is a Raft limitation. I also have a toy KV store that I built, and I see the same behavior when I remove a node: the heartbeat still continues for some time.
OK, I don't think there is anything I can do here in that case.