Leader continues trying to heartbeat removed node
What version are you running?
8.21.1
Are you using Docker or Kubernetes to run your system?
No
Are you running a single node or a cluster?
Single node + read-only node
What did you do?
- Started first node (ID=1) with no data
- Started read-only node (ID=2)
- Exited out of rqlited for node 2
- Ran .remove 2 in the rqlite client (a programmatic equivalent is sketched below)
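For reference, here is a rough Go sketch of doing the same removal programmatically. As far as I understand the HTTP API, the CLI's .remove corresponds to a DELETE against the /remove endpoint; node 1's HTTP address of localhost:4001 is an assumption based on the default port layout:

```go
// Rough sketch: remove node 2 via rqlite's HTTP API instead of the CLI's
// ".remove 2". The HTTP address localhost:4001 is assumed, not taken from
// the report above.
package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
)

func main() {
	// DELETE /remove with the node's Raft ID in the JSON body.
	req, err := http.NewRequest(http.MethodDelete,
		"http://localhost:4001/remove", strings.NewReader(`{"id": "2"}`))
	if err != nil {
		panic(err)
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.Status, string(body))
}
```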
What did you expect to happen?
I expected node 1 to stop heartbeating node 2.
What happened instead?
rqlited on node 1 continued trying to heartbeat node 2:
[raft] 2024/02/23 02:31:16 [ERROR] failed to heartbeat to: peer=localhost:4012 backoff time=500ms error="dial tcp 127.0.0.1:4012: connect: connection refused"
[raft] 2024/02/23 02:31:16 [ERROR] failed to heartbeat to: peer=localhost:4012 backoff time=500ms error="dial tcp 127.0.0.1:4012: connect: connection refused"
[store] 2024/02/23 02:31:17 received request to remove node 2
[raft] 2024/02/23 02:31:17 [INFO] updating configuration: command=RemoveServer server-id=2 server-addr= servers="[{Suffrage:Voter ID:1 Address:localhost:4002}]"
[raft] 2024/02/23 02:31:17 [INFO] removed peer, stopping replication: peer=2 last-index=4
[store] 2024/02/23 02:31:17 node 2 removed successfully
[raft] 2024/02/23 02:31:17 [ERROR] failed to heartbeat to: peer=localhost:4012 backoff time=500ms error="dial tcp 127.0.0.1:4012: connect: connection refused"
[store] 2024/02/23 02:31:17 node 2 is not present in configuration
[raft] 2024/02/23 02:31:18 [ERROR] failed to heartbeat to: peer=localhost:4012 backoff time=500ms error="dial tcp 127.0.0.1:4012: connect: connection refused"
[store] 2024/02/23 02:31:18 node 2 is not present in configuration
...
Please show me the output of /status on node 1.
Well, this is strange; it shouldn't. This is something Raft is doing under the covers, not rqlite itself. I'll look into it; it definitely doesn't look right.
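For context, rqlite is built on hashicorp/raft, and node removal ultimately goes through that library's RemoveServer call. Below is a rough, illustrative sketch of the removal at that layer; it is not rqlite's actual code, and `r` is assumed to be an existing *raft.Raft handle on the leader:

```go
// Illustrative sketch of node removal at the hashicorp/raft layer.
// Not rqlite's actual code; "r" is a placeholder for a live leader handle,
// and the ID "2" matches the read-only node in the report above.
package example

import (
	"fmt"

	"github.com/hashicorp/raft"
)

func removeNode2(r *raft.Raft) error {
	// Ask the leader to remove node 2 from the cluster configuration.
	if err := r.RemoveServer(raft.ServerID("2"), 0, 0).Error(); err != nil {
		return fmt.Errorf("remove failed: %w", err)
	}

	// The committed configuration no longer contains node 2 ...
	future := r.GetConfiguration()
	if err := future.Error(); err != nil {
		return fmt.Errorf("get configuration failed: %w", err)
	}
	for _, srv := range future.Configuration().Servers {
		fmt.Printf("remaining server: %s %s\n", srv.ID, srv.Address)
	}

	// ... but the leader's per-peer replication and heartbeating appears to
	// be torn down asynchronously, which would explain heartbeat errors to
	// the old peer still showing up in the log for a while afterwards.
	return nil
}
```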
After .remove 2, /status correctly only shows the first node:
{
  "id": "1",
  "addr": "localhost:4002",
  "suffrage": "Voter"
}
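A minimal sketch of how I'm checking this programmatically, assuming node 1's HTTP API is on localhost:4001; it just dumps the raw /status JSON rather than decoding a particular field, since the exact layout of the node list varies between versions:

```go
// Rough sketch: fetch /status from node 1 and print the raw JSON so the
// remaining cluster membership can be inspected. The HTTP address
// localhost:4001 is assumed.
package main

import (
	"fmt"
	"io"
	"net/http"
)

func main() {
	resp, err := http.Get("http://localhost:4001/status?pretty")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
}
```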
The heartbeat actually stops after a while:
[store] 2024/02/23 14:01:49 received request to remove node 2
[raft] 2024/02/23 14:01:49 [INFO] updating configuration: command=RemoveServer server-id=2 server-addr= servers="[{Suffrage:Voter ID:1 Address:localhost:4002}]"
[raft] 2024/02/23 14:01:49 [INFO] removed peer, stopping replication: peer=2 last-index=6
[store] 2024/02/23 14:01:49 node 2 removed successfully
[raft] 2024/02/23 14:01:49 [ERROR] failed to heartbeat to: peer=localhost:4012 backoff time=500ms error="dial tcp 10.189.38.226:4012: connect: connection refused"
[store] 2024/02/23 14:01:49 node 2 is not present in configuration
...
[store] 2024/02/23 14:02:01 node 2 is not present in configuration
[raft] 2024/02/23 14:02:02 [ERROR] failed to appendEntries to: peer="{Nonvoter 2 localhost:4012}" error="dial tcp 10.189.38.226:4012: connect: connection refused"
Then it stops. That's about a 30-second gap between removing the node and the heartbeats stopping.
That's kinda long. I'll run some tests to see if this is a regression in stopping the heartbeat.
I'd be curious whether you clearly and consistently see better performance, i.e. heartbeating stopping much sooner, with, say, 8.0.
I ran a quick test with 7.21. With that software it can take many seconds before heartbeating stops. Sometimes it's fast, other times it's not. So this behaviour is not new, and appears to be something in the Raft system itself. I need to write a simple test case to check, however. We could also ask on the Raft GitHub repo.
I believe this is a Raft limitation. I also have a toy KV store that I built, and I see the same behavior when I remove a node: the heartbeat still continues for some time.
OK, I don't think there is anything I can do here in that case.