rqlite icon indicating copy to clipboard operation
rqlite copied to clipboard

Leader continues trying to heartbeat removed node

Open aderouineau opened this issue 1 year ago • 6 comments

What version are you running?

8.21.1

Are you using Docker or Kubernetes to run your system?

No

Are you running a single node or a cluster?

Single node + read-only node

What did you do?

  1. Started first node (ID=1) with no data
  2. Started read-only node (ID=2)
  3. Exited out of rqlited for node 2
  4. Ran .remove 2 in rqlite client

What did you expect to happen?

I expected node 1 to stop heartbeating node 2.

What happened instead?

rqlited on node 1 continued trying to heartbeat node 2:

[raft] 2024/02/23 02:31:16 [ERROR] failed to heartbeat to: peer=localhost:4012 backoff time=500ms error="dial tcp 127.0.0.1:4012: connect: connection refused"
[raft] 2024/02/23 02:31:16 [ERROR] failed to heartbeat to: peer=localhost:4012 backoff time=500ms error="dial tcp 127.0.0.1:4012: connect: connection refused"
[store] 2024/02/23 02:31:17 received request to remove node 2
[raft] 2024/02/23 02:31:17 [INFO]  updating configuration: command=RemoveServer server-id=2 server-addr= servers="[{Suffrage:Voter ID:1 Address:localhost:4002}]"
[raft] 2024/02/23 02:31:17 [INFO]  removed peer, stopping replication: peer=2 last-index=4
[store] 2024/02/23 02:31:17 node 2 removed successfully
[raft] 2024/02/23 02:31:17 [ERROR] failed to heartbeat to: peer=localhost:4012 backoff time=500ms error="dial tcp 127.0.0.1:4012: connect: connection refused"
[store] 2024/02/23 02:31:17 node 2 is not present in configuration
[raft] 2024/02/23 02:31:18 [ERROR] failed to heartbeat to: peer=localhost:4012 backoff time=500ms error="dial tcp 127.0.0.1:4012: connect: connection refused"
[store] 2024/02/23 02:31:18 node 2 is not present in configuration
...

aderouineau avatar Feb 23 '24 02:02 aderouineau

Please show me the output of status on node 1.

otoolep avatar Feb 23 '24 12:02 otoolep

Well this is strange, it shouldn't. This is something that Raft is doing underneath the covers, it's not rqlite. I'll look into it, definitely doesn't look right.

otoolep avatar Feb 23 '24 12:02 otoolep

After .remove 2, /status correctly only shows the first node:

      {
        "id": "1",
        "addr": "localhost:4002",
        "suffrage": "Voter"
      }

The heartbeat actually stops after a while:

[store] 2024/02/23 14:01:49 received request to remove node 2
[raft] 2024/02/23 14:01:49 [INFO]  updating configuration: command=RemoveServer server-id=2 server-addr= servers="[{Suffrage:Voter ID:1 Address:localhost:4002}]"
[raft] 2024/02/23 14:01:49 [INFO]  removed peer, stopping replication: peer=2 last-index=6
[store] 2024/02/23 14:01:49 node 2 removed successfully
[raft] 2024/02/23 14:01:49 [ERROR] failed to heartbeat to: peer=localhost:4012 backoff time=500ms error="dial tcp 10.189.38.226:4012: connect: connection refused"
[store] 2024/02/23 14:01:49 node 2 is not present in configuration
...
[store] 2024/02/23 14:02:01 node 2 is not present in configuration
[raft] 2024/02/23 14:02:02 [ERROR] failed to appendEntries to: peer="{Nonvoter 2 localhost:4012}" error="dial tcp 10.189.38.226:4012: connect: connection refused"

Then it stops. That's about a 30 seconds gap between removing the node and the heartbeat stopping.

aderouineau avatar Feb 23 '24 14:02 aderouineau

That's kinda long. I'll run some tests to see if this is a regression in stopping the heartbeat.

otoolep avatar Feb 23 '24 15:02 otoolep

I'd be curious if you're clearly and consistently seeing better performance i.e. heartbeating stops much sooner with, say, 8.0.

otoolep avatar Feb 23 '24 21:02 otoolep

I ran a quick test with 7.21. With that software it can many seconds before heartbeating stops. Sometimes it's fast, other times it's not. So this behaviour is not new, and appears to be something in the Raft system itself. I need to write a simple test case to check however. We could also ask on the Raft GitHub repo.

otoolep avatar Feb 24 '24 15:02 otoolep

I believe this is a raft limitation. I also have a toy kv store that I built and I see the same behavior when I remove a node, the heartbeat still continues for some time.

mauri870 avatar Feb 27 '24 21:02 mauri870

OK, I don't think there is anything I can do here in that case.

otoolep avatar Feb 28 '24 12:02 otoolep