[BUG] Nodes removed from cluster via FORGET will eventually reintroduce themselves since 8.1
Describe the bug
Since https://github.com/valkey-io/valkey/pull/1307, forgotten nodes reintroduce themselves into the cluster if they are still running when the ban-list entry expires.
To reproduce
Create a cluster:
utils/create-cluster/create-cluster start && utils/create-cluster/create-cluster create
Forget one node:
for i in {30200..30204}; do echo CLUSTER FORGET b0f0533189bdfe4db091bee55076a067f062a565 | src/valkey-cli -p $i; done
Wait and watch it add itself back:
127.0.0.1:30200> CLUSTER NODES
f7685174f0cfe57f5bfadb2578a6bb9b4ff534db 127.0.0.1:30200@40200 myself,master - 0 0 7 connected 0-2730 13653-16383
bd0907d534a02b7f1fa0557064448fe506b7f113 127.0.0.1:30201@40201 master - 0 1761870957659 2 connected 2731-5460
721e5634aee2fb4c14072d6931bd6500a1be2d96 127.0.0.1:30203@40203 master - 0 1761870958061 4 connected 8192-10922
681f9c9da91fcaec68fdf80097ddebc842cec1f3 127.0.0.1:30204@40204 master - 0 1761870958062 5 connected 10923-13652
e30c5f8cfdb1ecb8d094de8716097aaf52c0eb02 127.0.0.1:30202@40202 master - 0 1761870958062 3 connected 5461-8191
...
127.0.0.1:30200> CLUSTER NODES
f7685174f0cfe57f5bfadb2578a6bb9b4ff534db 127.0.0.1:30200@40200 myself,master - 0 0 7 connected 0-2730 13653-16383
bd0907d534a02b7f1fa0557064448fe506b7f113 127.0.0.1:30201@40201 master - 0 1761871117483 2 connected 2731-5460
b0f0533189bdfe4db091bee55076a067f062a565 127.0.0.1:30205@40205 slave f7685174f0cfe57f5bfadb2578a6bb9b4ff534db 0 1761871117483 7 connected
721e5634aee2fb4c14072d6931bd6500a1be2d96 127.0.0.1:30203@40203 master - 0 1761871118086 4 connected 8192-10922
681f9c9da91fcaec68fdf80097ddebc842cec1f3 127.0.0.1:30204@40204 master - 0 1761871118086 5 connected 10923-13652
e30c5f8cfdb1ecb8d094de8716097aaf52c0eb02 127.0.0.1:30202@40202 master - 0 1761871118086 3 connected 5461-8191
On 8.0, the node stays forgotten.
Expected behavior
Nodes that are forgotten should not be added back to the cluster, per the CLUSTER FORGET documentation:
in order for a node to be completely removed from a cluster, the CLUSTER FORGET command must be sent to all the remaining nodes, regardless of the fact they are primaries or replicas.
Additional information
This happens because the forgotten node rejoins via automatically sent MEET packets:
...
1503237:S 31 Oct 2025 00:36:27.906 * Sending MEET packet to node e30c5f8cfdb1ecb8d094de8716097aaf52c0eb02 () because there is no inbound link for it
1503237:S 31 Oct 2025 00:36:27.907 * Successfully completed handshake with e30c5f8cfdb1ecb8d094de8716097aaf52c0eb02 ()
If the forgotten node is not shut down before the ban-list entry expires, these MEET attempts eventually succeed.
Damn! Seems obvious now.
This is kind of surprising behavior. I thought the logic was that we only tried to re-establish the connection in the handshake phase. There is also the case where the node is just disconnected for a bit, and then it shouldn't try to re-send the MEET packet.
Wouldn't it be possible to mitigate this issue by only executing that logic if the time elapsed since inbound_link_freed_time is less than the ban-list (BlackList) TTL?
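To make the suggestion concrete, here is a minimal sketch of one possible reading of it: re-send MEET only while the time since the inbound link was freed is above the handshake timeout but still below the ban-list TTL. The helper name `should_resend_meet`, its parameters, and the timeout values in the usage note are hypothetical and not taken from the Valkey source. Only `inbound_link_freed_time` comes from the discussion above.

```c
#include <stdbool.h>
#include <stdint.h>

typedef int64_t mstime_t; /* milliseconds, assumed to match the server's time type */

/* Hypothetical helper, not a Valkey function.
 * Returns true only while the inbound link has been missing for longer than
 * the handshake timeout but for less than the ban-list TTL. Once the ban-list
 * TTL has elapsed, the node stops re-sending MEET, so a forgotten node would
 * no longer reintroduce itself after the ban-list entry expires. */
static bool should_resend_meet(mstime_t now,
                               mstime_t inbound_link_freed_time,
                               mstime_t handshake_timeout_ms,
                               mstime_t ban_list_ttl_ms) {
    mstime_t elapsed = now - inbound_link_freed_time;
    return elapsed > handshake_timeout_ms && elapsed < ban_list_ttl_ms;
}
```

With an assumed handshake timeout of 2000 ms and a ban-list TTL of 60000 ms, a link freed 5 seconds ago would still trigger a MEET, while one freed 95 seconds ago would not. Whether this window is wide enough to cover the handshake-timeout case that the original PR targets is an open question this sketch does not answer.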