[Jepsen] Failure to recover after network partition

Open pcholakov opened this issue 8 months ago • 6 comments

The cluster failed to get back to a healthy state after the network partitions. Our liveness checker only looks at the latest recovery period, but looking at the graph below, the same thing happened after the network partitions ended at the 10s, 30s, and 65s marks:

https://github.com/restatedev/jepsen/actions/runs/14188229049

[Image: recovery graph from the failed run]

pcholakov avatar Apr 01 '25 08:04 pcholakov

Is the issue reproducible? If yes, can we capture the debug logs to see what's happening?

tillrohrmann avatar Apr 01 '25 13:04 tillrohrmann

I have not been able to reproduce it even once, but I have annotated logs from this morning's failure and a good theory about what's going on. The network partitions cause the cluster controller leadership to move around; sometimes nodes regain connectivity but still believe others to be "dead" following the partition. Two cluster controllers can then send conflicting control-processors commands to their peers, stopping the processors.
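To make the suspected race concrete, here's a minimal sketch (purely illustrative types and names, not Restate's actual control-plane code) of how a controller acting on a stale liveness view can issue commands that conflict with the real leader's placement for the same partition:

use std::collections::BTreeMap;

// Illustrative stand-in for the control-processors command a cluster
// controller sends to its peers.
#[derive(Debug, Clone, Copy)]
enum RunMode {
    Leader,
    Follower,
    Stop,
}

#[derive(Debug)]
struct ControlProcessors {
    from_controller: &'static str,
    target_node: &'static str,
    partition_id: u32,
    command: RunMode,
}

// Each controller plans placements from its *local* view of which nodes are
// alive. Right after a partition heals, that view can still mark reachable
// peers as dead, so stale entries turn into stop/demote commands.
fn plan(
    controller: &'static str,
    alive: &BTreeMap<&'static str, bool>,
    partition_id: u32,
) -> Vec<ControlProcessors> {
    // Pick the first node this controller believes to be alive as leader.
    let leader = alive
        .iter()
        .find_map(|(node, up)| if *up { Some(*node) } else { None });
    alive
        .iter()
        .map(|(&node, &up)| ControlProcessors {
            from_controller: controller,
            target_node: node,
            partition_id,
            command: if !up {
                RunMode::Stop
            } else if Some(node) == leader {
                RunMode::Leader
            } else {
                RunMode::Follower
            },
        })
        .collect()
}

fn main() {
    // n1's view after the heal: everyone is back.
    let n1_view = BTreeMap::from([("n1", true), ("n2", true), ("n3", true)]);
    // n3 still believes n1 and n2 are dead from the last partition.
    let n3_view = BTreeMap::from([("n1", false), ("n2", false), ("n3", true)]);

    // Both nodes believe they are the cluster controller, so the partition
    // processors receive an interleaving of run-as-leader/follower and stop
    // commands from two different controllers.
    let from_n1 = plan("N1:2", &n1_view, 0);
    let from_n3 = plan("N3:1", &n3_view, 0);
    for cmd in from_n1.iter().chain(from_n3.iter()) {
        println!("{cmd:?}");
    }
}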

Here's a distilled view of network partitions and cluster controller changes and actions:

2025-04-01 06:16:01.692582 n1 Cluster controller switching to leader mode
2025-04-01 06:16:02.792671 n1 Starting new partition processor to run as Follower partition_id=0 from_cluster_controller=N1:2
2025-04-01 06:16:02.794115 n1 Asked to run as follower by cluster controller partition_id=0 from_cluster_controller=N1:2
2025-04-01 06:16:03.624098 n2 Starting new partition processor to run as Leader partition_id=0 from_cluster_controller=N1:2
2025-04-01 06:16:03.676984 n3 Starting new partition processor to run as Follower partition_id=0 from_cluster_controller=N1:2
2025-04-01 06:16:12,587    jp [:isolated {"node3" #{"node2" "node1"}, "node2" #{"node3"}, "node1" #{"node3"}}]
2025-04-01 06:16:13.675571 n3 Cluster controller switching to leader mode
2025-04-01 06:16:17,706    jp :network-healed
2025-04-01 06:16:17.769233 n1 Asked to run as follower by cluster controller partition_id=0 from_cluster_controller=N3:1
2025-04-01 06:16:17.871442 n3 Asked to run as leader by cluster controller. Obtaining required leader epoch partition_id=0 from_cluster_controller=N3:1
2025-04-01 06:16:18.333548 n3 Cluster controller switching to follower mode, I think the leader is N1:2
2025-04-01 06:16:22,721    jp [:isolated {"node1" #{"node2" "node3"}, "node2" #{"node1"}, "node3" #{"node1"}}]
2025-04-01 06:16:24.238751 n2 Cluster controller switching to leader mode
2025-04-01 06:16:27,826    jp :network-healed
2025-04-01 06:16:28.585716 n2 Cluster controller switching to follower mode, I think the leader is N1:2
2025-04-01 06:16:28.652621 n1 Asked to stop partition processor by cluster controller partition_id=0 from_cluster_controller=N1:2
2025-04-01 06:16:28.660490 n1 Starting new partition processor to run as Leader partition_id=0 from_cluster_controller=N1:2
2025-04-01 06:16:29.553875 n3 Asked to run as follower by cluster controller partition_id=0 from_cluster_controller=N1:2
2025-04-01 06:16:29.618232 n3 Asked to run as follower by cluster controller partition_id=0 from_cluster_controller=N1:2
2025-04-01 06:16:29.669768 n3 Asked to run as follower by cluster controller partition_id=0 from_cluster_controller=N1:2
2025-04-01 06:16:29.669973 n3 Asked to run as follower by cluster controller partition_id=0 from_cluster_controller=N1:2
2025-04-01 06:16:29.670106 n3 Asked to run as follower by cluster controller partition_id=0 from_cluster_controller=N1:2
2025-04-01 06:16:30.591728 n2 Asked to run as leader by cluster controller. Obtaining required leader epoch partition_id=0 from_cluster_controller=N1:2
2025-04-01 06:16:30.692739 n1 Asked to run as follower by cluster controller partition_id=0 from_cluster_controller=N1:2
2025-04-01 06:16:30.692771 n1 Asked to run as follower by cluster controller partition_id=0 from_cluster_controller=N1:2
2025-04-01 06:16:32,843    jp [:isolated {"node3" #{"node2" "node1"}, "node2" #{"node3"}, "node1" #{"node3"}}]
2025-04-01 06:16:34.688522 n3 Cluster controller switching to leader mode
2025-04-01 06:16:37,950    jp :network-healed
2025-04-01 06:16:38.383505 n1 Asked to run as follower by cluster controller partition_id=0 from_cluster_controller=N3:1
2025-04-01 06:16:38.471915 n3 Asked to run as leader by cluster controller. Obtaining required leader epoch partition_id=0 from_cluster_controller=N3:1
2025-04-01 06:16:38.741338 n3 Cluster controller switching to follower mode, I think the leader is N1:2
2025-04-01 06:16:42,966    jp [:isolated {"node2" #{"node3" "node1"}, "node3" #{"node2"}, "node1" #{"node2"}}]
2025-04-01 06:16:44.543292 n2 Cluster controller switching to leader mode
2025-04-01 06:16:48,078    jp :network-healed
2025-04-01 06:16:48.635117 n2 Cluster controller switching to follower mode, I think the leader is N1:2
2025-04-01 06:16:48.788483 n1 Asked to run as leader by cluster controller. Obtaining required leader epoch partition_id=0 from_cluster_controller=N1:2
2025-04-01 06:16:51.624971 n2 Asked to run as follower by cluster controller partition_id=0 from_cluster_controller=N1:2
2025-04-01 06:16:51.625573 n2 Asked to run as follower by cluster controller partition_id=0 from_cluster_controller=N1:2
2025-04-01 06:16:53,093    jp [:isolated {"node1" #{"node2" "node3"}, "node2" #{"node1"}, "node3" #{"node1"}}]
2025-04-01 06:16:54.592819 n2 Cluster controller switching to leader mode
2025-04-01 06:16:57.723456 n3 Asked to run as leader by cluster controller. Obtaining required leader epoch partition_id=0 from_cluster_controller=N2:1
2025-04-01 06:16:58,201    jp :network-healed
2025-04-01 06:16:58.769618 n2 Asked to run as follower by cluster controller partition_id=0 from_cluster_controller=N1:2
2025-04-01 06:16:58.806032 n3 Asked to run as follower by cluster controller partition_id=0 from_cluster_controller=N1:2
2025-04-01 06:16:59.434169 n2 Asked to stop partition processor by cluster controller partition_id=0 from_cluster_controller=N1:2
2025-04-01 06:16:59.435252 n3 Asked to stop partition processor by cluster controller partition_id=0 from_cluster_controller=N1:2
2025-04-01 06:16:59.524692 n1 Asked to run as follower by cluster controller partition_id=0 from_cluster_controller=N1:2
2025-04-01 06:16:59.525494 n1 Asked to run as follower by cluster controller partition_id=0 from_cluster_controller=N1:2
2025-04-01 06:16:59.547027 n3 Starting new partition processor to run as Leader partition_id=0 from_cluster_controller=N1:2
2025-04-01 06:16:59.644281 n2 Starting new partition processor to run as Follower partition_id=0 from_cluster_controller=N2:1
2025-04-01 06:17:00.685807 n2 Cluster controller switching to follower mode, I think the leader is N1:2
2025-04-01 06:17:03,226    jp [:isolated {"node3" #{"node2" "node1"}, "node2" #{"node3"}, "node1" #{"node3"}}]
2025-04-01 06:17:04.740923 n3 Cluster controller switching to leader mode
2025-04-01 06:17:07.489429 n2 Asked to run as leader by cluster controller. Obtaining required leader epoch partition_id=0 from_cluster_controller=N1:2
2025-04-01 06:17:09,321    jp :network-healed
2025-04-01 06:17:09.844472 n3 Cluster controller switching to follower mode, I think the leader is N2:1
2025-04-01 06:17:11.529246 n3 Asked to stop partition processor by cluster controller partition_id=0 from_cluster_controller=N1:2
2025-04-01 06:17:11.618077 n3 Starting new partition processor to run as Follower partition_id=0 from_cluster_controller=N1:2
2025-04-01 06:17:11.618117 n3 Asked to run as follower by cluster controller partition_id=0 from_cluster_controller=N1:2

pcholakov avatar Apr 01 '25 13:04 pcholakov

The full logs are attached as an artifact to the run, but I'm copying them here for safekeeping.

jepsen-store.zip

This makes the raw logs a lot easier to read: sed -f node-replacements.sed -i n?_*/restate/restate.log jepsen.log history.edn, where node-replacements.sed contains:

s/ec2-44-211-255-253.compute-1.amazonaws.com/node1/g
s/ec2-54-164-74-146.compute-1.amazonaws.com/node2/g
s/ec2-107-23-252-28.compute-1.amazonaws.com/node3/g

pcholakov avatar Apr 01 '25 13:04 pcholakov

Summary:

  • After the partition healed, a handful of client calls completed, but then all calls started timing out after 2s. However, n2, which was the active partition processor for the sole partition at the time, was logging that it was successfully processing requests.
  • The average request latency jumped to over 3s, which is higher than the configured client timeout of 2s. This caused the final wave of timeouts; the calls were, however, still completing successfully (see the sketch after this list).
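A minimal sketch of that effect (illustrative only, assuming tokio with the "full" feature; the 2s/3s numbers are from the run above): the client gives up at its 2s timeout while the server-side request keeps running and completes, so the test records timeouts even though nothing was lost:

use std::time::Duration;
use tokio::sync::oneshot;
use tokio::time::{sleep, timeout};

#[tokio::main]
async fn main() {
    let (tx, rx) = oneshot::channel();

    // Simulated partition processor: the request takes ~3s to complete,
    // longer than the client's 2s timeout.
    let server = tokio::spawn(async move {
        sleep(Duration::from_secs(3)).await;
        let _ = tx.send("committed");
        println!("server: request processed successfully");
    });

    // Simulated Jepsen client: records a timeout after 2s, even though the
    // operation is still in flight and will complete on the server.
    match timeout(Duration::from_secs(2), rx).await {
        Ok(Ok(result)) => println!("client: got {result}"),
        Ok(Err(_)) => println!("client: server dropped the request"),
        Err(_) => println!("client: timed out after 2s"),
    }

    server.await.unwrap();
}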

Other observations:

  • After recovery, the isolated node N3 thought it was the leading cluster controller, presumably because it had an outdated view of dead hosts but a fresh view of the metadata (it tried to update the partition table as soon as it was able, which forced a refresh)
  • Both N3 and N1 sent control-processors commands after the partition healed, but this did not lead to anything bad

pcholakov avatar Apr 02 '25 07:04 pcholakov

Same root cause as the failure in https://github.com/restatedev/jepsen/actions/runs/14205571402 - the timeouts were legitimate, caused by a partition processor slow-down followed by a leader change immediately after the partition healed. In that case, it seems that the partition leadership change also contributed to the slower response times. There is no correctness issue underlying either of these failed runs, however.

pcholakov avatar Apr 02 '25 08:04 pcholakov

Another recurrence: https://github.com/restatedev/jepsen/actions/runs/14287188130

I've increased the heal timeout to 60s to see if things actually converge.

In the meantime, the following changes do make a difference to convergence:

  • lowering the metadata sync interval to something close to the heartbeat interval (1s) consistently fixes the issue
  • adding a metadata staleness check to the cluster controller leadership decision (e.g. last successful refresh > 2x heartbeat interval); see the sketch below
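Roughly what the second idea could look like (a sketch under assumed names, not Restate's actual API): a node refuses to act as cluster controller leader while its metadata view is older than twice the heartbeat interval:

use std::time::{Duration, Instant};

// Illustrative only: tracks when this node last successfully refreshed its
// cluster metadata and gates leadership decisions on that freshness.
struct MetadataFreshness {
    last_successful_refresh: Instant,
    heartbeat_interval: Duration,
}

impl MetadataFreshness {
    // Stale if the last successful refresh is older than 2x the heartbeat
    // interval, as suggested above.
    fn is_stale(&self, now: Instant) -> bool {
        now.duration_since(self.last_successful_refresh) > 2 * self.heartbeat_interval
    }

    // Only step up as cluster controller leader on a fresh metadata view;
    // otherwise keep following and trigger a sync first.
    fn may_claim_leadership(&self, now: Instant) -> bool {
        !self.is_stale(now)
    }
}

fn main() {
    let freshness = MetadataFreshness {
        // A node that just came out of a partition with a 5s-old metadata
        // view (heartbeat interval 1s) would not act as leader until it
        // re-syncs.
        last_successful_refresh: Instant::now() - Duration::from_secs(5),
        heartbeat_interval: Duration::from_secs(1),
    };
    assert!(!freshness.may_claim_leadership(Instant::now()));
}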

pcholakov avatar Apr 07 '25 10:04 pcholakov

I'm confident that with the latest gossip changes this is now behind us. I've fixed the false negatives in the nightly tests and will continue to monitor the runs, but here's an example of the improvements in recovery with the latest main builds:

[Image: recovery graph with the latest main build]

I will lower the recovery wait period back down to 10-20s, which will substantially reduce the test runtime.

pcholakov avatar Jun 16 '25 20:06 pcholakov