
Frequent ZK session ID mismatches after GC leading to Helix messages treated as no-op

Open · dang-stripe opened this issue 1 year ago · 1 comment

Describe the bug

We are using Helix 1.3.1 from the Pinot 1.2 release.

We've observed a bug in our Pinot clusters where an expensive operation like rebalancing or heavy ingestion causes some servers to hit GC pauses that are longer than the ZK session timeout.
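For context on how we spot pauses of this magnitude: a common technique (our own generic sketch, not Pinot or Helix code) is a watchdog that sleeps for short intervals and measures wall-clock overshoot; a stop-the-world pause shows up as elapsed time far beyond the interval. The `zkSessionTimeoutMs` value below is an assumed default, not our cluster's actual config.

```java
public class PauseWatchdog {
    // Sleep in short intervals and record the worst wall-clock overshoot.
    // A stop-the-world GC pause shows up as elapsed time far beyond the interval.
    static long maxPauseMs(long intervalMs, int iterations) throws InterruptedException {
        long maxDriftMs = 0;
        for (int i = 0; i < iterations; i++) {
            long start = System.nanoTime();
            Thread.sleep(intervalMs);
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            maxDriftMs = Math.max(maxDriftMs, elapsedMs - intervalMs);
        }
        return maxDriftMs;
    }

    public static void main(String[] args) throws InterruptedException {
        long zkSessionTimeoutMs = 30_000; // assumed ZK session timeout; real value is cluster config
        long drift = maxPauseMs(50, 5);
        System.out.println(drift > zkSessionTimeoutMs
                ? "WARN: pause exceeds ZK session timeout"
                : "OK: worst observed drift " + drift + "ms");
    }
}
```

Run alongside the server process; when the reported drift approaches the session timeout, an expiry like the one below is expected.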

When this happens, the session expires and the client reconnects. However, we then see continuous session ID mismatches in the logs below until we manually restart the server. Note that the expected and target session IDs in these warning logs also change over time.

[2024-10-22 17:12:13.511581] WARN [HelixTaskExecutor] [ZkClient-EventThread-110-server1:2181,server2:2181,server3:2181:110] SessionId does NOT match. expected sessionId: 2003518003bd305, tgtSessionId in message: 5002ac58f6ad323, messageId: 39f20bfc-a607-44ca-8354-5e3736ebfd68

While the session ID mismatches are happening, Helix messages are treated as no-ops, causing the Pinot server to get into a bad state. We've traced that back to this code: https://github.com/apache/helix/blob/3055f26aa4f040f549ca91581fc6b2ff6b275f33/helix-core/src/main/java/org/apache/helix/messaging/handling/HelixTaskExecutor.java#L1089-L1106

That code attempts to remedy the mismatch by having the participant sync its session with the controller, but we only see 1-2 instances of that sync in our logs; it does not seem to trigger as often as we'd expect.

[2024-10-22 02:18:59.913004] INFO [HelixTaskExecutor] [ZkClient-EventThread-111-server1:2181,server2:2181,server3:2181:111] Participant Server_st-noir-test-uswest2b-1_8098 syncs session with controller
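For readers following along, the behavior we're describing can be sketched roughly as follows. This is our own illustration of the pattern, not the actual `HelixTaskExecutor` source; the class, field, and method names are made up.

```java
import java.util.ArrayList;
import java.util.List;

public class SessionCheckSketch {
    static final class Message {
        final String id;
        final String tgtSessionId;
        Message(String id, String tgtSessionId) { this.id = id; this.tgtSessionId = tgtSessionId; }
    }

    static boolean syncRequested = false;

    // Messages whose tgtSessionId does not match the participant's live session
    // are dropped as no-ops; seeing a mismatch is what should trigger a session
    // sync with the controller.
    static List<Message> handleMessages(List<Message> messages, String currentSessionId) {
        List<Message> handled = new ArrayList<>();
        boolean sawMismatch = false;
        for (Message m : messages) {
            if (!currentSessionId.equals(m.tgtSessionId)) {
                System.out.println("SessionId does NOT match. expected sessionId: " + currentSessionId
                        + ", tgtSessionId in message: " + m.tgtSessionId + ", messageId: " + m.id);
                sawMismatch = true;
                continue; // treated as a no-op; the message is never executed
            }
            handled.add(m);
        }
        if (sawMismatch && !syncRequested) {
            syncRequested = true; // in our logs this path fires far less often than the mismatches
            System.out.println("Participant syncs session with controller");
        }
        return handled;
    }
}
```

The bad state arises when mismatched messages keep being dropped but the sync path does not fire often enough to repair the session.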

To Reproduce

  1. Perform an expensive rebalance or heavy ingestion on a Pinot server so that a GC pause exceeds the ZK session timeout
  2. Observe logs for session ID mismatches
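If you want to provoke long GC pauses without a full rebalance or ingestion load, a crude harness like the one below (a hypothetical test helper, not how we actually reproduced it) can stand in: retain enough heap that full GCs have a large live set, and run the JVM with a small `-Xmx` so those collections become long.

```java
import java.util.ArrayList;
import java.util.List;

public class GcPressure {
    // Retain `chunks` 1 MiB arrays so the live set grows; with a small -Xmx
    // (e.g. -Xmx512m) and a large chunk count, full-GC pauses get long.
    static int fillHeap(int chunks) {
        List<byte[]> retained = new ArrayList<>();
        for (int i = 0; i < chunks; i++) {
            retained.add(new byte[1 << 20]);
        }
        return retained.size();
    }

    public static void main(String[] args) {
        int chunks = args.length > 0 ? Integer.parseInt(args[0]) : 64;
        System.out.println("retained MiB: " + fillHeap(chunks));
    }
}
```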

Expected behavior

The participant is able to successfully sync its session so the session IDs no longer mismatch and the server can heal itself.

dang-stripe avatar Oct 22 '24 21:10 dang-stripe

Hi @dang-stripe, did you see this ERROR message in the server logs? It got printed every minute.

… ERROR [ZKHelixManager] [ZkClient-EventThread-172-pinot-zookeeper-headless:2181] fail to connect zkserver: pinot-zookeeper-headless:2181 in 60000ms. expiredSessionId: 3002f8e84690016, clusterName: pinot

We saw a similar issue and are still root-causing it. After long GC pauses, the server was removed from the LiveInstance list since its ZK session expired. Although the server started a new session per its own logs, it could not register that session with Helix. From the error logs above, the server kept trying to reconnect every minute, but every attempt failed somehow.

klsince avatar Oct 23 '24 01:10 klsince

@klsince We do not see that error in our logs. After looking deeper, I do see that the Helix ZK client is struggling to maintain a persistent connection to ZK.

[2024-10-23 21:28:43.769971] INFO [ZKHelixManager] [ZkClient-EventThread-111-server1:2181,server2:2181,server3:2181:111] KeeperState: SyncConnected, instance: Server_st-noir-test-uswest2b-1_8098, type: PARTICIPANT
[2024-10-23 21:29:32.011416] WARN [ZKHelixManager] [ZkClient-EventThread-111-server1:2181,server2:2181,server3:2181:111] KeeperState:Disconnected, SessionId: 10071459e8408e9, instance: Server_st-noir-test-uswest2b-1_8098, type: PARTICIPANT
[2024-10-23 21:29:32.157984] INFO [ZKHelixManager] [ZkClient-EventThread-111-server1:2181,server2:2181,server3:2181:111] KeeperState: SyncConnected, instance: Server_st-noir-test-uswest2b-1_8098, type: PARTICIPANT
[2024-10-23 21:30:11.646452] WARN [ZKHelixManager] [ZkClient-EventThread-111-server1:2181,server2:2181,server3:2181:111] KeeperState:Disconnected, SessionId: 10071459e8408e9, instance: Server_st-noir-test-uswest2b-1_8098, type: PARTICIPANT
[2024-10-23 21:30:12.001107] WARN [ZKHelixManager] [ZkClient-EventThread-111-server1:2181,server2:2181,server3:2181:111] KeeperState:Expired, SessionId: 10071459e8408e9, instance: Server_st-noir-test-uswest2b-1_8098, type: PARTICIPANT
[2024-10-23 21:30:12.022364] INFO [ZKHelixManager] [ZkClient-EventThread-111-server1:2181,server2:2181,server3:2181:111] KeeperState: SyncConnected, instance: Server_st-noir-test-uswest2b-1_8098, type: PARTICIPANT
[2024-10-23 21:30:12.024330] INFO [ZKHelixManager] [ZkClient-EventThread-111-server1:2181,server2:2181,server3:2181:111] Handling new session, session id: 10071459e840915, instance: Server_st-noir-test-uswest2b-1_8098, instanceTye: PARTICIPANT, cluster: rad-noir
[2024-10-23 21:30:12.024355] INFO [ZKHelixManager] [ZkClient-EventThread-111-server1:2181,server2:2181,server3:2181:111] Handle new session, instance: Server_st-noir-test-uswest2b-1_8098, type: PARTICIPANT, session id: 10071459e840915.
[2024-10-23 21:30:12.252857] WARN [ZKHelixManager] [HelixTaskExecutor-message_handle_STATE_TRANSITION:1688] zkClient to server1:2181,server2:2181,server3:2181 is not connected, wait for 180000ms.
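The last WARN line above shows the executor blocking until the client reconnects. That wait is roughly a poll-until-deadline loop like the sketch below (our own names; `isConnected` stands in for the real ZkClient state check, and the timeout mirrors the 180000ms in the log).

```java
import java.util.function.BooleanSupplier;

public class WaitForConnected {
    // Poll a connection check until it succeeds or the deadline passes.
    static boolean waitForConnected(BooleanSupplier isConnected, long timeoutMs, long pollMs)
            throws InterruptedException {
        long deadlineNanos = System.nanoTime() + timeoutMs * 1_000_000;
        while (System.nanoTime() < deadlineNanos) {
            if (isConnected.getAsBoolean()) {
                return true; // reconnected within the window
            }
            Thread.sleep(pollMs);
        }
        return false; // still disconnected; the caller has to escalate or retry
    }
}
```

In our case the client does reconnect, but with a new session ID, which is where the mismatches come from.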

When I look at the ZK logs for one of the session IDs, it seems like ZK tries to re-validate the session, but the client terminates its connection before that can complete.

[2024-10-23 21:30:12.021291] INFO  LeaderSessionTracker:104 - Committing global session 0x10071459e840915
[2024-10-23 21:30:12.021328] INFO  LearnerSessionTracker:116 - Committing global session 0x10071459e840915
[2024-10-23 21:30:12.021495] INFO  LearnerSessionTracker:116 - Committing global session 0x10071459e840915
[2024-10-23 21:30:12.021550] INFO  LearnerSessionTracker:116 - Committing global session 0x10071459e840915
[2024-10-23 21:30:12.021704] INFO  LearnerSessionTracker:116 - Committing global session 0x10071459e840915
[2024-10-23 21:30:36.875070] WARN  NIOServerCnxn:364 - Unexpected exception
[2024-10-23 21:30:36.875101] EndOfStreamException: Unable to read additional data from client, it probably closed the socket: address = /10.76.213.167:33230, session = 0x10071459e840915
[2024-10-23 21:30:36.875113] 	at org.apache.zookeeper.server.NIOServerCnxn.handleFailedRead(NIOServerCnxn.java:163)
[2024-10-23 21:30:36.875126] 	at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:326)
[2024-10-23 21:30:36.875136] 	at org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:522)
	... (remaining stack frames truncated)
[2024-10-23 21:30:37.097688] INFO  Learner:137 - Revalidating client: 0x10071459e840915
[2024-10-23 21:31:23.560963] WARN  NIOServerCnxn:364 - Unexpected exception
[2024-10-23 21:31:23.561024] EndOfStreamException: Unable to read additional data from client, it probably closed the socket: address = /10.76.213.167:56312, session = 0x10071459e840915
[2024-10-23 21:31:23.561041] 	at org.apache.zookeeper.server.NIOServerCnxn.handleFailedRead(NIOServerCnxn.java:163)
[2024-10-23 21:31:23.561049] 	at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:326)
[2024-10-23 21:31:23.561059] 	at org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:522)
	... (remaining stack frames truncated)
[2024-10-23 21:32:12.194924] WARN  NIOServerCnxn:364 - Unexpected exception
[2024-10-23 21:32:12.194955] EndOfStreamException: Unable to read additional data from client, it probably closed the socket: address = /10.76.213.167:35592, session = 0x10071459e840915
[2024-10-23 21:32:12.194967] 	at org.apache.zookeeper.server.NIOServerCnxn.handleFailedRead(NIOServerCnxn.java:163)
[2024-10-23 21:32:12.194975] 	at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:326)
[2024-10-23 21:32:12.194986] 	at org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:522)
	... (remaining stack frames truncated)
[2024-10-23 21:32:12.444665] INFO  Learner:137 - Revalidating client: 0x10071459e840915

dang-stripe avatar Oct 24 '24 19:10 dang-stripe

@klsince FYI, I paired with Jackie and we narrowed it down to high GC on the server causing the ZK disconnects. I filed https://github.com/apache/pinot/issues/14301 as a follow-up. Going to close this out.

dang-stripe avatar Oct 24 '24 22:10 dang-stripe