venice
[BUG] ZK replica status update interrupted causes error replicas
Willingness to contribute
No. I cannot contribute a bug fix at this time.
Venice version
0.4.139
System information
- OS Platform and Distribution (e.g., Linux Ubuntu 20.0): Mariner 5.15.111.1-1.cm2
- JDK version: 17
Describe the problem
Partitions move to the error state when a ZooKeeper (ZK) update is interrupted unexpectedly. We need a strategy that makes this more resilient; temporary disconnects should be tolerable. Exception trace below:
Tracking information
2023/07/10 17:49:59.569 ERROR [IngestionNotificationDispatcher for [ Topic: venice_system_store_participant_store_cluster_cert-1_v196 ] ] [venice-shared-consumer-for-kafka.venice.kafka.ei-ltx1.atd.stg.linkedin.com:16637-t1] [venice-server-war] [] Error reporting status to notifier class com.linkedin.davinci.notifier.PushMonitorNotifier
com.linkedin.venice.exceptions.ZkDataAccessException: Can not do operation:compare and update on path: /cert-1/OfflinePushes/venice_system_store_participant_store_cluster_cert-1_v196/2 after retry:3 times
at com.linkedin.venice.utils.HelixUtils.compareAndUpdate(HelixUtils.java:235) ~[com.linkedin.venice.venice-common-0.4.103.jar:?]
at com.linkedin.venice.utils.HelixUtils.compareAndUpdate(HelixUtils.java:220) ~[com.linkedin.venice.venice-common-0.4.103.jar:?]
at com.linkedin.venice.helix.VeniceOfflinePushMonitorAccessor.compareAndUpdateReplicaStatus(VeniceOfflinePushMonitorAccessor.java:290) ~[com.linkedin.venice.venice-common-0.4.103.jar:?]
at com.linkedin.venice.helix.VeniceOfflinePushMonitorAccessor.updateReplicaStatus(VeniceOfflinePushMonitorAccessor.java:250) ~[com.linkedin.venice.venice-common-0.4.103.jar:?]
at com.linkedin.davinci.notifier.PushMonitorNotifier.started(PushMonitorNotifier.java:48) ~[com.linkedin.venice.da-vinci-client-0.4.103.jar:?]
at com.linkedin.davinci.notifier.VeniceNotifier.started(VeniceNotifier.java:17) ~[com.linkedin.venice.da-vinci-client-0.4.103.jar:?]
at com.linkedin.davinci.kafka.consumer.IngestionNotificationDispatcher.lambda$reportStarted$1(IngestionNotificationDispatcher.java:108) ~[com.linkedin.venice.da-vinci-client-0.4.103.jar:?]
at com.linkedin.davinci.kafka.consumer.IngestionNotificationDispatcher.report(IngestionNotificationDispatcher.java:73) ~[com.linkedin.venice.da-vinci-client-0.4.103.jar:?]
at com.linkedin.davinci.kafka.consumer.IngestionNotificationDispatcher.report(IngestionNotificationDispatcher.java:96) ~[com.linkedin.venice.da-vinci-client-0.4.103.jar:?]
at com.linkedin.davinci.kafka.consumer.IngestionNotificationDispatcher.reportStarted(IngestionNotificationDispatcher.java:108) ~[com.linkedin.venice.da-vinci-client-0.4.103.jar:?]
at com.linkedin.davinci.kafka.consumer.StatusReportAdapter.lambda$reportStarted$1(StatusReportAdapter.java:84) ~[com.linkedin.venice.da-vinci-client-0.4.103.jar:?]
at com.linkedin.davinci.kafka.consumer.StatusReportAdapter$PartitionReportStatus.maybeReportStatus(StatusReportAdapter.java:234) [com.linkedin.venice.da-vinci-client-0.4.103.jar:?]
at com.linkedin.davinci.kafka.consumer.StatusReportAdapter$PartitionReportStatus.recordSubPartitionStatus(StatusReportAdapter.java:220) [com.linkedin.venice.da-vinci-client-0.4.103.jar:?]
at com.linkedin.davinci.kafka.consumer.StatusReportAdapter.report(StatusReportAdapter.java:141) [com.linkedin.venice.da-vinci-client-0.4.103.jar:?]
at com.linkedin.davinci.kafka.consumer.StatusReportAdapter.report(StatusReportAdapter.java:132) [com.linkedin.venice.da-vinci-client-0.4.103.jar:?]
at com.linkedin.davinci.kafka.consumer.StatusReportAdapter.reportStarted(StatusReportAdapter.java:84) [com.linkedin.venice.da-vinci-client-0.4.103.jar:?]
at com.linkedin.davinci.kafka.consumer.StoreIngestionTask.processStartOfPush(StoreIngestionTask.java:2290) [com.linkedin.venice.da-vinci-client-0.4.103.jar:?]
at com.linkedin.davinci.kafka.consumer.StoreIngestionTask.produceToStoreBufferServiceOrKafka(StoreIngestionTask.java:991) [com.linkedin.venice.da-vinci-client-0.4.103.jar:?]
at com.linkedin.davinci.kafka.consumer.StorePartitionDataReceiver.write(StorePartitionDataReceiver.java:75) [com.linkedin.venice.da-vinci-client-0.4.103.jar:?]
at com.linkedin.davinci.kafka.consumer.StorePartitionDataReceiver.write(StorePartitionDataReceiver.java:17) [com.linkedin.venice.da-vinci-client-0.4.103.jar:?]
at com.linkedin.davinci.kafka.consumer.ConsumptionTask.run(ConsumptionTask.java:143) [com.linkedin.venice.da-vinci-client-0.4.103.jar:?]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539) [?:?]
at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
at java.lang.Thread.run(Thread.java:833) [?:?]
2023/07/10 17:49:59.569 WARN [ZKHelixManager] [venice-shared-consumer-for-kafka.venice.kafka.ei-ltx1.atd.stg.linkedin.com:16637-t1] [venice-server-war] [] zkClient to zk-ltx1-venice.stg.linkedin.com:2622/venice is not connected, wait for 10000ms.
2023/07/10 17:49:59.569 ERROR [IngestionNotificationDispatcher for [ Topic: venice_system_store_participant_store_cluster_cert-1_v196 ] ] [venice-shared-consumer-for-kafka.venice.kafka.ei-ltx1.atd.stg.linkedin.com:16637-t1] [venice-server-war] [] Error reporting status to notifier class com.linkedin.davinci.notifier.PartitionPushStatusNotifier
org.apache.helix.zookeeper.zkclient.exception.ZkInterruptedException: java.lang.InterruptedException
at org.apache.helix.zookeeper.zkclient.ZkClient.acquireEventLock(ZkClient.java:1942) ~[org.apache.helix.helix-common-1.0.4.jar:?]
at org.apache.helix.zookeeper.zkclient.ZkClient.waitForKeeperState(ZkClient.java:1919) ~[org.apache.helix.helix-common-1.0.4.jar:?]
at org.apache.helix.zookeeper.zkclient.ZkClient.waitUntilConnected(ZkClient.java:1910) ~[org.apache.helix.helix-common-1.0.4.jar:?]
at org.apache.helix.manager.zk.ZKHelixManager.checkConnected(ZKHelixManager.java:411) ~[org.apache.helix.helix-core-1.0.4.jar:1.0.4]
at org.apache.helix.manager.zk.ZKHelixManager.getHelixDataAccessor(ZKHelixManager.java:681) ~[org.apache.helix.helix-core-1.0.4.jar:1.0.4]
at org.apache.helix.customizedstate.CustomizedStateProvider.updateCustomizedState(CustomizedStateProvider.java:67) ~[org.apache.helix.helix-core-1.0.4.jar:1.0.4]
at org.apache.helix.customizedstate.CustomizedStateProvider.updateCustomizedState(CustomizedStateProvider.java:58) ~[org.apache.helix.helix-core-1.0.4.jar:1.0.4]
at com.linkedin.venice.helix.HelixPartitionStateAccessor.updateReplicaStatus(HelixPartitionStateAccessor.java:34) ~[com.linkedin.venice.venice-common-0.4.103.jar:?]
at com.linkedin.venice.helix.HelixPartitionStatusAccessor.updateReplicaStatus(HelixPartitionStatusAccessor.java:31) ~[com.linkedin.venice.venice-common-0.4.103.jar:?]
at com.linkedin.davinci.notifier.PartitionPushStatusNotifier.started(PartitionPushStatusNotifier.java:20) ~[com.linkedin.venice.da-vinci-client-0.4.103.jar:?]
at com.linkedin.davinci.notifier.VeniceNotifier.started(VeniceNotifier.java:17) ~[com.linkedin.venice.da-vinci-client-0.4.103.jar:?]
at com.linkedin.davinci.kafka.consumer.IngestionNotificationDispatcher.lambda$reportStarted$1(IngestionNotificationDispatcher.java:108) ~[com.linkedin.venice.da-vinci-client-0.4.103.jar:?]
at com.linkedin.davinci.kafka.consumer.IngestionNotificationDispatcher.report(IngestionNotificationDispatcher.java:73) ~[com.linkedin.venice.da-vinci-client-0.4.103.jar:?]
at com.linkedin.davinci.kafka.consumer.IngestionNotificationDispatcher.report(IngestionNotificationDispatcher.java:96) ~[com.linkedin.venice.da-vinci-client-0.4.103.jar:?]
at com.linkedin.davinci.kafka.consumer.IngestionNotificationDispatcher.reportStarted(IngestionNotificationDispatcher.java:108) ~[com.linkedin.venice.da-vinci-client-0.4.103.jar:?]
at com.linkedin.davinci.kafka.consumer.StatusReportAdapter.lambda$reportStarted$1(StatusReportAdapter.java:84) ~[com.linkedin.venice.da-vinci-client-0.4.103.jar:?]
at com.linkedin.davinci.kafka.consumer.StatusReportAdapter$PartitionReportStatus.maybeReportStatus(StatusReportAdapter.java:234) [com.linkedin.venice.da-vinci-client-0.4.103.jar:?]
at com.linkedin.davinci.kafka.consumer.StatusReportAdapter$PartitionReportStatus.recordSubPartitionStatus(StatusReportAdapter.java:220) [com.linkedin.venice.da-vinci-client-0.4.103.jar:?]
at com.linkedin.davinci.kafka.consumer.StatusReportAdapter.report(StatusReportAdapter.java:141) [com.linkedin.venice.da-vinci-client-0.4.103.jar:?]
at com.linkedin.davinci.kafka.consumer.StatusReportAdapter.report(StatusReportAdapter.java:132) [com.linkedin.venice.da-vinci-client-0.4.103.jar:?]
at com.linkedin.davinci.kafka.consumer.StatusReportAdapter.reportStarted(StatusReportAdapter.java:84) [com.linkedin.venice.da-vinci-client-0.4.103.jar:?]
at com.linkedin.davinci.kafka.consumer.StoreIngestionTask.processStartOfPush(StoreIngestionTask.java:2290) [com.linkedin.venice.da-vinci-client-0.4.103.jar:?]
at com.linkedin.davinci.kafka.consumer.StoreIngestionTask.produceToStoreBufferServiceOrKafka(StoreIngestionTask.java:991) [com.linkedin.venice.da-vinci-client-0.4.103.jar:?]
at com.linkedin.davinci.kafka.consumer.StorePartitionDataReceiver.write(StorePartitionDataReceiver.java:75) [com.linkedin.venice.da-vinci-client-0.4.103.jar:?]
at com.linkedin.davinci.kafka.consumer.StorePartitionDataReceiver.write(StorePartitionDataReceiver.java:17) [com.linkedin.venice.da-vinci-client-0.4.103.jar:?]
at com.linkedin.davinci.kafka.consumer.ConsumptionTask.run(ConsumptionTask.java:143) [com.linkedin.venice.da-vinci-client-0.4.103.jar:?]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539) [?:?]
at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
at java.lang.Thread.run(Thread.java:833) [?:?]
Caused by: java.lang.InterruptedException
at java.util.concurrent.locks.ReentrantLock$Sync.lockInterruptibly(ReentrantLock.java:159) ~[?:?]
at java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:372) ~[?:?]
at org.apache.helix.zookeeper.zkclient.ZkClient.acquireEventLock(ZkClient.java:1940) ~[org.apache.helix.helix-common-1.0.4.jar:?]
... 30 more
Code to reproduce bug
No response
What component(s) does this bug affect?
- [ ] Controller: This is the control-plane for Venice. Used to create/update/query stores and their metadata.
- [ ] Router: This is the stateless query-routing layer for serving read requests.
- [X] Server: This is the component that persists all the store data.
- [ ] VenicePushJob: This is the component that pushes derived data from Hadoop to Venice backend.
- [ ] VenicePulsarSink: This is a Sink connector for Apache Pulsar that pushes data from Pulsar into Venice.
- [ ] Thin Client: This is a stateless client users use to query Venice Router for reading store data.
- [ ] Fast Client: This is a stateful client users use to query Venice Server for reading store data.
- [ ] Da Vinci Client: This is an embedded, stateful client that materializes store data locally.
- [ ] Alpini: This is the framework that fast-client and routers use to route requests to the storage nodes that have the data.
- [ ] Samza: This is the library users use to make nearline updates to store data.
- [ ] Admin Tool: This is the stand-alone client used for ad-hoc operations on Venice.
- [ ] Scripts: These are the various ops scripts in the repo.
We may want to add exponential backoff here to account for any network or ZooKeeper issues.
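To make the suggestion concrete, here is a minimal sketch of what an exponential-backoff wrapper around a status update could look like. It is not the Venice implementation: the class `BackoffRetry`, its methods, and all parameter names are hypothetical, and the wrapped operation is a plain `Callable` standing in for whatever ZK call is being retried. The sketch deliberately does not retry on `InterruptedException`; whether the interrupt seen in the second trace should be retried depends on why the thread was interrupted, which this issue does not establish.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical helper, not part of Venice or Helix. */
public final class BackoffRetry {
  private BackoffRetry() {}

  /** Upper bound of the delay before retry number `attempt` (0-based): base * 2^attempt, capped. */
  public static long delayMs(int attempt, long baseDelayMs, long maxDelayMs) {
    // Cap the shift so the multiplication cannot overflow for reasonable base delays.
    long delay = baseDelayMs << Math.min(attempt, 20);
    return Math.min(delay, maxDelayMs);
  }

  /**
   * Runs `operation` up to `maxAttempts` times, sleeping with exponential
   * backoff plus jitter between failed attempts.
   */
  public static <T> T callWithBackoff(
      Callable<T> operation, int maxAttempts, long baseDelayMs, long maxDelayMs) throws Exception {
    if (maxAttempts < 1) {
      throw new IllegalArgumentException("maxAttempts must be >= 1");
    }
    Exception last = null;
    for (int attempt = 0; attempt < maxAttempts; attempt++) {
      try {
        return operation.call();
      } catch (InterruptedException e) {
        // Assumption: an interrupt means "stop", so restore the flag and give up.
        Thread.currentThread().interrupt();
        throw e;
      } catch (Exception e) {
        last = e; // remember the failure and retry
      }
      if (attempt < maxAttempts - 1) {
        long cap = delayMs(attempt, baseDelayMs, maxDelayMs);
        // Jitter: sleep somewhere in [cap/2, cap] so replicas do not retry in lockstep.
        long sleepMs = ThreadLocalRandom.current().nextLong(cap / 2, cap + 1);
        Thread.sleep(sleepMs);
      }
    }
    throw last;
  }
}
```

A caller would wrap the update in a lambda, for example `BackoffRetry.callWithBackoff(() -> doUpdate(), 5, 100, 10_000)`, where `doUpdate()` is a placeholder for the real operation. The maximum delay should be chosen with the ZK session timeout in mind, since retrying past session expiry would not help.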
Opened a PR for this issue here.