
Can zk support high-frequency operations, or will zk become a bottleneck?

Open long1208 opened this issue 1 year ago • 2 comments

long1208 avatar Aug 29 '22 03:08 long1208

ZK operations are not very frequent. One is CSS workers updating their status; the other is the create/delete operations when a shuffle is registered or unregistered. Currently we run 7 ZK nodes for a CSS cluster that has hundreds of workers. The pressure on ZK (memory) comes mainly from the large number of ZK watches, which we use to track shuffleId lifetime so that data on CSS workers can be cleaned up. We are working on optimizing this.

bdyx123 avatar Aug 29 '22 04:08 bdyx123

@bdyx123, have you seen the following exception in Spark application logs? It seems the CSS worker deleted the shuffle node before the update was attempted. Is this behavior normal?

ERROR ZookeeperExternalShuffleMeta: Update zk shuffle node spark-00000003119c3rhae9v-117 failed.
org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /css/mycss/shuffles/spark-00000003119c3rhae9v-117

a140262 avatar Oct 30 '22 23:10 a140262
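
For readers following the thread, here is a minimal sketch of how an update can hit a NoNode error when the shuffle node is deleted first. It is only an in-memory model, not the CSS implementation and not a real ZooKeeper client: `FakeZk`, `NoNodeError`, `update_shuffle_status`, and the application id in the path are all hypothetical names made up for illustration. It does not settle whether the behavior reported above is expected in CSS.

```python
class NoNodeError(Exception):
    """Hypothetical stand-in for ZooKeeper's NoNodeException."""


class FakeZk:
    """Tiny in-memory model of a znode tree (assumption, not a real client)."""

    def __init__(self):
        self.nodes = {}

    def create(self, path, data):
        self.nodes[path] = data

    def delete(self, path):
        if path not in self.nodes:
            raise NoNodeError(path)
        del self.nodes[path]

    def set_data(self, path, data):
        # Like a real setData call, updating a missing node is an error.
        if path not in self.nodes:
            raise NoNodeError(path)
        self.nodes[path] = data


def update_shuffle_status(zk, path, status):
    """Try to update a shuffle node; report False if it is already gone."""
    try:
        zk.set_data(path, status)
        return True
    except NoNodeError:
        # The node was removed (e.g. the shuffle was unregistered) first.
        return False


zk = FakeZk()
path = "/css/mycss/shuffles/example-app-0"  # hypothetical shuffle node
zk.create(path, b"registered")

first = update_shuffle_status(zk, path, b"running")    # node still exists
zk.delete(path)                                        # cleanup wins the race
second = update_shuffle_status(zk, path, b"finished")  # node already deleted
print(first, second)
```

In this model the second update fails only because the delete ran before it; whether that ordering is harmless depends on whether the shuffle was really finished when the node was removed.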