pulsar
pulsar copied to clipboard
[Bug][meta] Broker enter the loop to recreate zkSession failed, cause broker stuck
Search before asking
- [X] I searched in the issues and found nothing similar.
Read release policy
- [X] I understand that unsupported versions don't get bug fixes. I will attempt to reproduce the issue on a supported version of Pulsar client and Pulsar broker.
Version
zookeeper: 3.6.3 broker: 3.0.4 bookkeeper: 4.16.4
Minimal reproduce step
continue to do perf-produce to a 5-nodes cluster, with several broker restart operation.
What did you expect to see?
broker should reCreate zkSession success and not stuck.
What did you see instead?
Firstly, everything is ok. After running for some time, notice one of the five broker have no throughput, but broker process is alive.
Then it is found in log, the broker always retry recreate zkSession frequently but failed.
15:55:28.066 [ZKC-connect-executor-0-SendThread(ip3:2181)] INFO org.apache.zookeeper.ClientCnxn - Session establishment complete on server ip3/ip3:2181, session id = 0x3016bf7f9583a32, negotiated timeout = 30000
15:55:28.066 [ZKC-connect-executor-0-EventThread] INFO org.apache.bookkeeper.zookeeper.ZooKeeperWatcherBase - ZooKeeper client is connected now.
15:55:28.066 [metadata-store-38-1] INFO org.apache.pulsar.metadata.impl.ZKSessionWatcher - Got ZK session watch event: WatchedEvent state:SyncConnected type:None path:null zxid: -1
15:55:28.066 [metadata-store-38-1] INFO org.apache.pulsar.metadata.impl.ZKSessionWatcher - ZooKeeper client reconnection with server quorum. Current status: SessionLost
15:55:28.066 [metadata-store-38-1] INFO org.apache.pulsar.broker.PulsarService - Received metadata service session event: Reconnected
15:55:28.066 [ZKC-connect-executor-0-EventThread] ERROR org.apache.pulsar.metadata.impl.ZKMetadataStore - Failed to recreate persistent watch on ZooKeeper: CONNECTIONLOSS
15:55:28.066 [ZKC-connect-executor-0-EventThread] INFO org.apache.pulsar.metadata.impl.PulsarZooKeeperClient - ZooKeeper session 3016bf7f9583a32 is expired from ip1:2181,ip2:2181,ip3:2181.
15:55:28.066 [metadata-store-38-1] INFO org.apache.bookkeeper.mledger.impl.ManagedLedgerFactoryImpl - Received MetadataStore session event: Reconnected
15:55:28.066 [metadata-store-coordination-service-47-1] INFO org.apache.pulsar.metadata.coordination.impl.LockManagerImpl - Metadata store connection has been re-established. Revalidating locks that were pending.
15:55:28.066 [metadata-store-coordination-service-64-1] INFO org.apache.pulsar.metadata.coordination.impl.LockManagerImpl - Metadata store connection has been re-established. Revalidating locks that were pending.
15:55:28.066 [metadata-store-coordination-service-40-1] INFO org.apache.pulsar.metadata.coordination.impl.LockManagerImpl - Metadata store connection has been re-established. Revalidating locks that were pending.
15:55:28.172 [ZKC-connect-executor-0] INFO org.apache.zookeeper.ZooKeeper - Session: 0x3016bf7f9583a32 closed
15:55:28.406 [ZKC-connect-executor-0] INFO org.apache.zookeeper.ZooKeeper - Initiating client connection, connectString=ip1:2181,ip2:2181,ip3:2181 sessionTimeout=30000 watcher=org.apache.bookkeeper.zookeeper.ZooKeeperWatcherBase@1a78dacd
15:55:28.173 [ZKC-connect-executor-0-EventThread] INFO org.apache.zookeeper.ClientCnxn - EventThread shut down for session: 0x3016bf7f9583a32
15:55:28.186 [ZKC-connect-executor-0-SendThread(ip2:2181)] INFO org.apache.zookeeper.ClientCnxn - Session establishment complete on server ip2/ip2:2181, session id = 0x20d399292970a9d, negotiated timeout = 30000
15:55:28.186 [ZKC-connect-executor-0-EventThread] INFO org.apache.bookkeeper.zookeeper.ZooKeeperWatcherBase - ZooKeeper client is connected now.
15:55:28.186 [metadata-store-38-1] INFO org.apache.pulsar.metadata.impl.ZKSessionWatcher - Got ZK session watch event: WatchedEvent state:SyncConnected type:None path:null zxid: -1
15:55:28.186 [metadata-store-38-1] INFO org.apache.pulsar.metadata.impl.ZKSessionWatcher - ZooKeeper client reconnection with server quorum. Current status: SessionLost
15:55:28.186 [metadata-store-38-1] INFO org.apache.pulsar.broker.PulsarService - Received metadata service session event: Reconnected
15:55:28.186 [ZKC-connect-executor-0-EventThread] ERROR org.apache.pulsar.metadata.impl.ZKMetadataStore - Failed to recreate persistent watch on ZooKeeper: CONNECTIONLOSS
15:55:28.186 [ZKC-connect-executor-0-EventThread] INFO org.apache.pulsar.metadata.impl.PulsarZooKeeperClient - ZooKeeper session 20d399292970a9d is expired from ip1:2181,ip2:2181,ip3:2181.
15:55:28.186 [metadata-store-38-1] INFO org.apache.bookkeeper.mledger.impl.ManagedLedgerFactoryImpl - Received MetadataStore session event: Reconnected
15:55:28.186 [metadata-store-coordination-service-47-1] INFO org.apache.pulsar.metadata.coordination.impl.LockManagerImpl - Metadata store connection has been re-established. Revalidating locks that were pending.
15:55:28.186 [metadata-store-coordination-service-40-1] INFO org.apache.pulsar.metadata.coordination.impl.LockManagerImpl - Metadata store connection has been re-established. Revalidating locks that were pending.
15:55:28.186 [metadata-store-coordination-service-64-1] INFO org.apache.pulsar.metadata.coordination.impl.LockManagerImpl - Metadata store connection has been re-established. Revalidating locks that were pending.
15:55:28.292 [ZKC-connect-executor-0] INFO org.apache.zookeeper.ZooKeeper - Session: 0x20d399292970a9d closed
15:55:28.292 [ZKC-connect-executor-0] INFO org.apache.zookeeper.ZooKeeper - Initiating client connection, connectString=ip1:2181,ip2:2181,ip3:2181 sessionTimeout=30000 watcher=org.apache.bookkeep
er.zookeeper.ZooKeeperWatcherBase@1a78dacd
15:55:28.294 [ZKC-connect-executor-0-EventThread] INFO org.apache.zookeeper.ClientCnxn - EventThread shut down for session: 0x20d399292970a9d
15:55:28.300 [ZKC-connect-executor-0-SendThread(ip3:2181)] INFO org.apache.zookeeper.ClientCnxn - Session establishment complete on server ip3/ip3:2181, session id = 0x3016bf7f9583a33, negotiated timeout = 30000
15:55:28.300 [ZKC-connect-executor-0-EventThread] INFO org.apache.bookkeeper.zookeeper.ZooKeeperWatcherBase - ZooKeeper client is connected now.
15:55:28.300 [metadata-store-38-1] INFO org.apache.pulsar.metadata.impl.ZKSessionWatcher - Got ZK session watch event: WatchedEvent state:SyncConnected type:None path:null zxid: -1
15:55:28.300 [metadata-store-38-1] INFO org.apache.pulsar.metadata.impl.ZKSessionWatcher - ZooKeeper client reconnection with server quorum. Current status: SessionLost
15:55:28.300 [metadata-store-38-1] INFO org.apache.pulsar.broker.PulsarService - Received metadata service session event: Reconnected
15:55:28.300 [ZKC-connect-executor-0-EventThread] ERROR org.apache.pulsar.metadata.impl.ZKMetadataStore - Failed to recreate persistent watch on ZooKeeper: CONNECTIONLOSS
15:55:28.300 [ZKC-connect-executor-0-EventThread] INFO org.apache.pulsar.metadata.impl.PulsarZooKeeperClient - ZooKeeper session 3016bf7f9583a33 is expired from ip1:2181,ip2:2181,ip3:2181.
15:55:28.300 [metadata-store-38-1] INFO org.apache.bookkeeper.mledger.impl.ManagedLedgerFactoryImpl - Received MetadataStore session event: Reconnected
15:55:28.300 [metadata-store-coordination-service-47-1] INFO org.apache.pulsar.metadata.coordination.impl.LockManagerImpl - Metadata store connection has been re-established. Revalidating locks that were pending.
15:55:28.300 [metadata-store-coordination-service-40-1] INFO org.apache.pulsar.metadata.coordination.impl.LockManagerImpl - Metadata store connection has been re-established. Revalidating locks that were pending.
15:55:28.300 [metadata-store-coordination-service-64-1] INFO org.apache.pulsar.metadata.coordination.impl.LockManagerImpl - Metadata store connection has been re-established. Revalidating locks that were pending.
15:55:28.406 [ZKC-connect-executor-0] INFO org.apache.zookeeper.ZooKeeper - Session: 0x3016bf7f9583a33 closed
15:55:28.406 [ZKC-connect-executor-0] INFO org.apache.zookeeper.ZooKeeper - Initiating client connection, connectString=ip1:2181,ip2:2181,ip3:2181 sessionTimeout=30000 watcher=org.apache.bookkeeper.zookeeper.ZooKeeperWatcherBase@1a78dacd
15:55:28.408 [ZKC-connect-executor-0-EventThread] INFO org.apache.zookeeper.ClientCnxn - EventThread shut down for session: 0x3016bf7f9583a33
There is some questions.
- 3-node Zk server is all runnable, it seems everything of zk server is OK. Actually the other 4 broker is OK, only 1 broker enter the loop to recreate ZkSession. And all Brokers located machine's load are low. I guess restart broker can fix this issue, but I do not do that, for the reason to preserve the problem site. But actually broker should recover by itself.
- Which case would result in this error log "Failed to recreate persistent watch on ZooKeeper: CONNECTIONLOSS" ?
- Currently I prefer that this stuck is because of broker's way to reCreate zk session, but not zk itself.
- I don't know whether should I update the zk version with pulsar version 3.0.
Anything else?
No response
Are you willing to submit a PR?
- [ ] I'm willing to submit a PR!