Lari Hotari
I created PR #195, which adds better logging to CI to help investigate CI failures. I have observed the "ZK TLS Only" CI job failing, presumably with the...
I have created #202 as a workaround for the Zookeeper issue (when TLS is enabled).
Here are logs for the ZK TLS Only failures filtered for pulsar-metadata and zookeeper logs: https://gist.githubusercontent.com/lhotari/200aae3dcb54912d0b5b5958ffb5fe13/raw/fd833e90307ce9e9e5f9084143450ab0de575d34/pulsar-ci-pulsar-init-and-zk-logs.txt from https://github.com/apache/pulsar-helm-chart/runs/4883211151?check_suite_focus=true
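For anyone wanting to do similar filtering locally, here is a minimal sketch. It assumes the CI logs have been saved as plain text lines; the function name `filter_log_lines` and the keyword list are illustrative, not the exact command used to produce the gist above.

```python
import re

# Assumed keywords: the comment only says the logs were filtered for
# pulsar-metadata and zookeeper; the exact filter used is not known.
KEYWORDS = ("pulsar-metadata", "zookeeper")


def filter_log_lines(lines, keywords=KEYWORDS):
    """Return the lines that mention any keyword (case-insensitive)."""
    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    return [line for line in lines if pattern.search(line)]


if __name__ == "__main__":
    # The first line is shortened from the log excerpt in this thread;
    # the second is a made-up unrelated line for contrast.
    sample = [
        "[pod/pulsar-ci-pulsar-init-pzc88/pulsar-bookkeeper-verify-clusterid] "
        "INFO org.apache.zookeeper.ClientCnxn - Opening socket connection",
        "some unrelated line",
    ]
    for line in filter_log_lines(sample):
        print(line)
```

Matching is case-insensitive so that both `zookeeper` in pod names and `ZooKeeper` in messages are kept.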
Based on the logs, it looks like the `pulsar-bookkeeper-verify-clusterid` container fails: https://github.com/apache/pulsar-helm-chart/blob/a919f309c6d73342196dbaf6bf146cfda8d9e8e8/charts/pulsar/templates/pulsar-cluster-initialize.yaml#L61-L73 . I wonder why `pulsar-bookkeeper-verify-clusterid`. ``` [pod/pulsar-ci-pulsar-init-pzc88/pulsar-bookkeeper-verify-clusterid] 14:22:08.341 [main-SendThread(pulsar-ci-zookeeper:2281)] INFO org.apache.zookeeper.ClientCnxn - Opening socket connection to server pulsar-ci-zookeeper/10.244.1.11:2281....
Here's the [bookie-init failure](https://gist.githubusercontent.com/lhotari/1762ae76483596a3d49d9c0df1e4bb6d/raw/cdb58eb085b336fa961e9283807698552ca536ee/pulsar-ci-bookie-init-and-zk-logs.txt): ``` 14:45:47.975 [main] INFO org.apache.zookeeper.ClientCnxnSocket - jute.maxbuffer value is 1048575 Bytes 14:45:47.983 [main] INFO org.apache.zookeeper.ClientCnxn - zookeeper.request.timeout value is 0. feature enabled=false 14:45:48.046 [main-SendThread(pulsar-ci-zookeeper:2281)] INFO org.apache.zookeeper.ClientCnxn...
It's possible that the Zookeeper issue is simply caused by the probe getting stuck. The changes in #179 fix the issue for 1.20+. I'll send a separate PR to...
Rebased after #214 changes. Let's see if the Zookeeper TLS tests pass now.
All tests pass now. I'll report this [on the dev mailing list thread](https://lists.apache.org/thread/619tpn6q5xbbhngwsmhtq3121vhjxpt4).
Closing and re-opening to run the tests one more time and confirm that the problem is fixed.
I'm hoping to get more logs from the failure after #215 changes are in place in CI.