
Ambari metrics facing issues with helix-core version 1.3.2 / 1.4.3

Open vishalsuvagia opened this issue 3 months ago • 3 comments

Describe the bug

Apache Ambari Metrics uses Helix for cluster management tasks. We recently tried to upgrade the Helix dependency from 0.6.6 to 1.3.2 / 1.4.3. With the newer version of Helix, the Metrics Collector fails to start when the Hadoop cluster is deployed in Kerberos-enabled mode.

Based on my investigation, the failure appears to come from the changed ZooKeeper initialisation in Helix Core: the ZooKeeper client is never created, and service shutdown is triggered with the error in the trace below. I have verified ZooKeeper connectivity and confirmed that the znode ambari-metrics-cluster is present with node information.

2025-09-17 10:54:29,633 WARN org.apache.helix.manager.zk.ZKHelixManager: zkClient to testnode01.mycluster.org:2181 is not connected, wait for 10000ms.
2025-09-17 10:54:39,635 ERROR org.apache.helix.manager.zk.ZKHelixManager: zkClient is not connected after waiting 10000ms., clusterName: ambari-metrics-cluster, zkAddress: testnode01.mycluster.org:2181
ERROR org.apache.helix.manager.zk.ZKHelixManager: fail to createClient. retry 1
org.apache.helix.HelixException: HelixManager is not connected within retry timeout for cluster ambari-metrics-cluster
    at org.apache.helix.manager.zk.ZKHelixManager.checkConnected(ZKHelixManager.java:417)
    at org.apache.helix.manager.zk.ZKHelixManager.getConfigAccessor(ZKHelixManager.java:687)
    at org.apache.helix.manager.zk.ParticipantManager.<init>(ParticipantManager.java:118)
    at org.apache.helix.manager.zk.ZKHelixManager.handleNewSessionAsParticipant(ZKHelixManager.java:1440)
    at org.apache.helix.manager.zk.ZKHelixManager.handleNewSession(ZKHelixManager.java:1390)
    at org.apache.helix.manager.zk.ZKHelixManager.createClient(ZKHelixManager.java:782)
    at org.apache.helix.manager.zk.ZKHelixManager.connect(ZKHelixManager.java:817)
    at org.apache.ambari.metrics.core.timeline.availability.AggregationTaskRunner.initialize(AggregationTaskRunner.java:135)
    at org.apache.ambari.metrics.core.timeline.availability.MetricCollectorHAController.startAggregators(MetricCollectorHAController.java:205)
    at org.apache.ambari.metrics.core.timeline.availability.MetricCollectorHAController.initializeHAController(MetricCollectorHAController.java:184)
    at org.apache.ambari.metrics.core.timeline.HBaseTimelineMetricsService.initializeSubsystem(HBaseTimelineMetricsService.java:133)
    at org.apache.ambari.metrics.core.timeline.HBaseTimelineMetricsService.serviceInit(HBaseTimelineMetricsService.java:102)
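Since the failure only shows up in Kerberos-enabled mode, it may help to record which SASL/Kerberos-related JVM properties the collector process actually sees at startup. The following is a minimal diagnostic sketch, not a fix. The class name ZkSaslSettingsDump and its collect helper are hypothetical, and whether any of these settings is related to the failure above is an assumption that has not been confirmed. The property names themselves are the standard JVM and ZooKeeper client ones.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Properties;

// Hypothetical diagnostic helper: reports the JVM properties that the JVM and
// the ZooKeeper client read when SASL/Kerberos authentication is in play.
class ZkSaslSettingsDump {
    static final String[] KEYS = {
        "java.security.auth.login.config",  // JAAS configuration file
        "java.security.krb5.conf",          // Kerberos configuration file
        "zookeeper.sasl.client",            // enables or disables client-side SASL
        "zookeeper.sasl.clientconfig",      // JAAS section name used by the client
        "zookeeper.sasl.client.username"    // primary of the server principal
    };

    // Returns each key with its value, or "<unset>" when the property is absent.
    static Map<String, String> collect(Properties props) {
        Map<String, String> out = new LinkedHashMap<>();
        for (String key : KEYS) {
            out.put(key, props.getProperty(key, "<unset>"));
        }
        return out;
    }

    public static void main(String[] args) {
        collect(System.getProperties())
            .forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

Running the same dump under 0.6.6 and under 1.3.2 / 1.4.3 would show whether the client-side environment differs between the two setups.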

I am trying to understand what changed on the library side and what the appropriate fix is. So far I have tried setting the ZooKeeper timeouts through system properties and -D arguments, setting the helix.zk session and connection timeouts, and rewriting the ZKHelixManager initialisation to pass RealmAwareZkClient, RealmAwareZkClientConfig, CloudConfig and HelixManagerProperty instances built with the required parameters. None of these has worked. Any guidance on an appropriate fix would be appreciated.

For reference: Apache Ambari Metrics Helix upgrade (https://github.com/apache/ambari-metrics/pull/173) and JDK-17 support (https://github.com/apache/ambari-metrics/pull/134).
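For anyone reproducing this, the system-property approach mentioned above can be sketched roughly as follows. The property names zk.session.timeout, zk.connection.timeout and helixmanager.waitForConnectedTimeout are assumptions based on my reading of Helix's SystemPropertyKeys and should be checked against the Helix version in use. The class HelixZkTimeoutProps, its apply method and the millisecond values are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: sets timeout-related JVM properties before the Helix
// manager is created. Property names are assumptions; verify them against the
// SystemPropertyKeys class of the Helix version being used.
class HelixZkTimeoutProps {
    static final String SESSION = "zk.session.timeout";
    static final String CONNECTION = "zk.connection.timeout";
    static final String WAIT_CONNECTED = "helixmanager.waitForConnectedTimeout";

    // Applies the given values (in milliseconds) as system properties and
    // returns what was set, in insertion order.
    static Map<String, String> apply(int sessionMs, int connectionMs, int waitConnectedMs) {
        Map<String, String> applied = new LinkedHashMap<>();
        applied.put(SESSION, Integer.toString(sessionMs));
        applied.put(CONNECTION, Integer.toString(connectionMs));
        applied.put(WAIT_CONNECTED, Integer.toString(waitConnectedMs));
        applied.forEach(System::setProperty);
        return applied;
    }

    public static void main(String[] args) {
        // Illustrative values only.
        apply(60000, 60000, 30000)
            .forEach((k, v) -> System.out.println("-D" + k + "=" + v));
    }
}
```

The properties have to be in place before the manager is constructed and connect() is called. As noted above, raising the timeouts did not resolve the failure here, so longer waits alone do not appear to be the fix.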

cc: @jackjlli / @Jackie-Jiang


vishalsuvagia • Sep 22 '25 19:09

Hi Team, request to kindly help on resolving the issue.

vishalsuvagia • Sep 26 '25 07:09

Hi @jackjlli / @Jackie-Jiang, request to kindly help in resolving the issue.

vishalsuvagia • Oct 08 '25 12:10

Hi @jackjlli / @Jackie-Jiang, request to kindly help in resolving the issue.

vishalsuvagia • Nov 12 '25 06:11