
Helix Controller Does Not Rebalance Partitions After A Participant Loss

Open skishtapuram-loyaltymethods opened this issue 6 months ago • 0 comments

Problem

We are using Apache Helix to distribute a resource across multiple services, all registered under the same Helix cluster name. The system is deployed on AWS ECS Fargate using Docker containers.

Each service uses the OnlineOffline state model, with 1 replica per partition and the FULL_AUTO rebalance mode.

When all services start, Helix assigns partitions as expected and every service consumes its assigned partitions. The problem occurs when a service is restarted because of an AWS Spot termination, which appears to forcefully kill the container and bypass the logic we have to handle SIGTERM.
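For readers unfamiliar with why a forced kill skips cleanup, here is a minimal sketch, assuming the participant runs on the JVM. `GracefulShutdown` and the thread name are hypothetical and are not our actual service code. A JVM shutdown hook runs on a normal exit or on SIGTERM, but never on SIGKILL, so a container that is killed outright gets no chance to leave the cluster cleanly.

```java
// Hypothetical helper for illustration only, not the service's real code.
final class GracefulShutdown {

    // Registers `cleanup` as a JVM shutdown hook and returns the hook thread.
    // The hook runs on normal exit or SIGTERM; it does NOT run on SIGKILL.
    static Thread register(Runnable cleanup) {
        Thread hook = new Thread(cleanup, "participant-shutdown");
        Runtime.getRuntime().addShutdownHook(hook);
        return hook;
    }

    public static void main(String[] args) {
        // Assumption: a real participant would disconnect from Helix here,
        // for example by calling disconnect() on its HelixManager.
        register(() -> System.out.println("disconnecting participant"));
    }
}
```

When the hook never runs, the participant's session is only cleaned up once ZooKeeper expires it, so the controller has to notice the loss through the live instance change rather than through an explicit disconnect.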

When a service leaves the cluster after such a forced termination, the partition previously assigned to the now-stopped node is not reassigned to a live node. As a result, that partition remains idle, even though other nodes are available and registered as live instances.

I have tried both Apache Helix 0.9.9 and the latest 1.4.3, and the problem persists in both versions. Also, when a node is removed from the cluster, the controller instance logs nothing to indicate that it is attempting to rebalance the partitions.

Additional Context After ZooKeeper Inspection

  • /LIVEINSTANCES correctly lists only the actively running nodes.
  • /EXTERNALVIEW, however, still shows the stopped node as ONLINE for that partition, and the partition is not reassigned to a live node.
  • This mismatch persists even after new nodes are registered as live instances.
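The comparison described in the bullets above can be sketched as a small check, assuming the two ZooKeeper views have already been read into plain collections. `StaleAssignmentCheck`, `findStale`, and the node and partition names are hypothetical, not part of the Helix API. The nested map mirrors the partition to instance to state layout of an external view.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical diagnostic helper, not part of the Helix API.
final class StaleAssignmentCheck {

    // externalView: partition name -> (instance name -> state).
    // Returns, per partition, the instances reported ONLINE that are
    // missing from the live instance set.
    static Map<String, List<String>> findStale(
            Set<String> liveInstances,
            Map<String, Map<String, String>> externalView) {
        Map<String, List<String>> stale = new TreeMap<>();
        for (Map.Entry<String, Map<String, String>> partition : externalView.entrySet()) {
            for (Map.Entry<String, String> replica : partition.getValue().entrySet()) {
                boolean online = "ONLINE".equals(replica.getValue());
                if (online && !liveInstances.contains(replica.getKey())) {
                    stale.computeIfAbsent(partition.getKey(), k -> new ArrayList<>())
                         .add(replica.getKey());
                }
            }
        }
        return stale;
    }

    public static void main(String[] args) {
        // Made-up example data: node-b was killed but is still ONLINE in the view.
        Set<String> live = Set.of("node-a", "node-c");
        Map<String, Map<String, String>> view = Map.of(
                "resource_0", Map.of("node-a", "ONLINE"),
                "resource_1", Map.of("node-b", "ONLINE"));
        System.out.println(findStale(live, view)); // prints {resource_1=[node-b]}
    }
}
```

A non-empty result corresponds to the state we observe: the external view still points at an instance that is no longer live.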

Expected behavior

Helix should detect that the node is no longer live and automatically reassign its partitions to other available nodes in the cluster.
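To make the expected outcome concrete, here is a toy model of it, assuming 1 replica per partition as in our setup. This is not Helix's FULL_AUTO rebalancer, whose actual placement strategy is more involved. `ExpectedReassignment` and all names are hypothetical, and the least-loaded rule is only a stand-in for whatever placement Helix would choose.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Toy model of the expected result, NOT Helix's rebalancing algorithm.
final class ExpectedReassignment {

    // assignment: partition -> owning instance (1 replica per partition).
    // Partitions owned by a non-live instance move to the least-loaded
    // live instance; ties are broken by instance name.
    static Map<String, String> reassign(Map<String, String> assignment,
                                        Set<String> liveInstances) {
        if (liveInstances.isEmpty()) {
            throw new IllegalArgumentException("no live instances");
        }
        // Count how many partitions each live instance already owns.
        Map<String, Integer> load = new TreeMap<>();
        for (String instance : liveInstances) {
            load.put(instance, 0);
        }
        for (String owner : assignment.values()) {
            if (load.containsKey(owner)) {
                load.merge(owner, 1, Integer::sum);
            }
        }
        Map<String, String> result = new TreeMap<>(assignment);
        for (Map.Entry<String, String> entry : result.entrySet()) {
            if (!liveInstances.contains(entry.getValue())) {
                String target = leastLoaded(load);
                entry.setValue(target);
                load.merge(target, 1, Integer::sum);
            }
        }
        return result;
    }

    private static String leastLoaded(Map<String, Integer> load) {
        String best = null;
        for (Map.Entry<String, Integer> e : load.entrySet()) {
            if (best == null || e.getValue() < load.get(best)) {
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Made-up data: node-b was terminated and node-d has joined.
        Map<String, String> before = Map.of("p0", "node-a", "p1", "node-b", "p2", "node-c");
        Set<String> live = Set.of("node-a", "node-c", "node-d");
        System.out.println(reassign(before, live)); // prints {p0=node-a, p1=node-d, p2=node-c}
    }
}
```

In our cluster the equivalent of `p1` never moves: it stays mapped to the terminated node until every service is restarted.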

What worked for us

I have been able to temporarily resolve the issue by manually restarting all services registered under that Helix cluster. After the manual restart, all partitions are reassigned correctly.

I will be active here and will respond to replies as soon as possible. I am happy to provide any additional information or logs that would help with debugging this issue.