
Infinite resource balancing issue

Open r0goyal opened this issue 3 years ago • 1 comments

Describe the bug

I have 5 resources in my cluster. Each node in my cluster acts as both participant and controller (only one is elected controller at a time). I am running a leader-follower state model with 1 leader and 1 follower. As long as I run 1 or 2 nodes, the cluster forms correctly and all resources are assigned correctly.

However, as soon as I add a third node, partition assignment keeps running and never completes.

To Reproduce

Steps to reproduce the behaviour.

Expected behaviour

The cluster should become stable after the 3rd node is added.

Additional context

Helix version - 1.0.2
Zookeeper version - 3.4.8-1--1
Application Java version - 1.8

Error logs

ERROR [2022-01-31 22:43:43,101] [ZkClient-EventThread-219-localhost:2181/apache-helix-clusters] [HelixTaskExecutor]: Message fd355807-0194-4cc9-ac17-f02cee8debbe cannot be processed: fd355807-0194-4cc9-ac17-f02cee8debbe, {CREATE_TIMESTAMP=1643649222888, ClusterEventName=CurrentStateChange, FROM_STATE=LEADER, MSG_ID=fd355807-0194-4cc9-ac17-f02cee8debbe, MSG_STATE=new, MSG_TYPE=STATE_TRANSITION, PARTITION_NAME=TE2201281333044107151256_12, RESOURCE_NAME=TE2201281333044107151256, RETRY_COUNT=3, SRC_NAME=e51087cc-2713-4ff2-bcba-5dbffc0f8638, SRC_SESSION_ID=579a797550e896c, STATE_MODEL_DEF=MatchmakerLeaderStandBy, STATE_MODEL_FACTORY_NAME=DEFAULT, TGT_NAME=82bd5be6-47ba-467d-a53e-f4c6e77d1f0d, TGT_SESSION_ID=579a797550e8970, TO_STATE=STANDBY}{}{}Partition TE2201281333044107151256_12 current state is same as toState (LEADER->STANDBY) from message.
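When many of these errors scroll by, it can help to pull the KEY=VALUE payload out of each line and compare FROM_STATE and TO_STATE per partition. The following is only a rough sketch: `TransitionMessageFields` is a hypothetical helper (not part of Helix), it assumes the fields sit in the first `{...}` block of the line, and it assumes no value contains a comma. The sample line in `main` is an abbreviated copy of the message above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Helix: extracts KEY=VALUE pairs from the
// first {...} block of a HelixTaskExecutor log line.
final class TransitionMessageFields {
    // Matches a brace-delimited block that contains no nested braces.
    private static final Pattern FIELDS = Pattern.compile("\\{([^{}]*)\\}");

    static Map<String, String> parse(String logLine) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher m = FIELDS.matcher(logLine);
        if (!m.find()) {
            return fields; // no payload block found
        }
        // Assumption: pairs are comma-separated and values contain no commas.
        for (String pair : m.group(1).split(",\\s*")) {
            int eq = pair.indexOf('=');
            if (eq > 0) {
                fields.put(pair.substring(0, eq).trim(), pair.substring(eq + 1).trim());
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        // Abbreviated copy of the log line from this issue.
        String line = "Message cannot be processed: {FROM_STATE=LEADER, "
            + "MSG_TYPE=STATE_TRANSITION, PARTITION_NAME=TE2201281333044107151256_12, "
            + "TO_STATE=STANDBY}{}{}Partition current state is same as toState";
        Map<String, String> f = parse(line);
        System.out.println(f.get("PARTITION_NAME") + ": "
            + f.get("FROM_STATE") + " -> " + f.get("TO_STATE"));
    }
}
```

Grouping the parsed entries by PARTITION_NAME would show whether the same partitions keep bouncing between LEADER and STANDBY.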

Screenshot of UI: [image: Screenshot 2022-02-01 at 9 55 10 AM]

Please let me know if any additional info is required.

r0goyal • Feb 01 '22 04:02

@r0goyal Are you using the default assignment algorithm? If so, this is a known issue that we call flip-flop assignment. I suggest switching to CRUSHED-based assignment, or, if you are on 1.0+, trying WAGED.

junkaixue • Mar 06 '22 22:03
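For readers wondering what the suggested switch to CRUSHED or WAGED looks like in practice, here is a minimal sketch of the IdealState simple fields involved. `RebalancerFields` is a hypothetical helper that only builds a map, and it does not talk to ZooKeeper. It assumes the resource runs in FULL_AUTO mode. The field names and class names are the ones I believe Helix 1.0.x uses, so please verify them against your version. With the Helix Java API, the equivalent calls would be `IdealState.setRebalanceStrategy(...)` or `IdealState.setRebalancerClassName(...)`, followed by writing the IdealState back through `HelixAdmin.setResourceIdealState(...)`.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Helix: lists the IdealState simple fields
// that select the assignment algorithm for a FULL_AUTO resource.
final class RebalancerFields {
    // Class names as shipped in Helix 1.0.x (verify against your version).
    static final String CRUSHED =
        "org.apache.helix.controller.rebalancer.strategy.CrushEdRebalanceStrategy";
    static final String WAGED =
        "org.apache.helix.controller.rebalancer.waged.WagedRebalancer";

    static Map<String, String> forAlgorithm(String algorithm) {
        Map<String, String> fields = new LinkedHashMap<>();
        // Assumption: the resource is managed in FULL_AUTO mode.
        fields.put("REBALANCE_MODE", "FULL_AUTO");
        switch (algorithm) {
            case "CRUSHED":
                // CrushEd is a strategy plugged into the default rebalancer.
                fields.put("REBALANCE_STRATEGY", CRUSHED);
                break;
            case "WAGED":
                // WAGED replaces the rebalancer class itself.
                fields.put("REBALANCER_CLASS_NAME", WAGED);
                break;
            default:
                throw new IllegalArgumentException("Unknown algorithm: " + algorithm);
        }
        return fields;
    }

    public static void main(String[] args) {
        forAlgorithm("CRUSHED").forEach((k, v) -> System.out.println(k + " = " + v));
        forAlgorithm("WAGED").forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

WAGED may also need cluster-level configuration (for example instance capacities) that this sketch does not cover.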