Apache Ignite Rebalancing

Open GhoufranGhazaly opened this issue 1 month ago • 1 comment

I was working with Apache Ignite v2.17 and have started upgrading to v3.0. I have 3 server nodes and a .NET API as a client. I created one zone with replicas = 2 and scaledown = 5, created a table that belongs to this zone, and then took one of the nodes offline. The cluster keeps working fine (I can check cluster status, zones, ...), but when I query the data I get errors like these:

SELECT * from t_read_status limit 1;
SQL query execution error
Mandatory nodes was excluded from mapping: [Node3]

select count(*) from t_read_status;
SQL query execution error
The primary replica await timed out [replicationGroupId=17_part_1, referenceTimestamp=HybridTimestamp [physical=2025-10-28 11:47:20:016 +0000, logical=0, composite=115451628094488576], currentLease=Lease [leaseholder=Node2, leaseholderId=486d7bc8-b03e-4c0f-91a8-76a6bb0dc243, accepted=false, startTime=HybridTimestamp [physical=2025-10-28 11:47:14:874 +0000, logical=95, composite=115451627757502559], expirationTime=HybridTimestamp [physical=2025-10-28 11:49:14:874 +0000, logical=0, composite=115451635621822464], prolongable=true, proposedCandidate=null, replicationGroupId=17_part_1]]

I then checked the online node logs and found:

  [WARNING][%Node1%replica-4][ReplicaManager] Failed to process the lease granted message

Then I changed the cluster config replication.leaseExpirationInterval to 30000 and ran the test again. I got the same result, but this time the node logs showed:

[RebalanceRaftGroupEventsListener] Going to retry rebalance [attemptNo=16908, partId=20_part_18]
[ERROR][%Node1%JRaft-Request-Processor-8][ReplicatorGroupImpl] Fail to check replicator connection to peer=Node2, replicatorType=Follower. 
[RebalanceRaftGroupEventsListener] Number of retries for rebalance exceeded the threshold [partId=17_part_18, threshold=10]

These are only some of the logs. I am lost and don't know why the cluster (Raft) can't do rebalancing.

Oct 29 '25 11:10 GhoufranGhazaly

Hello! Did I understand correctly that you created a zone with replication factor 2 and then stopped one of the replicas? Could you please clarify which scaling setting you set to 5? We have two auto-scaling parameters: AUTO SCALE UP and AUTO SCALE DOWN (see https://ignite.apache.org/docs/ignite3/latest/administrators-guide/storage/distribution-zones#cluster-scaling)
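
As background to the replica question, here is a minimal sketch of the majority arithmetic that a Raft group relies on. It assumes that every replica of a partition is a voting member of that partition's Raft group, which may not match how a particular Ignite 3 version or zone is configured. The helper names `raft_majority` and `survives` are hypothetical and are not part of any Ignite API.

```python
def raft_majority(replicas: int) -> int:
    """Smallest number of voters that forms a strict majority."""
    return replicas // 2 + 1


def survives(replicas: int, failed: int) -> bool:
    """True if the remaining voters can still form a majority.

    Assumption: all replicas are voting members of the Raft group.
    """
    return replicas - failed >= raft_majority(replicas)


if __name__ == "__main__":
    # Compare a 2-replica zone with a 3-replica zone when one node goes offline.
    for replicas in (2, 3):
        print(
            f"replicas={replicas} "
            f"majority={raft_majority(replicas)} "
            f"survives_one_failure={survives(replicas, 1)}"
        )
```

Under that assumption, a group of 2 needs both members to agree, so a partition that had a replica on the stopped node cannot elect a leader or commit a configuration change until the node returns. A group of 3 still has a majority after one failure.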

Nov 14 '25 14:11 alievmirza