starrocks-kubernetes-operator
Rolling Restart should consider starrocks cluster status
Describe the current behavior
Currently, when we do a "rolling restart" of the cluster, the operator restarts the pods regardless of whether the StarRocks cluster is in a clean state. As a result, we see WRITE errors with "under-replicated" tablets during rolling restarts, because the cluster is still syncing tablets while the operator is already removing the next BE pod.
As a workaround, we currently delete pods manually (DELETE POD) instead of doing a rolling restart, and watch the "pending tablets" count on the cluster. We only move on to the next pod once "pending tablets = 0". See the example below:
PROD > SHOW PROC '/cluster_balance';
+-------------------+--------+
| Item              | Number |
+-------------------+--------+
| cluster_load_stat | 1      |
| working_slots     | 6      |
| sched_stat        | 1      |
| priority_repair   | 0      |
| pending_tablets   | 185    |
| running_tablets   | 32     |
| history_tablets   | 1000   |
| all_tablets       | 217    |
+-------------------+--------+
8 rows in set (0.06 sec)
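The manual check we do on this output could be sketched roughly as follows. This is only an illustration of our workaround, not operator code: the function names are hypothetical, and `rows` is assumed to be the `(Item, Number)` pairs that a MySQL client returns for `SHOW PROC '/cluster_balance'`.

```python
from typing import Dict, Iterable, Tuple, Union

Row = Tuple[str, Union[int, str]]


def cluster_balance_to_dict(rows: Iterable[Row]) -> Dict[str, int]:
    """Turn the (Item, Number) rows of SHOW PROC '/cluster_balance' into a dict.

    Numbers are passed through int() because some drivers return them as strings.
    """
    return {str(item): int(number) for item, number in rows}


def safe_to_restart_next_pod(rows: Iterable[Row]) -> bool:
    """Return True only if pending_tablets is present and equal to 0.

    A missing pending_tablets row is treated as "not safe" (assumption:
    it is better to wait than to restart on incomplete information).
    """
    stats = cluster_balance_to_dict(rows)
    return stats.get("pending_tablets", 1) == 0


# Rows copied from the output above: 185 pending tablets, so we would wait.
example_rows = [
    ("cluster_load_stat", 1),
    ("working_slots", 6),
    ("sched_stat", 1),
    ("priority_repair", 0),
    ("pending_tablets", 185),
    ("running_tablets", 32),
    ("history_tablets", 1000),
    ("all_tablets", 217),
]
print(safe_to_restart_next_pod(example_rows))
```

With the values shown above the check fails, so we would not delete the next BE pod yet.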
Describe the enhancement
The operator should take the "health/balance" state of the cluster into account and only proceed with removing pods once the cluster is in sync. We are not sure whether "pending_tablets" is the best indicator, but the operator should definitely avoid leaving tablets unwritable during restarts.
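To make the request more concrete, here is a minimal sketch of the gating logic we have in mind. It is not based on the operator's actual code: `gated_rolling_restart` and its parameters are hypothetical, `pending_tablets` stands in for whatever health signal is chosen, and `restart_pod` stands in for whatever the operator does to restart one BE pod.

```python
import time
from typing import Callable, Iterable, List


def gated_rolling_restart(
    pods: Iterable[str],
    pending_tablets: Callable[[], int],
    restart_pod: Callable[[str], None],
    poll_seconds: float = 10.0,
    timeout_seconds: float = 1800.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> List[str]:
    """Restart pods one at a time, waiting for pending_tablets() == 0 before each.

    Raises TimeoutError if the cluster does not settle within timeout_seconds
    for a given pod. Returns the names of the pods that were restarted.
    """
    restarted: List[str] = []
    for pod in pods:
        deadline = clock() + timeout_seconds
        # Block until the cluster reports no pending tablets.
        while pending_tablets() != 0:
            if clock() >= deadline:
                raise TimeoutError(f"cluster not in sync before restarting {pod}")
            sleep(poll_seconds)
        restart_pod(pod)
        restarted.append(pod)
    return restarted
```

A real implementation would probably also need to wait until the restarted pod is Ready and registered again before trusting the next reading, since the counter may not reflect the missing BE immediately after the pod is removed.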