
Rolling Restart should consider starrocks cluster status

Open milletnis opened this issue 1 year ago • 0 comments

Describe the current behavior

Currently, when we do a "rolling restart" of the cluster, the operator restarts the pods regardless of whether the StarRocks cluster is in a clean state. As a result, we see WRITE errors with "under-replicated" tablets during rolling restarts, because the cluster is still syncing tablets while the operator is already removing the next BE pod.

As a workaround, we currently delete the pods manually instead of using a rolling restart and watch the "pending tablets" count on the cluster. We only move on to the next pod once pending_tablets = 0. See the example below:

PROD > SHOW PROC '/cluster_balance';
+-------------------+--------+
| Item              | Number |
+-------------------+--------+
| cluster_load_stat | 1      |
| working_slots     | 6      |
| sched_stat        | 1      |
| priority_repair   | 0      |
| pending_tablets   | 185    |
| running_tablets   | 32     |
| history_tablets   | 1000   |
| all_tablets       | 217    |
+-------------------+--------+
8 rows in set (0.06 sec)
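
The manual check described above could be scripted roughly as follows. This is only a sketch: it parses the text printed by the mysql client, as in the output above, and applies the "pending_tablets = 0" rule we use by hand. The function names `parse_proc_table` and `safe_to_restart_next` are made up for illustration and are not part of the operator or of StarRocks.

```python
from typing import Dict


def parse_proc_table(output: str) -> Dict[str, int]:
    """Turn the ASCII table printed by the mysql client into {item: number}."""
    stats: Dict[str, int] = {}
    for line in output.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip the +---+ borders and the "rows in set" footer
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if len(cells) != 2 or not cells[1].isdigit():
            continue  # skip the "Item | Number" header row
        stats[cells[0]] = int(cells[1])
    return stats


def safe_to_restart_next(stats: Dict[str, int]) -> bool:
    """Assumed rule from our manual procedure: proceed only at 0 pending tablets.

    A missing pending_tablets row is treated as "not safe".
    """
    return stats.get("pending_tablets") == 0
```

With the output shown above (pending_tablets = 185), `safe_to_restart_next` returns False, so the next BE pod would not be deleted yet.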

Describe the enhancement

The operator should take the "health/balance" state of the cluster into account and only continue removing pods once the cluster is in sync. I am not sure whether "pending_tablets" is the best signal, but the operator should definitely avoid leaving tablets unwritable during restarts.
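
To illustrate the requested behavior, here is a minimal sketch of a gated rolling restart. It is not how the operator is implemented; `gated_rolling_restart`, `delete_pod`, and `pending_tablets` are hypothetical names, and the two callables stand in for "delete this pod" and "read pending_tablets from SHOW PROC '/cluster_balance'". The sketch only gates on the tablet count; a real implementation would presumably also wait for the restarted pod to become ready.

```python
import time
from typing import Callable, Iterable, List


def gated_rolling_restart(
    pods: Iterable[str],
    delete_pod: Callable[[str], None],      # hypothetical: deletes one BE pod
    pending_tablets: Callable[[], int],     # hypothetical: current pending_tablets
    poll_seconds: float = 10.0,
    timeout_seconds: float = 1800.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Delete pods one by one, waiting for pending_tablets == 0 before each."""
    restarted: List[str] = []
    for pod in pods:
        deadline = time.monotonic() + timeout_seconds
        # Hold back the next pod while the cluster is still repairing tablets.
        while pending_tablets() > 0:
            if time.monotonic() >= deadline:
                raise TimeoutError(
                    f"cluster still has pending tablets, not restarting {pod}"
                )
            sleep(poll_seconds)
        delete_pod(pod)
        restarted.append(pod)
    return restarted
```

Raising on timeout instead of continuing is a deliberate choice in this sketch: a stalled restart seems preferable to removing another BE pod while tablets are under-replicated.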

milletnis · Jan 23 '24 10:01