starrocks-kubernetes-operator
[Bugfix] Implement upgrade-aware controller ordering for FE/BEs/CNs
Description
Add upgrade sequencing control to the StarRocks Kubernetes Operator to ensure proper component ordering during both initial deployments and upgrades. Previously, the operator always used FE-first ordering, which is correct for initial deployments but incorrect for upgrades. According to StarRocks guidelines:
- Initial Deployment: FE → BEs/CNs (FE must be leader before workers join)
- Cluster Upgrades: BEs/CNs → FE (data nodes upgraded before metadata nodes)
From official documentation:
Upgrade procedure
By design, BEs and CNs are backward compatible with the FEs. Therefore, you need to upgrade BEs and CNs first and then FEs to allow your cluster to run properly while being upgraded. Upgrading them in an inverted order may lead to incompatibility between FEs and BEs/CNs, and thereby cause the service to crash.
Solution
Implemented a comprehensive upgrade detection and sequencing mechanism with robust component readiness validation to prevent premature progression between components.
Key Changes
1. Upgrade Detection (isUpgrade())
Detects upgrade scenarios by checking whether StatefulSets already exist with pending changes; it compares Generation with ObservedGeneration to detect spec changes (a sketch follows the list below).
Why this approach?
- Simple and reliable, uses Kubernetes native generation tracking
- Works for any spec change (images, resources, configs)
- Handles transient states, correctly identifies upgrade during rollout progress
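A minimal sketch of this check, assuming a controller-runtime client; the function signature, label-selector argument, and filtering are illustrative and may differ from the actual operator code:

```go
package controller

import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// isUpgrade reports whether any owned StatefulSet has a spec change that its
// controller has not yet observed or finished rolling out. Illustrative
// sketch only, not the operator's actual API.
func isUpgrade(ctx context.Context, c client.Client, namespace string, labels client.MatchingLabels) (bool, error) {
	var stsList appsv1.StatefulSetList
	if err := c.List(ctx, &stsList, client.InNamespace(namespace), labels); err != nil {
		return false, err
	}
	for _, sts := range stsList.Items {
		// Generation bumps on every spec change; ObservedGeneration lags behind
		// until the StatefulSet controller has processed the new spec.
		if sts.Generation != sts.Status.ObservedGeneration {
			return true, nil
		}
		// Mismatched revisions mean a rollout is still in progress.
		if sts.Status.CurrentRevision != sts.Status.UpdateRevision {
			return true, nil
		}
	}
	return false, nil
}
```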
2. Controller Ordering (getControllersInOrder())
Dynamically switches the controller execution order based on the deployment scenario (sketched below):
- Upgrade scenario: [be, cn, fe, feproxy] (BE-first ordering)
- Initial deployment: [fe, be, cn, feproxy] (FE-first ordering)
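Under the same assumptions, the ordering switch itself can be as simple as the following sketch (component names as plain strings; the real operator uses typed sub-controllers):

```go
// getControllersInOrder is an illustrative sketch of the ordering switch.
// During an upgrade, BE/CN come before FE per the StarRocks upgrade
// procedure; on initial deployment FE comes first so it can become leader
// before workers join.
func getControllersInOrder(upgrade bool) []string {
	if upgrade {
		return []string{"be", "cn", "fe", "feproxy"}
	}
	return []string{"fe", "be", "cn", "feproxy"}
}
```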
3. Component Readiness Validation (isComponentReady())
Validates readiness in multiple layers to prevent premature component progression: it avoids race conditions, ensures rollout stability, and adds detailed logging for debugging (sketched below).
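A minimal sketch of the layered check, assuming a controller-runtime client and illustrative names; the reason strings mirror the "component not ready" log messages shown in the test results below:

```go
package controller

import (
	"context"
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// isComponentReady is an illustrative sketch of the layered readiness check:
// spec change observed, rollout complete, all replicas ready, and at least
// one ready service endpoint. Names are assumptions, not the operator's API.
func isComponentReady(ctx context.Context, c client.Client, ns, stsName, svcName string) (bool, string, error) {
	var sts appsv1.StatefulSet
	if err := c.Get(ctx, client.ObjectKey{Namespace: ns, Name: stsName}, &sts); err != nil {
		return false, "", err
	}
	// 1. The StatefulSet controller must have observed the latest spec.
	if sts.Generation != sts.Status.ObservedGeneration {
		return false, "StatefulSet spec change not yet observed", nil
	}
	// 2. The rollout must be complete (old and new revisions match).
	if sts.Status.CurrentRevision != sts.Status.UpdateRevision {
		return false, "StatefulSet rollout in progress", nil
	}
	// 3. Every desired replica must be ready.
	if sts.Spec.Replicas != nil && sts.Status.ReadyReplicas != *sts.Spec.Replicas {
		return false, fmt.Sprintf("%d/%d replicas ready", sts.Status.ReadyReplicas, *sts.Spec.Replicas), nil
	}
	// 4. The component service must have at least one ready endpoint.
	var eps corev1.Endpoints
	if err := c.Get(ctx, client.ObjectKey{Namespace: ns, Name: svcName}, &eps); err != nil {
		return false, "", err
	}
	for _, subset := range eps.Subsets {
		if len(subset.Addresses) > 0 {
			return true, "", nil
		}
	}
	return false, "no ready endpoints", nil
}
```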
Logic Flow
Implements waiting logic directly in the reconciliation loop:
Reconcile() called
↓
Get controllers in order based on isUpgrade()
↓
For each controller in order:
↓
├─ If upgrade && feController
│ └─ Check BE/CN ready? → If NO, wait and requeue
│
├─ Sync controller (create/update resources)
│
├─ If initial && feController
│ └─ Check FE ready? → If NO, wait and requeue
│
└─ If upgrade && (beController || cnController)
└─ Check component ready? → If NO, wait and requeue
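A condensed sketch of that flow under the same assumptions; the componentSyncer interface, helper names, and the 10-second requeue interval are hypothetical, and getControllersInOrder refers to the earlier sketch:

```go
package controller

import (
	"context"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// componentSyncer abstracts the per-component operations so the ordering
// logic can be shown in isolation (hypothetical interface, for illustration).
// A real implementation would treat components that are not deployed as ready.
type componentSyncer interface {
	Sync(ctx context.Context, component string) error
	Ready(ctx context.Context, component string) (bool, error)
}

func reconcileInOrder(ctx context.Context, s componentSyncer, upgrade bool) (ctrl.Result, error) {
	wait := ctrl.Result{RequeueAfter: 10 * time.Second}
	for _, name := range getControllersInOrder(upgrade) {
		// Upgrade: FE must wait for every BE/CN rollout to complete first.
		if upgrade && name == "fe" {
			for _, worker := range []string{"be", "cn"} {
				ready, err := s.Ready(ctx, worker)
				if err != nil {
					return ctrl.Result{}, err
				}
				if !ready {
					return wait, nil // requeue and re-check later
				}
			}
		}
		// Sync (create/update) this component's resources.
		if err := s.Sync(ctx, name); err != nil {
			return ctrl.Result{}, err
		}
		// Initial deployment: BE/CN must not start until FE is ready.
		// Upgrade: each BE/CN rollout must finish before moving on.
		mustWait := (!upgrade && name == "fe") || (upgrade && (name == "be" || name == "cn"))
		if mustWait {
			ready, err := s.Ready(ctx, name)
			if err != nil {
				return ctrl.Result{}, err
			}
			if !ready {
				return wait, nil
			}
		}
	}
	return ctrl.Result{}, nil
}
```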
End-to-End Test Results
Test Case 1: Initial FE+BE Deployment (v3.1.0)
Expected: FE-first ordering (FE must be ready before BE starts)
Timeline:
T0: FE Pod Created - 2025-10-09 07:29:58 UTC
T0+48s: BE Pod Created - 2025-10-09 07:30:46 UTC
Operator Logs:
- "initial deployment: waiting for FE to be ready before creating BE/CN"
- Component progression: feController → wait for FE ready → beController
Verification:
- FE StatefulSet created first
- BE StatefulSet created 48 seconds later (after FE became ready)
Test Case 2: Version Upgrade (v3.1.0 → v3.1.8)
Expected: BE-first ordering (BE must complete before FE starts)
Timeline:
T0: BE Pod Created - 2025-10-09 07:44:45 UTC
T0+6s: FE Pod Created - 2025-10-09 07:44:51 UTC
Operator Logs:
- "component not ready: StatefulSet spec change not yet observed"
- "component not ready: StatefulSet rollout in progress"
- "component not ready: no ready endpoints"
- "upgrade: waiting for component rollout to complete before proceeding"
Verification:
- BE StatefulSet updated first (detected generation change)
- BE pod rolled out to v3.1.8
- FE update waited for BE rollout completion
- FE pod rolled out to v3.1.8 after BE was ready
Test Case 3: Configuration Change (Memory: 2Gi → 4Gi)
Expected: BE-first ordering (config changes treated as upgrades)
Timeline:
T0: BE Pod Created - 2025-10-09 07:47:05 UTC
T0+6s: FE Pod Created - 2025-10-09 07:47:11 UTC
Operator Logs:
- "component not ready: no ready endpoints"
- "upgrade: waiting for component rollout to complete before proceeding"
Verification:
- Configuration change correctly detected as upgrade scenario
- BE rolled out first with new memory limits (4Gi)
- FE waited for BE readiness before rolling out
- Both components running with 4Gi memory
Verification
# Check initial deployment uses FE-first (wait message)
kubectl logs -n <namespace> deployment/kube-starrocks-operator | grep "initial deployment: waiting for FE"
# Check upgrade uses BE-first (wait message)
kubectl logs -n <namespace> deployment/kube-starrocks-operator | grep "upgrade: waiting for component rollout"
# Verify component readiness details
kubectl logs -n <namespace> deployment/kube-starrocks-operator | grep "component not ready"
# Check StatefulSet rollout sequence with timestamps
kubectl get statefulsets -n <namespace> -o json | jq -r '.items[] | "\(.metadata.name): \(.status.currentRevision) -> \(.status.updateRevision)"'
# Verify pod creation/recreation timestamps
kubectl get pods -n <namespace> -o custom-columns=NAME:.metadata.name,CREATED:.metadata.creationTimestamp
Checklist
For operator, please complete the following checklist:
- [x] run `make generate` to generate the code.
- [x] run `golangci-lint run` to check the code style (0 issues).
- [x] run `make test` to run UT (all controller tests passing).
- [x] run `make manifests` to update the yaml files of CRD.
For helm chart, please complete the following checklist:
- [ ] make sure you have updated the values.yaml file of starrocks chart.
- [ ] In the `scripts` directory, run `bash create-parent-chart-values.sh` to update the values.yaml file of the parent chart (kube-starrocks chart).
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 2 committers have signed the CLA.
:white_check_mark: yandongxiao
:x: jmjm15x
jmjm15x does not seem to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.
The upgrade sequence you mentioned is indeed a problem, as it doesn't follow the rules. I'd like to ask, have you encountered any issues during this upgrade process?
Not with this approach. I encountered a critical race condition during upgrades with the previous implementation I tried (#704). When the operator updated a StatefulSet's spec, it would immediately check component readiness using only endpoint availability. However, endpoints don't immediately reflect the new state: the old pods remain "ready" for a few seconds while Kubernetes starts the rollout. This caused FE to upgrade prematurely, before BE/CN completed their rollouts.
Fix: Implemented isComponentReady() with the following validation:
- Service endpoints exist
- StatefulSet controller observed the spec change (`ObservedGeneration` check)
- Rollout is complete (`currentRevision == updateRevision`)
- All replicas are ready
Implementation approach: I kept the existing logic flow and controller structure intact, only enhancing the readiness checks for robustness. This ensures backward compatibility while fixing the race condition.
I included logs from a few E2E tests showing the proper sequencing (BE/CN → FE) in the description.
> The upgrade sequence you mentioned is indeed a problem, as it doesn't follow the rules. I'd like to ask, have you encountered any issues during this upgrade process?
@jmjm15x What I want to express here is, when you use the current version of the operator and upgrade FE and BE simultaneously, did you encounter any issues? Currently, we have not received any other reports of issues caused by simultaneous FE/BE upgrades.
@yandongxiao, sorry for the misunderstanding. Yes, I’ve seen instability during upgrades in our clusters. Our current mitigation is using custom scripts to enforce BE-first ordering for stability, but this workaround risks disrupting the operator workflow.
Please rebase your code onto the latest main branch. Some auto-tests were not being executed, and I have fixed that.
Rebased from the main branch.
Please fix the failed test cases; you can run `make test` on your local machine.
Another question: I think this PR should exclude the third PR.
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 2 committers have signed the CLA.
✅ yandongxiao
❌ jmjm15x
jmjm15x does not seem to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account. You have signed the CLA already but the status is still pending? Let us recheck it.
@jmjm15x Please sign the CLA.
> Please fix the failed test cases; you can run `make test` on your local machine.
Fixed the broken tests in the last commit.
> Another question: I think this PR should exclude the third PR.
@yandongxiao what do you mean by 3rd PR?
@yandongxiao I noticed a typo in my CLA signature, that's why it's pending. Could you revoke it so I can sign again?
@kevincai
> Another question: I think this PR should exclude the third PR.
> @yandongxiao what do you mean by 3rd PR?
Sorry, there was an error in what I wrote. The PR now contains four commits, but one of them is from yandongxiao.
I rebased from main, and I think that's the reason for the third commit.