
[Bugfix] Implement upgrade-aware controller ordering for FE/BEs/CNs

Open jmjm15x opened this issue 1 month ago • 16 comments

Description

Add upgrade sequencing control to the StarRocks Kubernetes Operator to ensure proper component ordering during both initial deployments and upgrades. Previously, the operator always used FE-first ordering, which is correct for initial deployments but incorrect for upgrades. According to StarRocks guidelines:

  • Initial Deployment: FE → BEs/CNs (FE must be leader before workers join)
  • Cluster Upgrades: BEs/CNs → FE (data nodes upgraded before metadata nodes)

From official documentation:

Upgrade procedure: By design, BEs and CNs are backward compatible with the FEs. Therefore, you need to upgrade BEs and CNs first and then FEs to allow your cluster to run properly while being upgraded. Upgrading them in an inverted order may lead to incompatibility between FEs and BEs/CNs, and thereby cause the service to crash.

Solution

Implemented a comprehensive upgrade detection and sequencing mechanism with robust component readiness validation to prevent premature progression between components.

Key Changes

1. Upgrade Detection (isUpgrade())

Detects upgrade scenarios by checking whether the component StatefulSets already exist with pending changes; it compares each StatefulSet's Generation against its Status.ObservedGeneration to detect spec changes. A minimal sketch follows the list below.

Why this approach?

  • Simple and reliable: uses Kubernetes' native generation tracking
  • Works for any spec change (images, resources, configs)
  • Handles transient states: correctly identifies an upgrade while a rollout is still in progress
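
A minimal Go sketch of this check (illustrative only: the signature, the stsNames argument, and the error handling are assumptions, not the PR's exact code):

package controllers

import (
    "context"

    appsv1 "k8s.io/api/apps/v1"
    apierrors "k8s.io/apimachinery/pkg/api/errors"
    "k8s.io/apimachinery/pkg/types"
    "sigs.k8s.io/controller-runtime/pkg/client"
)

// isUpgrade reports whether any existing component StatefulSet has a spec
// change that has not been fully observed or rolled out yet. stsNames would
// be the FE/BE/CN StatefulSet names derived from the StarRocksCluster CR.
func isUpgrade(ctx context.Context, c client.Client, namespace string, stsNames []string) (bool, error) {
    for _, name := range stsNames {
        var sts appsv1.StatefulSet
        err := c.Get(ctx, types.NamespacedName{Namespace: namespace, Name: name}, &sts)
        if apierrors.IsNotFound(err) {
            continue // StatefulSet not created yet: initial deployment for this component
        }
        if err != nil {
            return false, err
        }
        // Generation is bumped on every spec update; ObservedGeneration lags until
        // the StatefulSet controller has processed the change, and a revision
        // mismatch means a rollout is still in progress.
        if sts.Generation != sts.Status.ObservedGeneration ||
            sts.Status.CurrentRevision != sts.Status.UpdateRevision {
            return true, nil
        }
    }
    return false, nil
}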

2. Controller Ordering (getControllersInOrder())

Dynamically switches the controller execution order based on the deployment scenario (see the sketch after this list):

  • Upgrade scenario: [be, cn, fe, feproxy] (BE-first ordering)
  • Initial deployment: [fe, be, cn, feproxy] (FE-first ordering)
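
A minimal sketch of the ordering switch; the string keys are illustrative, and in the real operator they would map to the existing fe/be/cn/feproxy sub-controllers:

// componentOrder returns the component sync order for the given scenario.
func componentOrder(upgrade bool) []string {
    if upgrade {
        // Upgrade: backward-compatible data nodes (BE/CN) roll out first.
        return []string{"be", "cn", "fe", "feproxy"}
    }
    // Initial deployment: FE must be elected leader before workers join.
    return []string{"fe", "be", "cn", "feproxy"}
}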

3. Component Readiness Validation (isComponentReady())

Multi-layer validation that prevents premature component progression: it avoids race conditions, ensures rollout stability, and adds detailed logging for debugging. A sketch follows below.
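
A sketch of what such a layered check could look like with controller-runtime (the signature, the svcName parameter, and the helper shape are assumptions; the reason strings mirror the operator log messages shown in the test results below):

package controllers

import (
    "context"

    appsv1 "k8s.io/api/apps/v1"
    corev1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/types"
    "sigs.k8s.io/controller-runtime/pkg/client"
)

// isComponentReady layers endpoint, generation, revision, and replica checks
// so a component is only considered ready once its rollout has fully settled.
func isComponentReady(ctx context.Context, c client.Client, namespace, stsName, svcName string) (bool, string, error) {
    // 1. Service endpoints exist and have at least one ready address.
    var ep corev1.Endpoints
    if err := c.Get(ctx, types.NamespacedName{Namespace: namespace, Name: svcName}, &ep); err != nil {
        return false, "no ready endpoints", client.IgnoreNotFound(err)
    }
    hasReady := false
    for _, subset := range ep.Subsets {
        if len(subset.Addresses) > 0 {
            hasReady = true
            break
        }
    }
    if !hasReady {
        return false, "no ready endpoints", nil
    }

    var sts appsv1.StatefulSet
    if err := c.Get(ctx, types.NamespacedName{Namespace: namespace, Name: stsName}, &sts); err != nil {
        return false, "statefulset not found", client.IgnoreNotFound(err)
    }
    // 2. The StatefulSet controller has observed the latest spec.
    if sts.Generation != sts.Status.ObservedGeneration {
        return false, "StatefulSet spec change not yet observed", nil
    }
    // 3. The rollout is complete.
    if sts.Status.CurrentRevision != sts.Status.UpdateRevision {
        return false, "StatefulSet rollout in progress", nil
    }
    // 4. All replicas are ready.
    if sts.Spec.Replicas != nil && sts.Status.ReadyReplicas != *sts.Spec.Replicas {
        return false, "not all replicas ready", nil
    }
    return true, "", nil
}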

Logic Flow

Implements the waiting logic directly in the reconciliation loop (a Go sketch follows the diagram):

Reconcile() called
    ↓
Get controllers in order based on isUpgrade()
    ↓
For each controller in order:
    ↓
    ├─ If upgrade && feController
    │   └─ Check BE/CN ready? → If NO, wait and requeue
    │
    ├─ Sync controller (create/update resources)
    │
    ├─ If initial && feController  
    │   └─ Check FE ready? → If NO, wait and requeue
    │
    └─ If upgrade && (beController || cnController)
        └─ Check component ready? → If NO, wait and requeue
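
The loop below is a Go sketch of that flow; the Reconciler type, its isUpgrade/componentReady/syncComponent helper methods, and the 10-second requeue interval are assumptions that wrap the sketches above, not the PR's exact code:

package controllers

import (
    "context"
    "time"

    ctrl "sigs.k8s.io/controller-runtime"
)

func (r *Reconciler) reconcileComponents(ctx context.Context, namespace string) (ctrl.Result, error) {
    upgrade, err := r.isUpgrade(ctx, namespace)
    if err != nil {
        return ctrl.Result{}, err
    }

    for _, name := range componentOrder(upgrade) {
        // Upgrade: do not touch FE until BE and CN have finished rolling out.
        // r.componentReady wraps the isComponentReady check sketched above.
        if upgrade && name == "fe" {
            for _, worker := range []string{"be", "cn"} {
                ready, err := r.componentReady(ctx, namespace, worker)
                if err != nil {
                    return ctrl.Result{}, err
                }
                if !ready {
                    return ctrl.Result{RequeueAfter: 10 * time.Second}, nil
                }
            }
        }

        if err := r.syncComponent(ctx, namespace, name); err != nil {
            return ctrl.Result{}, err
        }

        // Initial deployment: BE/CN must wait for FE to become ready.
        // Upgrade: each BE/CN rollout must settle before the next component.
        needsWait := (!upgrade && name == "fe") || (upgrade && (name == "be" || name == "cn"))
        if needsWait {
            ready, err := r.componentReady(ctx, namespace, name)
            if err != nil {
                return ctrl.Result{}, err
            }
            if !ready {
                return ctrl.Result{RequeueAfter: 10 * time.Second}, nil
            }
        }
    }
    return ctrl.Result{}, nil
}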

End-to-End Test Results

Test Case 1: Initial FE+BE Deployment (v3.1.0)

Expected: FE-first ordering (FE must be ready before BE starts)

Timeline:
  T0:       FE Pod Created - 2025-10-09 07:29:58 UTC
  T0+48s:   BE Pod Created - 2025-10-09 07:30:46 UTC

Operator Logs:
  - "initial deployment: waiting for FE to be ready before creating BE/CN"
  - Component progression: feController → wait for FE ready → beController

Verification:
  - FE StatefulSet created first
  - BE StatefulSet created 48 seconds later (after FE became ready)

Test Case 2: Version Upgrade (v3.1.0 → v3.1.8)

Expected: BE-first ordering (BE must complete before FE starts)

Timeline:
  T0:       BE Pod Created - 2025-10-09 07:44:45 UTC
  T0+6s:    FE Pod Created - 2025-10-09 07:44:51 UTC

Operator Logs:
  - "component not ready: StatefulSet spec change not yet observed"
  - "component not ready: StatefulSet rollout in progress"
  - "component not ready: no ready endpoints"
  - "upgrade: waiting for component rollout to complete before proceeding"

Verification:
  - BE StatefulSet updated first (detected generation change)
  - BE pod rolled out to v3.1.8
  - FE update waited for BE rollout completion
  - FE pod rolled out to v3.1.8 after BE was ready

Test Case 3: Configuration Change (Memory: 2Gi → 4Gi)

Expected: BE-first ordering (config changes treated as upgrades)

Timeline:
  T0:       BE Pod Created - 2025-10-09 07:47:05 UTC
  T0+6s:    FE Pod Created - 2025-10-09 07:47:11 UTC

Operator Logs:
  - "component not ready: no ready endpoints"
  - "upgrade: waiting for component rollout to complete before proceeding"

Verification:
  - Configuration change correctly detected as upgrade scenario
  - BE rolled out first with new memory limits (4Gi)
  - FE waited for BE readiness before rolling out
  - Both components running with 4Gi memory

Verification

# Check initial deployment uses FE-first (wait message)
kubectl logs -n <namespace> deployment/kube-starrocks-operator | grep "initial deployment: waiting for FE"

# Check upgrade uses BE-first (wait message)
kubectl logs -n <namespace> deployment/kube-starrocks-operator | grep "upgrade: waiting for component rollout"

# Verify component readiness details
kubectl logs -n <namespace> deployment/kube-starrocks-operator | grep "component not ready"

# Check StatefulSet rollout sequence with timestamps
kubectl get statefulsets -n <namespace> -o json | jq -r '.items[] | "\(.metadata.name): \(.status.currentRevision) -> \(.status.updateRevision)"'

# Verify pod creation/recreation timestamps
kubectl get pods -n <namespace> -o custom-columns=NAME:.metadata.name,CREATED:.metadata.creationTimestamp

Checklist

For operator, please complete the following checklist:

  • [x] run make generate to generate the code.
  • [x] run golangci-lint run to check the code style (0 issues).
  • [x] run make test to run UT (all controller tests passing).
  • [x] run make manifests to update the yaml files of CRD.

For helm chart, please complete the following checklist:

  • [ ] make sure you have updated the values.yaml file of starrocks chart.
  • [ ] In the scripts directory, run bash create-parent-chart-values.sh to update the values.yaml file of the parent chart (kube-starrocks chart).

jmjm15x avatar Oct 09 '25 16:10 jmjm15x

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 2 committers have signed the CLA.

✅ yandongxiao
❌ jmjm15x


jmjm15x seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

CLAassistant avatar Oct 09 '25 16:10 CLAassistant

The upgrade sequence you mentioned is indeed a problem, as it doesn't follow the rules. I'd like to ask, have you encountered any issues during this upgrade process?

yandongxiao avatar Oct 09 '25 16:10 yandongxiao

Not with this approach; I encountered a critical race condition during upgrades with the previous implementation I tried (#704). When the operator updated a StatefulSet's spec, it would immediately check component readiness using only endpoint availability. However, endpoints don't immediately reflect the new state: the old pods remain "ready" for a few seconds while Kubernetes starts the rollout. This caused FE to upgrade prematurely, before BE/CN had completed their rollouts.

Fix: Implemented isComponentReady() with the following validation (a short illustration follows this list):

  1. Service endpoints exist
  2. StatefulSet controller observed the spec change (ObservedGeneration check)
  3. Rollout is complete (currentRevision == updateRevision)
  4. All replicas are ready
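
To make the race concrete, here is a small, hypothetical example (not taken from the PR) of the StatefulSet state the old endpoint-only check would have accepted right after a spec update: the old pods still report Ready, while the revision fields show the rollout has only just begun.

package main

import (
    "fmt"

    appsv1 "k8s.io/api/apps/v1"
)

func main() {
    // State right after the operator updates the BE StatefulSet spec: the old
    // pods are still Ready (so endpoints look healthy), but the new revision
    // has not been rolled out yet.
    sts := appsv1.StatefulSet{
        Status: appsv1.StatefulSetStatus{
            ObservedGeneration: 3,         // the controller has seen the new spec...
            CurrentRevision:    "be-7d9f", // ...but pods still run the old revision
            UpdateRevision:     "be-8a1c",
            ReadyReplicas:      3,         // old pods still report Ready
        },
    }
    sts.Generation = 3

    endpointOnlyView := sts.Status.ReadyReplicas > 0                           // the old check's view
    rolloutComplete := sts.Status.CurrentRevision == sts.Status.UpdateRevision // the added check

    fmt.Println("endpoint-only check says ready:", endpointOnlyView) // true (premature)
    fmt.Println("revision check says ready:", rolloutComplete)       // false (correct)
}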

Implementation approach: I kept the existing logic flow and controller structure intact, only enhancing the readiness checks for robustness. This ensures backward compatibility while fixing the race condition.

I included logs from a few E2E tests showing the proper sequencing (BE/CN → FE) in the description.

jmjm15x avatar Oct 09 '25 17:10 jmjm15x

The upgrade sequence you mentioned is indeed a problem, as it doesn't follow the rules. I'd like to ask, have you encountered any issues during this upgrade process?

@jmjm15x What I want to express here is: when you use the current version of the operator and upgrade FE and BE simultaneously, did you encounter any issues? So far, we have not received any other reports of issues caused by simultaneous FE/BE upgrades.

yandongxiao avatar Oct 10 '25 17:10 yandongxiao

The upgrade sequence you mentioned is indeed a problem, as it doesn't follow the rules. I'd like to ask, have you encountered any issues during this upgrade process?

@jmjm15x What I want to express here is: when you use the current version of the operator and upgrade FE and BE simultaneously, did you encounter any issues? So far, we have not received any other reports of issues caused by simultaneous FE/BE upgrades.

@yandongxiao, sorry for the misunderstanding. Yes, I’ve seen instability during upgrades in our clusters. Our current mitigation is using custom scripts to enforce BE-first ordering for stability, but this workaround risks disrupting the operator workflow.

jmjm15x avatar Oct 16 '25 00:10 jmjm15x

Please rebase your code onto the latest main branch. Some auto-tests were not being executed, and I have fixed that.

yandongxiao avatar Oct 16 '25 17:10 yandongxiao

Please rebase your code onto the latest main branch. Some auto-tests were not being executed, and I have fixed that.

Rebased from the main branch.

jmjm15x avatar Oct 20 '25 16:10 jmjm15x

Please fix the failed test cases; I think you can run make test on your local machine.

yandongxiao avatar Oct 20 '25 17:10 yandongxiao

Another question: I think this PR should exclude the third PR.

yandongxiao avatar Oct 20 '25 17:10 yandongxiao

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 2 committers have signed the CLA.
✅ yandongxiao
❌ jmjm15x

jmjm15x seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account. You have signed the CLA already but the status is still pending? Let us recheck it.

@jmjm15x Please sign the CLA.

yandongxiao avatar Oct 20 '25 17:10 yandongxiao

Please fix the failed test cases; I think you can run make test on your local machine.

Fixed the broken tests in the last commit.

jmjm15x avatar Oct 22 '25 20:10 jmjm15x

Another question: I think this PR should exclude the third PR.

@yandongxiao what do you mean by 3rd PR?

jmjm15x avatar Oct 22 '25 20:10 jmjm15x

@yandongxiao I noticed a typo in my CLA signature, that's why it's pending. Could you revoke it so I can sign again?

jmjm15x avatar Oct 22 '25 20:10 jmjm15x

@yandongxiao I noticed a typo in my CLA signature, that's why it's pending. Could you revoke it so I can sign again?

@kevincai

yandongxiao avatar Oct 22 '25 21:10 yandongxiao

Another question: I think this PR should exclude the third PR.

@yandongxiao what do you mean by 3rd PR?

Sorry, there are errors in what I wrote. Now the PR contains four commits, but one commit is from yandongxiao.

yandongxiao avatar Oct 22 '25 21:10 yandongxiao

Another question: I think this PR should exclude the third PR.

@yandongxiao what do you mean by 3rd PR?

Sorry, there are errors in what I wrote. Now the PR contains four commits, but one commit is from yandongxiao.

I rebased from main, and I think that's the reason for the third commit.

jmjm15x avatar Oct 22 '25 21:10 jmjm15x