[Bug] Duplicate deletion of RoleBasedGroup-managed workloads during fault tolerance

Open bcfre opened this issue 3 months ago • 0 comments

Checklist

[x] 1. I have searched related issues but cannot get the expected help.
[x] 2. The bug has not been fixed in the latest version.
[x] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
[x] 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose instead; otherwise, it will be closed.
[x] 5. Please use English, otherwise it will be closed.

Describe the bug

Problem description

When restartPolicy is set to RecreateRBGOnPodRestart and a Pod belonging to a RoleBasedGroup (RBG) restarts, the controller should delete and recreate all workloads managed by that RBG. A single failure should trigger exactly one delete-and-recreate cycle. Currently, the controller deletes the RBG-managed workloads several times for a single failure: the logs below show three "Recreating RoleBasedGroup" entries for the same RBG within about 1.2 seconds, each with a different reconcileID.
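To make the expected "one recreate per failure" behavior concrete, here is a minimal Go sketch of a deduplication guard. It is not the operator's code and not a confirmed fix. It assumes (unverified) that several Pod events each trigger their own reconcile, and that the recreated RBG is a new object with a new UID. The names `recreateGuard`, `newRecreateGuard`, and `tryAcquire` are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// recreateGuard remembers for which RBG UIDs a recreate was already issued.
// Hypothetical helper for illustration; not part of the RBG operator.
type recreateGuard struct {
	mu   sync.Mutex
	seen map[string]struct{}
}

func newRecreateGuard() *recreateGuard {
	return &recreateGuard{seen: make(map[string]struct{})}
}

// tryAcquire returns true only for the first call with a given RBG UID.
// Later calls for the same UID return false, so the caller can skip the
// duplicate recreate.
func (g *recreateGuard) tryAcquire(rbgUID string) bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	if _, ok := g.seen[rbgUID]; ok {
		return false
	}
	g.seen[rbgUID] = struct{}{}
	return true
}

func main() {
	g := newRecreateGuard()
	// Simulate three reconciles for the same RBG instance, as in the logs.
	// The UID string is a placeholder, not a real object UID.
	for i := 0; i < 3; i++ {
		if g.tryAcquire("uid-of-restart-policy") {
			fmt.Println("recreating RoleBasedGroup")
		} else {
			fmt.Println("skipping: recreate already issued for this RBG instance")
		}
	}
}
```

Keying on the UID rather than the name means a later failure in the recreated RBG, which has a different UID, would still trigger a fresh recreate. This sketch never evicts entries; a real implementation would need to clean up UIDs of deleted objects.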

Observed logs

{"level":"INFO","time":"2025-10-10T00:02:39.326+0800","caller":"workloads/pod_controller.go:64","message":"Recreating RoleBasedGroup","controller":"pod-controller","namespace":"default","name":"restart-policy","reconcileID":"25b6de64-66f8-4f74-9073-a3aeefd4ef9f"}
{"level":"INFO","time":"2025-10-10T00:02:40.064+0800","caller":"workloads/pod_controller.go:64","message":"Recreating RoleBasedGroup","controller":"pod-controller","namespace":"default","name":"restart-policy","reconcileID":"f17ecabb-cf03-4c8d-a6f2-fd5e7ca9e7ff"}
{"level":"INFO","time":"2025-10-10T00:02:40.547+0800","caller":"workloads/pod_controller.go:64","message":"Recreating RoleBasedGroup","controller":"pod-controller","namespace":"default","name":"restart-policy","reconcileID":"534e5ef6-c591-4eff-9a7b-a4eb19d92024"}

Reproduction

Steps to reproduce

Deploy the following YAML:

apiVersion: workloads.x-k8s.io/v1alpha1
kind: RoleBasedGroup
metadata:
  name: restart-policy
spec:
  roles:
    - name: sts
      restartPolicy: RecreateRBGOnPodRestart
      replicas: 2
      template:
        metadata:
          labels:
            appVersion: v1
        spec:
          containers:
            - name: sts
              image: anolis-registry.cn-zhangjiakou.cr.aliyuncs.com/openanolis/nginx:1.14.1-8.6
              ports:
                - containerPort: 80

Manually delete a Pod created by the RoleBasedGroup.
Check the operator logs for repeated "Recreating RoleBasedGroup" entries.

Environment

normal

Oct 09 '25 16:10 bcfre