rbg
[Bug] Duplicate deletion of RoleBasedGroup-managed workloads during fault tolerance
Checklist
- [x] 1. I have searched related issues but cannot get the expected help.
- [x] 2. The bug has not been fixed in the latest version.
- [x] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
- [x] 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
- [x] 5. Please use English, otherwise it will be closed.
Describe the bug
Problem description
When restartPolicy is set to RecreateRBGOnPodRestart and a Pod belonging to a RoleBasedGroup (RBG) restarts, the controller should delete and recreate all workloads managed by that RBG. A single failure should trigger exactly one delete-and-recreate cycle. Currently, however, a single failure causes the RBG-managed workloads to be deleted and recreated several times in quick succession.
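To make the expected behavior concrete, here is a minimal sketch of "one failure, one recreate" expressed as a deduplication guard. It is not the operator's code: `recreateGuard`, `shouldRecreate`, and the idea of a per-failure key are all hypothetical names introduced for illustration, and how such a key would be derived in the real controller is left open.

```go
package main

import (
	"fmt"
	"sync"
)

// recreateGuard is a hypothetical helper (not part of the rbg operator).
// It remembers which failure keys already triggered a recreate.
type recreateGuard struct {
	mu   sync.Mutex
	seen map[string]struct{}
}

func newRecreateGuard() *recreateGuard {
	return &recreateGuard{seen: make(map[string]struct{})}
}

// shouldRecreate reports true only the first time a given key is seen,
// so repeated events for the same failure do not trigger extra deletions.
func (g *recreateGuard) shouldRecreate(key string) bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	if _, ok := g.seen[key]; ok {
		return false
	}
	g.seen[key] = struct{}{}
	return true
}

func main() {
	g := newRecreateGuard()
	// Three reconcile events caused by the same failure (assumed key).
	for i := 0; i < 3; i++ {
		if g.shouldRecreate("default/restart-policy#failure-1") {
			fmt.Println("recreate")
		} else {
			fmt.Println("skip")
		}
	}
	// Prints: recreate, skip, skip (one per line).
}
```

With a guard of this shape, the three reconcile events in the logs would result in a single recreate instead of three.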
Observed logs
{"level":"INFO","time":"2025-10-10T00:02:39.326+0800","caller":"workloads/pod_controller.go:64","message":"Recreating RoleBasedGroup","controller":"pod-controller","namespace":"default","name":"restart-policy","reconcileID":"25b6de64-66f8-4f74-9073-a3aeefd4ef9f"}
{"level":"INFO","time":"2025-10-10T00:02:40.064+0800","caller":"workloads/pod_controller.go:64","message":"Recreating RoleBasedGroup","controller":"pod-controller","namespace":"default","name":"restart-policy","reconcileID":"f17ecabb-cf03-4c8d-a6f2-fd5e7ca9e7ff"}
{"level":"INFO","time":"2025-10-10T00:02:40.547+0800","caller":"workloads/pod_controller.go:64","message":"Recreating RoleBasedGroup","controller":"pod-controller","namespace":"default","name":"restart-policy","reconcileID":"534e5ef6-c591-4eff-9a7b-a4eb19d92024"}
Reproduction
Steps to reproduce
- Deploy the following YAML:
apiVersion: workloads.x-k8s.io/v1alpha1
kind: RoleBasedGroup
metadata:
  name: restart-policy
spec:
  roles:
    - name: sts
      restartPolicy: RecreateRBGOnPodRestart
      replicas: 2
      template:
        metadata:
          labels:
            appVersion: v1
        spec:
          containers:
            - name: sts
              image: anolis-registry.cn-zhangjiakou.cr.aliyuncs.com/openanolis/nginx:1.14.1-8.6
              ports:
                - containerPort: 80
- Manually delete one of the created Pods.
- Observe the operator log output.
Environment
normal