PodGroup state changed from Running to Inqueue after pods are deleted
What happened:
What you expected to happen:
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know?:
job yaml:
apiVersion: batch.paddlepaddle.org/v1
kind: PaddleJob
metadata:
  name: wide-ande-deep2
spec:
  cleanPodPolicy: OnCompletion
  withGloo: 1
  worker:
    replicas: 1
    template:
      spec:
        schedulerName: volcano
        containers:
          - name: paddle
            image: registry.baidubce.com/paddle-operator/demo-wide-and-deep:v1
  ps:
    replicas: 1
    template:
      spec:
        schedulerName: volcano
        containers:
          - name: paddle
            image: registry.baidubce.com/paddle-operator/demo-wide-and-deep:v1
In this case (cleanPodPolicy=OnCompletion), when the pods complete they are deleted by the paddlejob controller. The PodGroup status then transitions Inqueue => Running => Pending (after all the pods are deleted) => Inqueue (it passes the enqueue action again and then stays Inqueue forever), which results in unnecessary resource occupation.
related logic:
func jobStatus(ssn *Session, jobInfo *api.JobInfo) scheduling.PodGroupStatus {
    status := jobInfo.PodGroup.Status

    unschedulable := false
    for _, c := range status.Conditions {
        if c.Type == scheduling.PodGroupUnschedulableType &&
            c.Status == v1.ConditionTrue &&
            c.TransitionID == string(ssn.UID) {
            unschedulable = true
            break
        }
    }

    // If running tasks && unschedulable, unknown phase
    if len(jobInfo.TaskStatusIndex[api.Running]) != 0 && unschedulable {
        status.Phase = scheduling.PodGroupUnknown
    } else {
        allocated := 0
        for status, tasks := range jobInfo.TaskStatusIndex {
            if api.AllocatedStatus(status) || status == api.Succeeded {
                allocated += len(tasks)
            }
        }

        // If there're enough allocated resource, it's running
        if int32(allocated) >= jobInfo.PodGroup.Spec.MinMember {
            status.Phase = scheduling.PodGroupRunning
        } else if jobInfo.PodGroup.Status.Phase != scheduling.PodGroupInqueue {
            // here PodGroup status converts from Running to Pending
            status.Phase = scheduling.PodGroupPending
        }
    }

    status.Running = int32(len(jobInfo.TaskStatusIndex[api.Running]))
    status.Failed = int32(len(jobInfo.TaskStatusIndex[api.Failed]))
    status.Succeeded = int32(len(jobInfo.TaskStatusIndex[api.Succeeded]))

    return status
}
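For illustration, here is a minimal, self-contained sketch of the decision above, using simplified stand-in types rather than Volcano's real API (all names below are hypothetical): once the paddlejob controller has deleted the completed pods, the task index is empty, so a Running PodGroup falls back to Pending, and the enqueue action later moves it to Inqueue again.

package main

import "fmt"

// Simplified stand-ins for the scheduler types used in the jobStatus excerpt above.
type phase string
type taskStatus string

const (
    phasePending phase = "Pending"
    phaseInqueue phase = "Inqueue"
    phaseRunning phase = "Running"
    phaseUnknown phase = "Unknown"

    taskRunning   taskStatus = "Running"
    taskSucceeded taskStatus = "Succeeded"
)

// nextPhase mirrors the branch structure of jobStatus: if not enough tasks are
// allocated or succeeded and the group is not already Inqueue, the phase falls
// back to Pending.
func nextPhase(current phase, tasks map[taskStatus]int, minMember int, unschedulable bool) phase {
    if tasks[taskRunning] != 0 && unschedulable {
        return phaseUnknown
    }
    // Simplification: only Running and Succeeded tasks are counted as allocated here.
    allocated := tasks[taskRunning] + tasks[taskSucceeded]
    if allocated >= minMember {
        return phaseRunning
    }
    if current != phaseInqueue {
        return phasePending // the Running => Pending transition described in this issue
    }
    return current
}

func main() {
    // cleanPodPolicy=OnCompletion: the completed pods are deleted, so the task
    // index is empty and the Running PodGroup drops to Pending; the enqueue
    // action will then move it back to Inqueue.
    fmt.Println(nextPhase(phaseRunning, map[taskStatus]int{}, 2, false)) // Pending
}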
Environment:
- Volcano Version: latest image
- Kubernetes version (use kubectl version):
- Cloud provider or hardware configuration:
- OS (e.g. from /etc/os-release):
- Kernel (e.g. uname -a):
- Install tools:
- Others:
My workaround: if the PodGroup is unschedulable and its current state is Running, convert it to Unknown instead of letting it become Pending and enter the enqueue action again.
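One possible shape for that change, expressed against the tail of the else branch in the jobStatus excerpt above. This is only a sketch of the described workaround, reusing the excerpt's identifiers; it is not an actual patch from this thread.

// Sketch only: replaces the tail of the else branch in the jobStatus excerpt above.
if int32(allocated) >= jobInfo.PodGroup.Spec.MinMember {
    status.Phase = scheduling.PodGroupRunning
} else if unschedulable && jobInfo.PodGroup.Status.Phase == scheduling.PodGroupRunning {
    // Workaround: a PodGroup that was Running and is now unschedulable is marked
    // Unknown instead of Pending, so it does not pass the enqueue action again.
    status.Phase = scheduling.PodGroupUnknown
} else if jobInfo.PodGroup.Status.Phase != scheduling.PodGroupInqueue {
    status.Phase = scheduling.PodGroupPending
}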
Hello! Looks like there was no activity on this issue for the last 90 days. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! If there will be no activity for 60 days, this issue will be closed (we can always reopen an issue if we need!).
Closing for now as there was no activity for the last 60 days after it was marked as stale; let us know if you need this to be reopened!
@shinytang6 can you show your code please? I met the same problem...