Cannot apply a vcjob with volcano-release-1.6
What happened:
I uninstalled volcano-1.5.1 with kubectl delete -f ./volcano-1.5.1/volcano-development.yaml and reinstalled volcano-release-1.6 with kubectl apply -f ./volcano-release-1.6/volcano-development.yaml. When I apply a vcjob following step 2 of https://volcano.sh/en/docs/tutorials/, kubectl get node outputs No resources found in default namespace, and the vcjob status stays Pending, as follows.
apiVersion: v1
items:
- apiVersion: batch.volcano.sh/v1alpha1
  kind: Job
  metadata:
    annotations:
      kubectl.kubernetes.io/last-applied-configuration: |
        {"apiVersion":"batch.volcano.sh/v1alpha1","kind":"Job","metadata":{"annotations":{},"name":"job-1","namespace":"default"},"spec":{"minAvailable":1,"policies":[{"action":"RestartJob","event":"PodEvicted"}],"queue":"test","schedulerName":"volcano","tasks":[{"name":"nginx","policies":[{"action":"CompleteJob","event":"TaskCompleted"}],"replicas":1,"template":{"spec":{"containers":[{"command":["sleep","10m"],"image":"nginx:latest","name":"nginx","resources":{"limits":{"cpu":1},"requests":{"cpu":1}}}],"restartPolicy":"Never"}}}]}}
    creationTimestamp: "2022-06-13T10:26:42Z"
    generation: 1
    name: job-1
    namespace: default
    resourceVersion: "6854386"
    uid: 16b729d3-085d-4747-86b6-0ceb614b906e
  spec:
    maxRetry: 3
    minAvailable: 1
    policies:
    - action: RestartJob
      event: PodEvicted
    queue: test
    schedulerName: volcano
    tasks:
    - maxRetry: 3
      minAvailable: 1
      name: nginx
      policies:
      - action: CompleteJob
        event: TaskCompleted
      replicas: 1
      template:
        metadata: {}
        spec:
          containers:
          - command:
            - sleep
            - 10m
            image: nginx:latest
            name: nginx
            resources:
              limits:
                cpu: "1"
              requests:
                cpu: "1"
          restartPolicy: Never
  status:
    conditions:
    - lastTransitionTime: "2022-06-13T10:26:44Z"
      status: Pending
    minAvailable: 1
    state:
      lastTransitionTime: "2022-06-13T10:26:44Z"
      phase: Pending
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""
What you expected to happen: the vcjob runs normally, as it did with volcano-1.5.1.
How to reproduce it (as minimally and precisely as possible):
- Install volcano-release-1.6.
- Apply a vcjob as in step 2 of https://volcano.sh/en/docs/tutorials/.
Anything else we need to know?: Is it related to uninstalling volcano-1.5.1?
Environment:
- Volcano Version: 1.6
- Kubernetes version (use kubectl version): 1.21.3
- Cloud provider or hardware configuration: server machines with 4 Nvidia V100
- OS (e.g. from /etc/os-release): CentOS Linux release 7.6.1810 (Core)
- Kernel (e.g. uname -a): 3.10.0-957.5.1.el7.x86_64
- Install tools: kubectl apply -f volcano-development.yaml
- Others:
/assign @Thor-wl Please help to take a look.
Please check the status of volcano components
kubectl get po -n volcano-system
If possible, please attach the logs of the volcano components
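A minimal sketch of how the requested status and logs could be gathered in one go. The function name collect_volcano_logs and its line-count parameter are hypothetical; the deployment names volcano-scheduler, volcano-controllers and volcano-admission are assumed from the default volcano-development.yaml install and may differ in a customized setup.

```shell
# Hypothetical helper: print component status and recent logs for Volcano.
# Usage: collect_volcano_logs [number-of-log-lines]
collect_volcano_logs() {
    tail_lines="${1:-200}"

    # Status of all Volcano components
    kubectl get pods -n volcano-system

    # Recent logs of each component (deployment names are assumed defaults)
    for component in volcano-scheduler volcano-controllers volcano-admission; do
        echo "===== $component ====="
        kubectl logs -n volcano-system "deployment/$component" --tail="$tail_lines"
    done
}
```

For example, collect_volcano_logs 500 > volcano-logs.txt writes the output to a file that can be attached to the issue.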
Any progress on this issue? Can we reproduce it?
@kongjibai Can you give more details about the scenario? For example, please execute kubectl describe vcjob job-1 to see the status and the corresponding podgroup status.
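The request above could be sketched roughly as follows. The function name describe_vcjob is hypothetical, and the pod label volcano.sh/job-name is an assumption about how Volcano labels the pods it creates for a job; if the label differs, a plain kubectl get pods in the namespace serves the same purpose.

```shell
# Hypothetical helper: show a vcjob, the podgroups in its namespace, and its pods.
# Usage: describe_vcjob <job-name> [namespace]
describe_vcjob() {
    job="$1"
    ns="${2:-default}"

    # Job status and events
    kubectl describe vcjob "$job" -n "$ns"

    # PodGroups in the namespace (the job's podgroup is named after the job)
    kubectl describe podgroup -n "$ns"

    # Pods created for the job, if any (label name is an assumption)
    kubectl get pods -n "$ns" -l "volcano.sh/job-name=$job"
}
```

Calling describe_vcjob job-1 default would cover the job from this report.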
Sorry for the long delay in replying. The output is below.
NAME                                  READY   STATUS      RESTARTS   AGE
volcano-admission-6c68cbbf98-s6twp    1/1     Running     0          8m14s
volcano-admission-init-nnbfq          0/1     Completed   0          8m14s
volcano-controllers-f4b69577b-99cfp   1/1     Running     0          8m14s
volcano-scheduler-c98cb745b-kqgpr     1/1     Running     0          8m14s
kubectl describe vcjob job-1
Sorry for the long delay in replying. The output is below; it reports that the pod group is not ready. This works in volcano-release-1.5 but fails in volcano-release-1.6. How can I solve this problem?
Name:         job-1
Namespace:    default
Labels:       <none>
Annotations:  <none>
API Version:  batch.volcano.sh/v1alpha1
Kind:         Job
Metadata:
  Creation Timestamp:  2022-08-16T09:30:04Z
  Generation:          1
  Managed Fields:
    API Version:  batch.volcano.sh/v1alpha1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        .:
        f:conditions:
        f:minAvailable:
        f:state:
          .:
          f:lastTransitionTime:
          f:phase:
    Manager:      Go-http-client
    Operation:    Update
    Time:         2022-08-16T09:30:04Z
    API Version:  batch.volcano.sh/v1alpha1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:kubectl.kubernetes.io/last-applied-configuration:
      f:spec:
        .:
        f:minAvailable:
        f:policies:
        f:queue:
        f:schedulerName:
        f:tasks:
    Manager:      kubectl-client-side-apply
    Operation:    Update
    Time:         2022-08-16T09:30:04Z
  Resource Version:  5063533
  UID:               2cadba62-3f14-4c59-b09b-5bfcb8cce0d5
Spec:
  Max Retry:      3
  Min Available:  1
  Policies:
    Action:  RestartJob
    Event:   PodEvicted
  Queue:           test
  Scheduler Name:  volcano
  Tasks:
    Max Retry:      3
    Min Available:  1
    Name:           nginx
    Policies:
      Action:  CompleteJob
      Event:   TaskCompleted
    Replicas:  1
    Template:
      Metadata:
      Spec:
        Containers:
          Command:
            sleep
            10m
          Image:  nginx:latest
          Name:   nginx
          Resources:
            Limits:
              Cpu:  1
            Requests:
              Cpu:  1
        Restart Policy:  Never
Status:
  Conditions:
    Last Transition Time:  2022-08-16T09:30:10Z
    Status:                Pending
  Min Available:  1
  State:
    Last Transition Time:  2022-08-16T09:30:10Z
    Phase:                 Pending
Events:
  Type     Reason           Age   From                   Message
  ----     ------           ----  ----                   -------
  Warning  PodGroupPending  2m6s  vc-controller-manager  PodGroup default:job-1 unschedule,reason: 1/0 tasks in gang unschedulable: pod group is not ready, 1 minAvailable
This output provides little useful information for debugging. Have you described the podgroup for more details, or searched the logs?
The podgroup is described below. It reports NotEnoughResources, but I'm sure the k8s cluster has enough resources, including CPU, memory and GPU, because everything works in volcano-release-1.5.
Name:         job-1-2cadba62-3f14-4c59-b09b-5bfcb8cce0d5
Namespace:    default
Labels:       <none>
Annotations:  <none>
API Version:  scheduling.volcano.sh/v1beta1
Kind:         PodGroup
Metadata:
  Creation Timestamp:  2022-08-16T09:30:04Z
  Generation:          955
  Managed Fields:
    API Version:  scheduling.volcano.sh/v1beta1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:kubectl.kubernetes.io/last-applied-configuration:
        f:ownerReferences:
          .:
          k:{"uid":"2cadba62-3f14-4c59-b09b-5bfcb8cce0d5"}:
            .:
            f:apiVersion:
            f:blockOwnerDeletion:
            f:controller:
            f:kind:
            f:name:
            f:uid:
      f:spec:
        .:
        f:minMember:
        f:minResources:
          .:
          f:count/pods:
          f:cpu:
          f:limits.cpu:
          f:pods:
          f:requests.cpu:
        f:minTaskMember:
          .:
          f:nginx:
        f:queue:
      f:status:
    Manager:      Go-http-client
    Operation:    Update
    Time:         2022-08-16T09:30:04Z
    API Version:  scheduling.volcano.sh/v1beta1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        f:conditions:
        f:phase:
    Manager:      vc-scheduler
    Operation:    Update
    Time:         2022-08-16T09:30:05Z
  Owner References:
    API Version:           batch.volcano.sh/v1alpha1
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  Job
    Name:                  job-1
    UID:                   2cadba62-3f14-4c59-b09b-5bfcb8cce0d5
  Resource Version:  5159589
  UID:               95e8e0c3-12c1-4455-9edd-aa4e298800d9
Spec:
  Min Member:  1
  Min Resources:
    count/pods:    1
    Cpu:           1
    limits.cpu:    1
    Pods:          1
    requests.cpu:  1
  Min Task Member:
    Nginx:  1
  Queue:  test
Status:
  Conditions:
    Last Transition Time:  2022-08-17T03:00:41Z
    Message:               1/0 tasks in gang unschedulable: pod group is not ready, 1 minAvailable
    Reason:                NotEnoughResources
    Status:                True
    Transition ID:         97f14a76-f54d-441d-992d-7b732621e7a3
    Type:                  Unschedulable
  Phase:  Pending
Events:
  Type     Reason         Age                   From     Message
  ----     ------         ----                  ----     -------
  Warning  Unschedulable  58s (x62798 over 17h)  volcano  0/0 tasks in gang unschedulable: pod group is not ready, 1 minAvailable
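Since the podgroup reports NotEnoughResources while the cluster is believed to have spare capacity, one way to cross-check is to look at what the scheduler can see: the nodes, their allocatable resources, and the queue the job was submitted to. The sketch below is only a suggestion, not something confirmed in this thread. The function name check_cluster_capacity is hypothetical, and the GPU resource name nvidia.com/gpu is assumed from the standard NVIDIA device plugin. It is also worth ruling out that the test queue went missing: deleting a CRD removes its custom resources, so if the uninstall removed the Volcano CRDs, queues created under 1.5.1 would have been deleted with them.

```shell
# Hypothetical helper: show nodes, allocatable resources, and a Volcano queue.
# Usage: check_cluster_capacity [queue-name]
check_cluster_capacity() {
    queue="${1:-test}"

    # Nodes and their readiness
    kubectl get nodes -o wide

    # Allocatable CPU, memory and GPU per node (GPU resource name is an assumption)
    kubectl get nodes -o custom-columns='NAME:.metadata.name,CPU:.status.allocatable.cpu,MEMORY:.status.allocatable.memory,GPU:.status.allocatable.nvidia\.com/gpu'

    # Queue definition and state; fails if the queue does not exist
    kubectl get queue "$queue" -o yaml
}
```

Running check_cluster_capacity test matches the queue used by job-1 in this report.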
Hello 👋 Looks like there was no activity on this issue for last 90 days. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity for 60 days, this issue will be closed (we can always reopen an issue if we need!).
Closing for now as there was no activity for last 60 days after marked as stale, let us know if you need this to be reopened! 🤗