
Cannot apply a vcjob with volcano-release-1.6

Open kongjibai opened this issue 2 years ago • 8 comments

What happened: I uninstalled volcano-1.5.1 with kubectl delete -f ./volcano-1.5.1/volcano-development.yaml and reinstalled volcano-release-1.6 with kubectl apply -f ./volcano-release-1.6/volcano-development.yaml. When I apply a vcjob following step 2 of https://volcano.sh/en/docs/tutorials/, kubectl get node outputs "No resources found in default namespace", and the vcjob status stays Pending, as follows.

apiVersion: v1
items:
- apiVersion: batch.volcano.sh/v1alpha1
  kind: Job
  metadata:
    annotations:
      kubectl.kubernetes.io/last-applied-configuration: |
        {"apiVersion":"batch.volcano.sh/v1alpha1","kind":"Job","metadata":{"annotations":{},"name":"job-1","namespace":"default"},"spec":{"minAvailable":1,"policies":[{"action":"RestartJob","event":"PodEvicted"}],"queue":"test","schedulerName":"volcano","tasks":[{"name":"nginx","policies":[{"action":"CompleteJob","event":"TaskCompleted"}],"replicas":1,"template":{"spec":{"containers":[{"command":["sleep","10m"],"image":"nginx:latest","name":"nginx","resources":{"limits":{"cpu":1},"requests":{"cpu":1}}}],"restartPolicy":"Never"}}}]}}
    creationTimestamp: "2022-06-13T10:26:42Z"
    generation: 1
    name: job-1
    namespace: default
    resourceVersion: "6854386"
    uid: 16b729d3-085d-4747-86b6-0ceb614b906e
  spec:
    maxRetry: 3
    minAvailable: 1
    policies:
    - action: RestartJob
      event: PodEvicted
    queue: test
    schedulerName: volcano
    tasks:
    - maxRetry: 3
      minAvailable: 1
      name: nginx
      policies:
      - action: CompleteJob
        event: TaskCompleted
      replicas: 1
      template:
        metadata: {}
        spec:
          containers:
          - command:
            - sleep
            - 10m
            image: nginx:latest
            name: nginx
            resources:
              limits:
                cpu: "1"
              requests:
                cpu: "1"
          restartPolicy: Never
  status:
    conditions:
    - lastTransitionTime: "2022-06-13T10:26:44Z"
      status: Pending
    minAvailable: 1
    state:
      lastTransitionTime: "2022-06-13T10:26:44Z"
      phase: Pending
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

What you expected to happen: the vcjob runs normally, as it did with volcano-1.5.1.

How to reproduce it (as minimally and precisely as possible):

  1. Install volcano-release-1.6.
  2. Apply a vcjob as in step 2 of https://volcano.sh/en/docs/tutorials/.

Anything else we need to know?: Could this be related to uninstalling volcano-1.5.1?

Environment:

  • Volcano Version: 1.6
  • Kubernetes version (use kubectl version): 1.21.3
  • Cloud provider or hardware configuration: server machines with 4 Nvidia V100
  • OS (e.g. from /etc/os-release): CentOS Linux release 7.6.1810 (Core)
  • Kernel (e.g. uname -a): 3.10.0-957.5.1.el7.x86_64
  • Install tools: kubectl apply -f volcano-development.yaml
  • Others:

kongjibai avatar Jun 13 '22 10:06 kongjibai

/assign @Thor-wl Please help to take a look.

william-wang avatar Jun 14 '22 08:06 william-wang

Please check the status of volcano components

kubectl get po -n volcano-system

If possible, please attach the logs of the volcano components
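
A minimal sketch of how the logs could be collected, assuming the default Deployment names (volcano-scheduler, volcano-controllers, volcano-admission) that volcano-development.yaml creates in the volcano-system namespace. The helper name collect_volcano_logs and the 200-line tail are illustrative choices, not part of Volcano:

```shell
# Hypothetical helper: save recent logs of each Volcano component to <name>.log
# in the current directory.
# Assumes the default Deployment names from volcano-development.yaml.
collect_volcano_logs() {
  ns="${1:-volcano-system}"
  for d in volcano-scheduler volcano-controllers volcano-admission; do
    kubectl logs -n "$ns" "deployment/$d" --tail=200 > "$d.log"
  done
}
```

Calling collect_volcano_logs in an empty directory should leave three .log files that can be attached to the issue.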

hwdef avatar Jun 14 '22 09:06 hwdef

Any progress on this issue? Can we reproduce it?

william-wang avatar Jun 27 '22 02:06 william-wang

@kongjibai Can you give more details about the scenario? For example, please execute kubectl describe vcjob job-1 to see the status and the corresponding podgroup status.
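
For instance, the relevant objects could be inspected roughly as follows. This is only a sketch: the helper name inspect_vcjob is hypothetical, the job is assumed to live in the default namespace unless another one is passed, and the queue name has to be taken from the job's spec.queue field.

```shell
# Hypothetical helper: show the job, the podgroups in its namespace, and its queue.
inspect_vcjob() {
  job="$1"; queue="$2"; ns="${3:-default}"
  kubectl describe vcjob "$job" -n "$ns"
  # The podgroup created for the job appears in this list;
  # describe it by name to see its conditions and events.
  kubectl get podgroup -n "$ns"
  # Queues are cluster-scoped; check that the job's queue exists and inspect its state.
  kubectl get queue "$queue" -o yaml
}
```

For the job in this issue the call would be inspect_vcjob job-1 test.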

Thor-wl avatar Jul 05 '22 06:07 Thor-wl

Please check the status of volcano components

kubectl get po -n volcano-system

If possible, please attach the logs of the volcano components

Sorry for the late reply. The output is below:

NAME                                  READY   STATUS      RESTARTS   AGE
volcano-admission-6c68cbbf98-s6twp    1/1     Running     0          8m14s
volcano-admission-init-nnbfq          0/1     Completed   0          8m14s
volcano-controllers-f4b69577b-99cfp   1/1     Running     0          8m14s
volcano-scheduler-c98cb745b-kqgpr     1/1     Running     0          8m14s

kongjibai avatar Aug 16 '22 09:08 kongjibai

kubectl describe vcjob job-1

Sorry for the late reply. The output is below; it reports that the pod group is not ready. This works in volcano-release-1.5 but fails in volcano-release-1.6. How can I solve this problem?

Name:         job-1
Namespace:    default
Labels:       <none>
Annotations:  <none>
API Version:  batch.volcano.sh/v1alpha1
Kind:         Job
Metadata:
  Creation Timestamp:  2022-08-16T09:30:04Z
  Generation:          1
  Managed Fields:
    API Version:  batch.volcano.sh/v1alpha1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        .:
        f:conditions:
        f:minAvailable:
        f:state:
          .:
          f:lastTransitionTime:
          f:phase:
    Manager:      Go-http-client
    Operation:    Update
    Time:         2022-08-16T09:30:04Z
    API Version:  batch.volcano.sh/v1alpha1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:kubectl.kubernetes.io/last-applied-configuration:
      f:spec:
        .:
        f:minAvailable:
        f:policies:
        f:queue:
        f:schedulerName:
        f:tasks:
    Manager:         kubectl-client-side-apply
    Operation:       Update
    Time:            2022-08-16T09:30:04Z
  Resource Version:  5063533
  UID:               2cadba62-3f14-4c59-b09b-5bfcb8cce0d5
Spec:
  Max Retry:      3
  Min Available:  1
  Policies:
    Action:        RestartJob
    Event:         PodEvicted
  Queue:           test
  Scheduler Name:  volcano
  Tasks:
    Max Retry:      3
    Min Available:  1
    Name:           nginx
    Policies:
      Action:  CompleteJob
      Event:   TaskCompleted
    Replicas:  1
    Template:
      Metadata:
      Spec:
        Containers:
          Command:
            sleep
            10m
          Image:  nginx:latest
          Name:   nginx
          Resources:
            Limits:
              Cpu:  1
            Requests:
              Cpu:       1
        Restart Policy:  Never
Status:
  Conditions:
    Last Transition Time:  2022-08-16T09:30:10Z
    Status:                Pending
  Min Available:           1
  State:
    Last Transition Time:  2022-08-16T09:30:10Z
    Phase:                 Pending
Events:
  Type     Reason           Age   From                   Message
  ----     ------           ----  ----                   -------
  Warning  PodGroupPending  2m6s  vc-controller-manager  PodGroup default:job-1 unschedule,reason: 1/0 tasks in gang unschedulable: pod group is not ready, 1 minAvailable

kongjibai avatar Aug 16 '22 09:08 kongjibai

This output provides little useful information for debugging. Have you described the podgroup for more details, or searched the logs?

Thor-wl avatar Aug 17 '22 01:08 Thor-wl

This output provides little useful information for debugging. Have you described the podgroup for more details, or searched the logs?

The podgroup description is below. It reports NotEnoughResources, but I'm sure the k8s cluster has enough resources, including CPU, memory, and GPU, because everything works in volcano-release-1.5.

Name:         job-1-2cadba62-3f14-4c59-b09b-5bfcb8cce0d5
Namespace:    default
Labels:       <none>
Annotations:  <none>
API Version:  scheduling.volcano.sh/v1beta1
Kind:         PodGroup
Metadata:
  Creation Timestamp:  2022-08-16T09:30:04Z
  Generation:          955
  Managed Fields:
    API Version:  scheduling.volcano.sh/v1beta1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:kubectl.kubernetes.io/last-applied-configuration:
        f:ownerReferences:
          .:
          k:{"uid":"2cadba62-3f14-4c59-b09b-5bfcb8cce0d5"}:
            .:
            f:apiVersion:
            f:blockOwnerDeletion:
            f:controller:
            f:kind:
            f:name:
            f:uid:
      f:spec:
        .:
        f:minMember:
        f:minResources:
          .:
          f:count/pods:
          f:cpu:
          f:limits.cpu:
          f:pods:
          f:requests.cpu:
        f:minTaskMember:
          .:
          f:nginx:
        f:queue:
      f:status:
    Manager:      Go-http-client
    Operation:    Update
    Time:         2022-08-16T09:30:04Z
    API Version:  scheduling.volcano.sh/v1beta1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        f:conditions:
        f:phase:
    Manager:    vc-scheduler
    Operation:  Update
    Time:       2022-08-16T09:30:05Z
  Owner References:
    API Version:           batch.volcano.sh/v1alpha1
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  Job
    Name:                  job-1
    UID:                   2cadba62-3f14-4c59-b09b-5bfcb8cce0d5
  Resource Version:        5159589
  UID:                     95e8e0c3-12c1-4455-9edd-aa4e298800d9
Spec:
  Min Member:  1
  Min Resources:
    count/pods:    1
    Cpu:           1
    limits.cpu:    1
    Pods:          1
    requests.cpu:  1
  Min Task Member:
    Nginx:  1
  Queue:    test
Status:
  Conditions:
    Last Transition Time:  2022-08-17T03:00:41Z
    Message:               1/0 tasks in gang unschedulable: pod group is not ready, 1 minAvailable
    Reason:                NotEnoughResources
    Status:                True
    Transition ID:         97f14a76-f54d-441d-992d-7b732621e7a3
    Type:                  Unschedulable
  Phase:                   Pending
Events:
  Type     Reason         Age                    From     Message
  ----     ------         ----                   ----     -------
  Warning  Unschedulable  58s (x62798 over 17h)  volcano  0/0 tasks in gang unschedulable: pod group is not ready, 1 minAvailable

kongjibai avatar Aug 17 '22 03:08 kongjibai