[BUG] K8 Job does not get marked as completed after the pod succeeds in AKS version 1.25.5

Open rajivml opened this issue 2 years ago • 17 comments

If the GPU node pool in AKS version 1.25.5 has the taint nvidia.com/gpu:present (key nvidia.com/gpu, value present), Kubernetes Jobs that tolerate this taint and are scheduled on the GPU node pool are not marked as completed, even after the pod has succeeded.

Steps to reproduce:

1. Create a GPU node pool with the taint nvidia.com/gpu:present in AKS version 1.25.5.
2. Create a simple Kubernetes Job, something like:

apiVersion: batch/v1
kind: Job
metadata:
  name: simple-job
spec:
  template:
    spec:
      containers:
        - name: pi
          image: ubuntu:16.04 ### tried with ubuntu 18.04, 20.04 and 
          command: ["echo",  "world"]
          resources:
            requests:
              cpu: '1'
              memory: 1Gi
              nvidia.com/gpu: '1'
            limits:
              nvidia.com/gpu: '1'
      restartPolicy: Never
      tolerations:
        - key: nvidia.com/gpu
          operator: Equal
          value: present
          effect: NoSchedule
  backoffLimit: 4
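
To make the taint and toleration relationship in this manifest concrete, here is a minimal Python sketch of the matching rules Kubernetes applies between a toleration and a taint. The helper name `toleration_matches` is hypothetical, and the `NoSchedule` effect on the taint is an assumption taken from the toleration above, since the report only gives the taint as nvidia.com/gpu:present.

```python
def toleration_matches(toleration: dict, taint: dict) -> bool:
    """Return True if the toleration tolerates the taint."""
    operator = toleration.get("operator", "Equal")  # Equal is the default operator
    key = toleration.get("key", "")
    effect = toleration.get("effect", "")

    # An empty key is only meaningful together with operator Exists.
    if not key and operator != "Exists":
        return False
    # A non-empty key must equal the taint key.
    if key and key != taint.get("key", ""):
        return False
    # An empty effect on the toleration matches every taint effect.
    if effect and effect != taint.get("effect", ""):
        return False

    if operator == "Exists":
        return True
    if operator == "Equal":
        return toleration.get("value", "") == taint.get("value", "")
    return False


# Assumed taint on the GPU node pool (effect is not stated in the report).
gpu_taint = {"key": "nvidia.com/gpu", "value": "present", "effect": "NoSchedule"}

# Toleration copied from the Job manifest above.
job_toleration = {
    "key": "nvidia.com/gpu",
    "operator": "Equal",
    "value": "present",
    "effect": "NoSchedule",
}

print(toleration_matches(job_toleration, gpu_taint))
```

Under these assumptions the toleration matches, which is consistent with the pod being scheduled and running; the problem reported here comes after scheduling.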

The pod succeeds, but the Job is not marked as completed. When we describe the Job, we see only the event indicating that the pod was created; no other events are displayed. Note that the same Job, with the same taints and tolerations, works fine on GPU node pools in AKS version 1.24.3.
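
The symptom can also be checked from the Job object itself rather than from events. The sketch below assumes the Job has been dumped to JSON (for example with `kubectl get job simple-job -o json`) and parsed into a dict; `job_summary` is a hypothetical helper, and both sample payloads are illustrative, not output captured from the affected cluster.

```python
def job_summary(job: dict) -> dict:
    """Summarize completion state from a Job object parsed from JSON."""
    status = job.get("status", {})
    conditions = status.get("conditions") or []

    def has_condition(cond_type: str) -> bool:
        # A condition only counts when its status is the string "True".
        return any(
            c.get("type") == cond_type and c.get("status") == "True"
            for c in conditions
        )

    return {
        "complete": has_condition("Complete"),
        "failed": has_condition("Failed"),
        "succeeded_pods": status.get("succeeded", 0),
        "active_pods": status.get("active", 0),
    }


# Illustrative payloads only (assumptions, not data from this report).
stuck_job = {"status": {}}
finished_job = {
    "status": {
        "succeeded": 1,
        "conditions": [{"type": "Complete", "status": "True"}],
    }
}

print(job_summary(stuck_job))
print(job_summary(finished_job))
```

A Job in the state described here would be expected to report no `Complete` condition even though its pod has phase `Succeeded`.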

rajivml avatar Mar 18 '23 05:03 rajivml

We tried several other toleration keys, and the Job completed normally with all of them. Only the toleration key nvidia.com/gpu shows this behavior.

We use the toleration key nvidia.com/gpu everywhere else in our code, so changing it would be a big change for us.

rajivml avatar Mar 18 '23 05:03 rajivml

Action required from @Azure/aks-pm

ghost avatar Apr 18 '23 16:04 ghost

Issue needing attention of @Azure/aks-leads

ghost avatar May 03 '23 18:05 ghost

@rajivml We observed similar behavior. In our case it was caused by an admission controller (Kyverno) interfering with an update request made by the job controller.
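
One way an interfering admission controller could produce this symptom, offered as an assumption rather than a confirmed diagnosis for this issue: when Jobs are tracked with finalizers, the job controller has to update each finished pod to remove the `batch.kubernetes.io/job-tracking` finalizer before counting it toward completion. If that update is rejected or altered, the finalizer stays on the pod. The Python sketch below lists finished pods that still carry the finalizer, given pod objects parsed from `kubectl get pods -o json`; the helper name and the sample pod are hypothetical.

```python
JOB_TRACKING_FINALIZER = "batch.kubernetes.io/job-tracking"


def pods_awaiting_job_accounting(pods: list) -> list:
    """Return names of finished pods that still carry the job-tracking finalizer."""
    waiting = []
    for pod in pods:
        metadata = pod.get("metadata", {})
        phase = pod.get("status", {}).get("phase")
        finalizers = metadata.get("finalizers") or []
        # A finished pod that still has the finalizer has not been accounted for yet.
        if phase in ("Succeeded", "Failed") and JOB_TRACKING_FINALIZER in finalizers:
            waiting.append(metadata.get("name", "<unnamed>"))
    return waiting


# Hypothetical pod list for illustration (not taken from the affected cluster).
sample_pods = [
    {
        "metadata": {"name": "simple-job-abcde", "finalizers": [JOB_TRACKING_FINALIZER]},
        "status": {"phase": "Succeeded"},
    },
    {
        "metadata": {"name": "other-pod"},
        "status": {"phase": "Running"},
    },
]

print(pods_awaiting_job_accounting(sample_pods))
```

A non-empty result for a Job whose pods finished some time ago would suggest looking at admission webhooks or policies that match pod updates.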

masinger avatar Jul 07 '23 11:07 masinger

Hi @masinger, could you please confirm whether the issue you reported shares the same root cause as https://github.com/Azure/AKS/issues/3549#issuecomment-1625274572?

AllenWen-at-Azure avatar Sep 03 '24 09:09 AllenWen-at-Azure

This issue will now be closed because it hasn't had any activity for 7 days after being marked stale. rajivml, feel free to comment again in the next 7 days to reopen it, or open a new issue after that time if you still have a question, issue, or suggestion.