AKS
[BUG] K8 Job does not get marked as completed after the pod succeeds in AKS version 1.25.5
If a GPU node pool in AKS version 1.25.5 has the taint nvidia.com/gpu:present (key nvidia.com/gpu, value present), Kubernetes Jobs that tolerate this taint and are scheduled on that node pool are not marked as completed, even after the pod has succeeded.
Steps to reproduce:
Create a GPU node pool with the taint nvidia.com/gpu:present in AKS version 1.25.5.
Create a simple Kubernetes Job, for example:
apiVersion: batch/v1
kind: Job
metadata:
  name: simple-job
spec:
  template:
    spec:
      containers:
      - name: pi
        image: ubuntu:16.04 ### tried with ubuntu 18.04, 20.04 and
        command: ["echo", "world"]
        resources:
          requests:
            cpu: '1'
            memory: 1Gi
            nvidia.com/gpu: '1'
          limits:
            nvidia.com/gpu: '1'
      restartPolicy: Never
      tolerations:
      - key: nvidia.com/gpu
        operator: Equal
        value: present
        effect: NoSchedule
  backoffLimit: 4
The pod succeeds, but the Job is not marked as completed. When we describe the Job, the only event shown is the one indicating that the pod was created; no other events appear. Note that the same Job, with the same taints and tolerations, works fine on GPU node pools in AKS version 1.24.3.
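One way to confirm the symptom without relying on the describe output is to compare the pod phase with the Job's conditions. The following is only a rough sketch: it assumes the Job and its pods were dumped to JSON beforehand (for example with `kubectl get job simple-job -o json` and `kubectl get pods -l job-name=simple-job -o json`), and the function names are made up for illustration.

```python
import json
import sys


def job_is_complete(job: dict) -> bool:
    """True if the Job has a condition of type Complete with status "True"."""
    conditions = job.get("status", {}).get("conditions") or []
    return any(
        c.get("type") == "Complete" and c.get("status") == "True"
        for c in conditions
    )


def succeeded_pods(pod_list: dict) -> list:
    """Names of pods in the list whose phase is Succeeded."""
    return [
        p["metadata"]["name"]
        for p in pod_list.get("items", [])
        if p.get("status", {}).get("phase") == "Succeeded"
    ]


def looks_stuck(job: dict, pod_list: dict) -> bool:
    """Assumes a single-completion Job, like the manifest above:
    one succeeded pod should be enough for the Job to become Complete."""
    return bool(succeeded_pods(pod_list)) and not job_is_complete(job)


# Hypothetical usage: python check_job.py job.json pods.json
if __name__ == "__main__" and len(sys.argv) == 3:
    with open(sys.argv[1]) as f:
        job = json.load(f)
    with open(sys.argv[2]) as f:
        pods = json.load(f)
    print("stuck" if looks_stuck(job, pods) else "ok")
```

If this reports "stuck" on 1.25.5 and "ok" on 1.24.3 for the same manifest, that matches the behavior described here.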
We tried several other toleration keys, and the Job completed correctly with all of them; the problem occurs only with the key nvidia.com/gpu.
We use the toleration key nvidia.com/gpu everywhere else in our code, so switching to a different key would be a big change for us.
Action required from @Azure/aks-pm
Issue needing attention of @Azure/aks-leads
@rajivml We observed similar behavior. In our case it was caused by an admission controller (Kyverno) interfering with an update request from the job-controller.
Hi @masinger, could you please confirm whether the issue you reported shares the same root cause as this one: https://github.com/Azure/AKS/issues/3549#issuecomment-1625274572
This issue will now be closed because it hasn't had any activity for 7 days after stale. rajivml feel free to comment again on the next 7 days to reopen or open a new issue after that time if you still have a question/issue or suggestion.