How to deal with conflicting job statuses
Expected Behavior
We've encountered a job that started multiple pods, with the latest pod not being the one that ran successfully. Basically, the job knows that one pod succeeded, but it also knows that one pod failed. In this case only the failed pod is known, so the details shown come from a different pod than the one that succeeded.
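One possible way to handle this, sketched here under assumptions: when a job has several pods, pick the pod to report on by phase first and by creation time second, so that a succeeded pod wins over a failed one. The function name `pick_representative_pod`, the priority order, and the second pod name in the usage note are hypothetical and not taken from the nerd code base; the pod dicts only mirror the Kubernetes Pod fields `status.phase` and `metadata.creationTimestamp`.

```python
# Hypothetical sketch: choose which pod's details to show for a job.
# Lower number = preferred phase. The ordering is an assumption, not nerd's actual logic.
PHASE_PRIORITY = {
    "Succeeded": 0,
    "Running": 1,
    "Pending": 2,
    "Unknown": 3,
    "Failed": 4,
}


def _rank(pod):
    """Build a sort key where a larger tuple means a better candidate."""
    phase = pod.get("status", {}).get("phase", "Unknown")
    created = pod.get("metadata", {}).get("creationTimestamp", "")
    # RFC 3339 UTC timestamps in the same format compare correctly as strings.
    return (-PHASE_PRIORITY.get(phase, PHASE_PRIORITY["Unknown"]), created)


def pick_representative_pod(pods):
    """Return the pod that best describes the job, or None if there are no pods."""
    best = None
    for pod in pods:
        if best is None or _rank(pod) > _rank(best):
            best = pod
    return best
```

With a failed pod such as `nlz-nerdj-6sfvd-lj4xf` and a second, succeeded pod of the same job (name made up for illustration), the sketch returns the succeeded one regardless of which was created last. Ties within the same phase fall back to the newest pod.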
Anything else we need to know?
The job that showed this behaviour:
# Please edit the object below. Lines beginning with a '#' will be ignored,
# and an empty file will abort the edit. If an error occurs while saving this file will be
# reopened with the relevant failures.
#
apiVersion: batch/v1
kind: Job
metadata:
  creationTimestamp: 2018-03-06T12:07:24Z
  generateName: nlz-nerdj-
  labels:
    nerd-app: cli
  name: nlz-nerdj-6sfvd
  namespace: advanderveer7-default
  resourceVersion: "16834"
  selfLink: /apis/batch/v1/namespaces/advanderveer7-default/jobs/nlz-nerdj-6sfvd
  uid: ee15eb6a-2136-11e8-a29c-3863bb49f468
spec:
  backoffLimit: 3
  completions: 1
  parallelism: 1
  selector:
    matchLabels:
      controller-uid: ee15eb6a-2136-11e8-a29c-3863bb49f468
  template:
    metadata:
      creationTimestamp: null
      labels:
        controller-uid: ee15eb6a-2136-11e8-a29c-3863bb49f468
        job-name: nlz-nerdj-6sfvd
        nerd-app: cli
    spec:
      containers:
      - args:
        - co2_calc.py
        - "5"
        image: nerdalize/pythonapp:v3
        imagePullPolicy: IfNotPresent
        name: main
        resources:
          limits:
            cpu: "2"
            memory: 3Gi
          requests:
            cpu: "2"
            memory: 3Gi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /input
          name: 2f696e707574
        - mountPath: /output
          name: 2f6f7574707574
      dnsPolicy: ClusterFirst
      restartPolicy: Never
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
      volumes:
      - flexVolume:
          driver: nerdalize.com/dataset
          options:
            input/dataset: d-clgbn
        name: 2f696e707574
      - flexVolume:
          driver: nerdalize.com/dataset
          options:
            output/dataset: d-zs4x9
        name: 2f6f7574707574
status:
  completionTime: 2018-03-06T12:10:54Z
  conditions:
  - lastProbeTime: 2018-03-06T12:10:54Z
    lastTransitionTime: 2018-03-06T12:10:54Z
    status: "True"
    type: Complete
  failed: 1
  startTime: 2018-03-06T12:07:24Z
  succeeded: 1
Edit cancelled, no changes made.
Inspecting the job's pods only shows the following one:
# Please edit the object below. Lines beginning with a '#' will be ignored,
# and an empty file will abort the edit. If an error occurs while saving this file will be
# reopened with the relevant failures.
#
apiVersion: v1
kind: Pod
metadata:
  annotations:
    kubernetes.io/created-by: |
      {"kind":"SerializedReference","apiVersion":"v1","reference":{"kind":"Job","namespace":"advanderveer7-default","name":"nlz-nerdj-6sfvd","uid":"ee15eb6a-2136-11e8-a29c-3863bb49f468","apiVersion":"batch","resourceVersion":"16123"}}
    kubernetes.io/psp: user-psp
    seccomp.security.alpha.kubernetes.io/pod: docker/default
  creationTimestamp: 2018-03-06T12:07:24Z
  generateName: nlz-nerdj-6sfvd-
  labels:
    controller-uid: ee15eb6a-2136-11e8-a29c-3863bb49f468
    job-name: nlz-nerdj-6sfvd
    nerd-app: cli
  name: nlz-nerdj-6sfvd-lj4xf
  namespace: advanderveer7-default
  ownerReferences:
  - apiVersion: batch/v1
    blockOwnerDeletion: true
    controller: true
    kind: Job
    name: nlz-nerdj-6sfvd
    uid: ee15eb6a-2136-11e8-a29c-3863bb49f468
  resourceVersion: "16563"
  selfLink: /api/v1/namespaces/advanderveer7-default/pods/nlz-nerdj-6sfvd-lj4xf
  uid: ee180488-2136-11e8-a29c-3863bb49f468
spec:
  containers:
  - args:
    - co2_calc.py
    - "5"
    image: nerdalize/pythonapp:v3
    imagePullPolicy: IfNotPresent
    name: main
    resources:
      limits:
        cpu: "2"
        memory: 3Gi
      requests:
        cpu: "2"
        memory: 3Gi
    securityContext:
      allowPrivilegeEscalation: false
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /input
      name: 2f696e707574
    - mountPath: /output
      name: 2f6f7574707574
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: default-token-rkc8q
      readOnly: true
  dnsPolicy: ClusterFirst
  nodeName: bh80tm
  restartPolicy: Never
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.alpha.kubernetes.io/notReady
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.alpha.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - flexVolume:
      driver: nerdalize.com/dataset
      options:
        input/dataset: d-clgbn
    name: 2f696e707574
  - flexVolume:
      driver: nerdalize.com/dataset
      options:
        output/dataset: d-zs4x9
    name: 2f6f7574707574
  - name: default-token-rkc8q
    secret:
      defaultMode: 420
      secretName: default-token-rkc8q
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: 2018-03-06T12:07:24Z
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: 2018-03-06T12:09:34Z
    message: 'containers with unready status: [main]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: 2018-03-06T12:07:24Z
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: docker://0e2fa134a3b0bc392a94571a558225ff2852a036d0f86e489bd2c7f77de36a76
    image: nerdalize/pythonapp:v3
    imageID: docker-pullable://nerdalize/pythonapp@sha256:84ff6f7ce3be64c86d24c1b86749dfe7cd7a6871e5014de923402bd5c6a4fba0
    lastState: {}
    name: main
    ready: false
    restartCount: 0
    state:
      terminated:
        containerID: docker://0e2fa134a3b0bc392a94571a558225ff2852a036d0f86e489bd2c7f77de36a76
        exitCode: 1
        finishedAt: 2018-03-06T12:09:34Z
        reason: Error
        startedAt: 2018-03-06T12:07:25Z
  hostIP: 100.66.0.46
  phase: Failed
  podIP: 10.233.67.9
  qosClass: Guaranteed
  startTime: 2018-03-06T12:07:24Z
Edit cancelled, no changes made.
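The job above reports `failed: 1` and `succeeded: 1` at the same time, while its only condition is `Complete`. A minimal sketch of one way to resolve that conflict, assuming the job-level conditions are treated as authoritative and the counters are only used as a fallback: the function name `summarize_job_status` and the returned labels are hypothetical, and the input is simply the job's `status` block as a dict.

```python
def summarize_job_status(status):
    """Reduce a Job's `status` block to one label, trusting conditions over counters."""
    for condition in status.get("conditions") or []:
        # Only conditions that are currently true describe the job's state.
        if condition.get("status") != "True":
            continue
        if condition.get("type") == "Complete":
            return "Completed"
        if condition.get("type") == "Failed":
            return "Failed"
    # No terminal condition yet: fall back to the pod counters.
    if status.get("active", 0) > 0:
        return "Running"
    return "Pending"


# Status values copied from the job shown in this issue.
job_status = {
    "completionTime": "2018-03-06T12:10:54Z",
    "conditions": [
        {
            "lastProbeTime": "2018-03-06T12:10:54Z",
            "lastTransitionTime": "2018-03-06T12:10:54Z",
            "status": "True",
            "type": "Complete",
        }
    ],
    "failed": 1,
    "startTime": "2018-03-06T12:07:24Z",
    "succeeded": 1,
}

print(summarize_job_status(job_status))  # prints "Completed"
```

Under this approach the failed attempt would still be visible through the `failed` counter, but it would no longer decide the overall status or which pod's details are shown.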