nerd How to deal with conflicting job statuses

Open advdv opened this issue 7 years ago • 0 comments

Expected Behavior

We've encountered a job for which multiple pods were started, with the latest pod not being the one that ran successfully. The job status records that one pod succeeded and that one pod failed. In this case, however, only the failed pod is known, so the details shown come from a different pod than the one that succeeded.
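One possible way to resolve the conflict is to prefer a succeeded pod over the most recently created one when choosing which pod's details to show. The sketch below assumes the pods are available as plain dictionaries shaped like the Kubernetes Pod objects in this issue; the function name `pick_pod_for_details` is hypothetical and is not part of the nerd code base.

```python
def pick_pod_for_details(pods):
    """Return the pod whose details best represent the job, or None.

    Assumption: each pod is a dict with `metadata.creationTimestamp`
    and `status.phase`, as in the Kubernetes Pod objects shown in this issue.
    """
    if not pods:
        return None

    def created(pod):
        # RFC 3339 UTC timestamps such as "2018-03-06T12:07:24Z" sort
        # chronologically when compared as strings.
        return pod["metadata"]["creationTimestamp"]

    # Prefer pods that ran to completion; fall back to all pods otherwise.
    succeeded = [p for p in pods if p["status"]["phase"] == "Succeeded"]
    candidates = succeeded or pods

    # Among the candidates, take the most recently created one.
    return max(candidates, key=created)
```

With this rule, a job that has one succeeded and one failed pod reports the succeeded pod regardless of creation order. If no pod succeeded, the latest pod is still used, so the behaviour for fully failed jobs is unchanged.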

Anything else we need to know?

The job that showed this behaviour:

# Please edit the object below. Lines beginning with a '#' will be ignored,
# and an empty file will abort the edit. If an error occurs while saving this file will be
# reopened with the relevant failures.
#
apiVersion: batch/v1
kind: Job
metadata:
  creationTimestamp: 2018-03-06T12:07:24Z
  generateName: nlz-nerdj-
  labels:
    nerd-app: cli
  name: nlz-nerdj-6sfvd
  namespace: advanderveer7-default
  resourceVersion: "16834"
  selfLink: /apis/batch/v1/namespaces/advanderveer7-default/jobs/nlz-nerdj-6sfvd
  uid: ee15eb6a-2136-11e8-a29c-3863bb49f468
spec:
  backoffLimit: 3
  completions: 1
  parallelism: 1
  selector:
    matchLabels:
      controller-uid: ee15eb6a-2136-11e8-a29c-3863bb49f468
  template:
    metadata:
      creationTimestamp: null
      labels:
        controller-uid: ee15eb6a-2136-11e8-a29c-3863bb49f468
        job-name: nlz-nerdj-6sfvd
        nerd-app: cli
    spec:
      containers:
      - args:
        - co2_calc.py
        - "5"
        image: nerdalize/pythonapp:v3
        imagePullPolicy: IfNotPresent
        name: main
        resources:
          limits:
            cpu: "2"
            memory: 3Gi
          requests:
            cpu: "2"
            memory: 3Gi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /input
          name: 2f696e707574
        - mountPath: /output
          name: 2f6f7574707574
      dnsPolicy: ClusterFirst
      restartPolicy: Never
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
      volumes:
      - flexVolume:
          driver: nerdalize.com/dataset
          options:
            input/dataset: d-clgbn
        name: 2f696e707574
      - flexVolume:
          driver: nerdalize.com/dataset
          options:
            output/dataset: d-zs4x9
        name: 2f6f7574707574
status:
  completionTime: 2018-03-06T12:10:54Z
  conditions:
  - lastProbeTime: 2018-03-06T12:10:54Z
    lastTransitionTime: 2018-03-06T12:10:54Z
    status: "True"
    type: Complete
  failed: 1
  startTime: 2018-03-06T12:07:24Z
  succeeded: 1
Edit cancelled, no changes made.
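The status above combines a `Complete` condition with `failed: 1` and `succeeded: 1`. As a rough sketch of how such a status could be reduced to a single display state, the function below reads the job conditions first and only then looks at the counters. It assumes the status is a plain dictionary shaped like the `status` block above; the name `summarize_job_status` and the returned strings are hypothetical.

```python
def summarize_job_status(status):
    """Reduce a Job `status` dict to one human-readable state.

    Assumption: `status` mirrors the `status` block of a batch/v1 Job,
    with optional `conditions`, `active`, `failed` and `succeeded` keys.
    """
    conditions = status.get("conditions", [])

    def has(cond_type):
        # A condition only counts when its status is the string "True".
        return any(
            c.get("type") == cond_type and c.get("status") == "True"
            for c in conditions
        )

    failed = status.get("failed", 0)

    # The terminal conditions take precedence over the pod counters.
    if has("Complete"):
        if failed == 0:
            return "completed"
        return f"completed after {failed} failed attempt(s)"
    if has("Failed"):
        return "failed"
    if status.get("active", 0) > 0:
        return "running"
    return "pending"
```

For the job in this issue, the `Complete` condition wins, so the failed pod is reported as an earlier attempt instead of as the outcome of the job.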

Inspecting the job's pods turns up only the following (failed) pod:

# Please edit the object below. Lines beginning with a '#' will be ignored,
# and an empty file will abort the edit. If an error occurs while saving this file will be
# reopened with the relevant failures.
#
apiVersion: v1
kind: Pod
metadata:
  annotations:
    kubernetes.io/created-by: |
      {"kind":"SerializedReference","apiVersion":"v1","reference":{"kind":"Job","namespace":"advanderveer7-default","name":"nlz-nerdj-6sfvd","uid":"ee15eb6a-2136-11e8-a29c-3863bb49f468","apiVersion":"batch","resourceVersion":"16123"}}
    kubernetes.io/psp: user-psp
    seccomp.security.alpha.kubernetes.io/pod: docker/default
  creationTimestamp: 2018-03-06T12:07:24Z
  generateName: nlz-nerdj-6sfvd-
  labels:
    controller-uid: ee15eb6a-2136-11e8-a29c-3863bb49f468
    job-name: nlz-nerdj-6sfvd
    nerd-app: cli
  name: nlz-nerdj-6sfvd-lj4xf
  namespace: advanderveer7-default
  ownerReferences:
  - apiVersion: batch/v1
    blockOwnerDeletion: true
    controller: true
    kind: Job
    name: nlz-nerdj-6sfvd
    uid: ee15eb6a-2136-11e8-a29c-3863bb49f468
  resourceVersion: "16563"
  selfLink: /api/v1/namespaces/advanderveer7-default/pods/nlz-nerdj-6sfvd-lj4xf
  uid: ee180488-2136-11e8-a29c-3863bb49f468
spec:
  containers:
  - args:
    - co2_calc.py
    - "5"
    image: nerdalize/pythonapp:v3
    imagePullPolicy: IfNotPresent
    name: main
    resources:
      limits:
        cpu: "2"
        memory: 3Gi
      requests:
        cpu: "2"
        memory: 3Gi
    securityContext:
      allowPrivilegeEscalation: false
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /input
      name: 2f696e707574
    - mountPath: /output
      name: 2f6f7574707574
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: default-token-rkc8q
      readOnly: true
  dnsPolicy: ClusterFirst
  nodeName: bh80tm
  restartPolicy: Never
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.alpha.kubernetes.io/notReady
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.alpha.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - flexVolume:
      driver: nerdalize.com/dataset
      options:
        input/dataset: d-clgbn
    name: 2f696e707574
  - flexVolume:
      driver: nerdalize.com/dataset
      options:
        output/dataset: d-zs4x9
    name: 2f6f7574707574
  - name: default-token-rkc8q
    secret:
      defaultMode: 420
      secretName: default-token-rkc8q
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: 2018-03-06T12:07:24Z
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: 2018-03-06T12:09:34Z
    message: 'containers with unready status: [main]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: 2018-03-06T12:07:24Z
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: docker://0e2fa134a3b0bc392a94571a558225ff2852a036d0f86e489bd2c7f77de36a76
    image: nerdalize/pythonapp:v3
    imageID: docker-pullable://nerdalize/pythonapp@sha256:84ff6f7ce3be64c86d24c1b86749dfe7cd7a6871e5014de923402bd5c6a4fba0
    lastState: {}
    name: main
    ready: false
    restartCount: 0
    state:
      terminated:
        containerID: docker://0e2fa134a3b0bc392a94571a558225ff2852a036d0f86e489bd2c7f77de36a76
        exitCode: 1
        finishedAt: 2018-03-06T12:09:34Z
        reason: Error
        startedAt: 2018-03-06T12:07:25Z
  hostIP: 100.66.0.46
  phase: Failed
  podIP: 10.233.67.9
  qosClass: Guaranteed
  startTime: 2018-03-06T12:07:24Z
Edit cancelled, no changes made.

Mar 06 '18 13:03 advdv
