k8up icon indicating copy to clipboard operation
k8up copied to clipboard

Jobs are stuck in terminating when using Kubernetes Job Tracking

Open vring0 opened this issue 3 years ago • 6 comments

Description

A lot of jobs do not complete, but hang with the "Terminating" state. I edit the job manually and set the finalizers: []

Additional Context

No response

Logs

No response

Expected Behavior

Jobs complete successfully and should not be stuck with the "Terminating" status.

Steps To Reproduce

No response

Version of K8up

v2.1.2

Version of Kubernetes

v1.22.6+k3s1

Distribution of Kubernetes

k3s

vring0 avatar Jun 16 '22 09:06 vring0

Hi. Can you please provide any logs, your specs and the steps to reproduce? It would also be helpful to post the YAML of the job before removing manually the finalizers. It's hard to help you if we don't have the relevant information and can't reproduce the error ourselves. Thank you :)

ccremer avatar Jun 22 '22 13:06 ccremer

Hi @vring0

Could you please provide the information requested by ccremer? Otherwise it's not possible for us to help you with the issues you're having.

Kidswiss avatar Jul 04 '22 11:07 Kidswiss

I am currently on vacation. My problem is real. I will write on my return.

vring0 avatar Jul 05 '22 13:07 vring0

I changed the IP addresses in the variables to 127.0.0.1

Terminaiting Job Yaml

apiVersion: batch/v1
kind: Job
metadata:
  annotations:
    batch.kubernetes.io/job-tracking: ""
  creationTimestamp: "2022-07-22T04:00:03Z"
  deletionGracePeriodSeconds: 0
  deletionTimestamp: "2022-07-22T05:00:17Z"
  finalizers:
  - foregroundDeletion
  generation: 2
  labels:
    k8up.io/type: backup
    k8upjob: "true"
  name: backup-schedule-backup-test123-backup-ptzxv
  namespace: test123
  ownerReferences:
  - apiVersion: k8up.io/v1
    blockOwnerDeletion: true
    controller: true
    kind: Backup
    name: schedule-backup-test123-backup-ptzxv
    uid: 58f7adbf-13b6-4aa2-b19b-43568ca3832d
  resourceVersion: "18690552"
  uid: 057b0203-b64f-4f15-8b0b-0b6211a12a6a
spec:
  backoffLimit: 6
  completionMode: NonIndexed
  completions: 1
  parallelism: 1
  selector:
    matchLabels:
      controller-uid: 057b0203-b64f-4f15-8b0b-0b6211a12a6a
  suspend: false
  template:
    metadata:
      creationTimestamp: null
      labels:
        controller-uid: 057b0203-b64f-4f15-8b0b-0b6211a12a6a
        job-name: backup-schedule-backup-test123-backup-ptzxv
        k8upjob: "true"
    spec:
      containers:
      - command:
        - /usr/local/bin/k8up
        - restic
        env:
        - name: AWS_ACCESS_KEY_ID
          valueFrom:
            secretKeyRef:
              key: AWS_ACCESS_KEY_ID
              name: aws-credentials
        - name: STATS_URL
        - name: FILEEXTENSION_ANNOTATION
          value: k8up.io/file-extension
        - name: PROM_URL
          value: http://127.0.0.1/
        - name: BACKUPCOMMAND_ANNOTATION
          value: k8up.io/backupcommand
        - name: HOSTNAME
          value: test123
        - name: RESTIC_PASSWORD
          valueFrom:
            secretKeyRef:
              key: K8UP_PASSWORD
              name: backup-repo
        - name: AWS_SECRET_ACCESS_KEY
          valueFrom:
            secretKeyRef:
              key: AWS_SECRET_ACCESS_KEY
              name: aws-credentials
        - name: RESTIC_REPOSITORY
          value: s3:http://127.0.0.1:9000/test123-master
        image: ghcr.io/k8up-io/k8up:v2.1.2
        imagePullPolicy: IfNotPresent
        name: backup
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /data/test123-claim
          name: test123-claim
          readOnly: true
      dnsPolicy: ClusterFirst
      restartPolicy: OnFailure
      schedulerName: default-scheduler
      securityContext:
        runAsUser: 13001
      serviceAccount: pod-executor
      serviceAccountName: pod-executor
      terminationGracePeriodSeconds: 30
      volumes:
      - name: test123-claim
        persistentVolumeClaim:
          claimName: test123-claim
status:
  completionTime: "2022-07-22T04:00:17Z"
  conditions:
  - lastProbeTime: "2022-07-22T04:00:17Z"
    lastTransitionTime: "2022-07-22T04:00:17Z"
    status: "True"
    type: Complete
  startTime: "2022-07-22T04:00:03Z"
  succeeded: 1
  uncountedTerminatedPods: {}

Terminaiting Pod Yaml

apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: "2022-07-22T04:00:03Z"
  deletionGracePeriodSeconds: 0
  deletionTimestamp: "2022-07-22T05:00:17Z"
  finalizers:
  - batch.kubernetes.io/job-tracking
  generateName: backup-schedule-backup-test123-backup-ptzxv--1-
  labels:
    controller-uid: 057b0203-b64f-4f15-8b0b-0b6211a12a6a
    job-name: backup-schedule-backup-test123-backup-ptzxv
    k8upjob: "true"
  name: backup-schedule-backup-test123-backup-ptzxv--1-lfv7n
  namespace: test123
  ownerReferences:
  - apiVersion: batch/v1
    blockOwnerDeletion: true
    controller: true
    kind: Job
    name: backup-schedule-backup-test123-backup-ptzxv
    uid: 057b0203-b64f-4f15-8b0b-0b6211a12a6a
  resourceVersion: "18690553"
  uid: 14227250-376b-410c-bf5c-d00996402a39
spec:
  containers:
  - command:
    - /usr/local/bin/k8up
    - restic
    env:
    - name: AWS_ACCESS_KEY_ID
      valueFrom:
        secretKeyRef:
          key: AWS_ACCESS_KEY_ID
          name: aws-credentials
    - name: STATS_URL
    - name: FILEEXTENSION_ANNOTATION
      value: k8up.io/file-extension
    - name: PROM_URL
      value: http://127.0.0.1/
    - name: BACKUPCOMMAND_ANNOTATION
      value: k8up.io/backupcommand
    - name: HOSTNAME
      value: test123
    - name: RESTIC_PASSWORD
      valueFrom:
        secretKeyRef:
          key: K8UP_PASSWORD
          name: backup-repo
    - name: AWS_SECRET_ACCESS_KEY
      valueFrom:
        secretKeyRef:
          key: AWS_SECRET_ACCESS_KEY
          name: aws-credentials
    - name: RESTIC_REPOSITORY
      value: s3:http://127.0.0.1:9000/test123-master
    image: ghcr.io/k8up-io/k8up:v2.1.2
    imagePullPolicy: IfNotPresent
    name: backup
    resources: {}
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /data/test123-claim
      name: test123-claim
      readOnly: true
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-xftnb
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  nodeName: prod-2
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: OnFailure
  schedulerName: default-scheduler
  securityContext:
    runAsUser: 13001
  serviceAccount: pod-executor
  serviceAccountName: pod-executor
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - name: test123-claim
    persistentVolumeClaim:
      claimName: test123-claim
  - name: kube-api-access-xftnb
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2022-07-22T04:00:03Z"
    reason: PodCompleted
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2022-07-22T04:00:17Z"
    reason: PodCompleted
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2022-07-22T04:00:17Z"
    reason: PodCompleted
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2022-07-22T04:00:03Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: containerd://f3fe195b3f9ef93cb30d386f1294ec293732aa29bb8fbc16d7c5ca324afc3510
    image: ghcr.io/k8up-io/k8up:v2.1.2
    imageID: ghcr.io/k8up-io/k8up@sha256:533b4c016d392fd70b52bae69d6321da46d41af01e6d8095da1f2da7610c0c0c
    lastState: {}
    name: backup
    ready: false
    restartCount: 0
    started: false
    state:
      terminated:
        containerID: containerd://f3fe195b3f9ef93cb30d386f1294ec293732aa29bb8fbc16d7c5ca324afc3510
        exitCode: 0
        finishedAt: "2022-07-22T04:00:16Z"
        reason: Completed
        startedAt: "2022-07-22T04:00:04Z"
  hostIP: 142.91.9.214
  phase: Succeeded
  podIP: 10.42.2.58
  podIPs:
  - ip: 10.42.2.58
  qosClass: BestEffort
  startTime: "2022-07-22T04:00:03Z"

Logs backup Pod

1.658462404121403e+09	INFO	k8up	Starting k8up…	{"version": "2.1.2", "date": "2022-03-30T11:52:11Z", "commit": "18c6c8c325db35068327533906b158ea29ace4ae", "go_os": "linux", "go_arch": "amd64", "go_version": "go1.17.7", "uid": 13001, "gid": 0}
1.658462404121761e+09	INFO	k8up.restic	initializing
1.6584624041217833e+09	INFO	k8up.restic	setting up a signal handler
1.658462404122068e+09	INFO	k8up.restic.restic	using the following restic options	{"options": [""]}
1.6584624041222034e+09	INFO	k8up.restic.restic.RepoInit.command	restic command	{"path": "/usr/local/bin/restic", "args": ["init", "--option", ""]}
1.6584624041222608e+09	INFO	k8up.restic.restic.RepoInit.command	Defining RESTIC_PROGRESS_FPS	{"frequency": 0.016666666666666666}
1.658462404450951e+09	INFO	k8up.restic.restic.unlock	unlocking repository	{"all": false}
1.6584624044510314e+09	INFO	k8up.restic.restic.unlock.command	restic command	{"path": "/usr/local/bin/restic", "args": ["unlock", "--option", ""]}
1.658462404451058e+09	INFO	k8up.restic.restic.unlock.command	Defining RESTIC_PROGRESS_FPS	{"frequency": 0.016666666666666666}
1.658462405552401e+09	INFO	k8up.restic.restic.unlock.restic.stderr	successfully removed locks
1.6584624055558498e+09	INFO	k8up.restic.restic.snapshots	getting list of snapshots
1.6584624055560563e+09	INFO	k8up.restic.restic.snapshots.command	restic command	{"path": "/usr/local/bin/restic", "args": ["snapshots", "--option", "", "--json"]}
1.6584624055561106e+09	INFO	k8up.restic.restic.snapshots.command	Defining RESTIC_PROGRESS_FPS	{"frequency": 0.016666666666666666}
1.6584624097351136e+09	INFO	k8up.restic.k8sClient	listing all pods	{"annotation": "k8up.io/backupcommand", "namespace": "test123"}
1.658462409842674e+09	INFO	k8up.restic	backups of annotated jobs have finished successfully
1.6584624098427212e+09	INFO	k8up.restic.restic.backup	starting backup
1.6584624098428273e+09	INFO	k8up.restic.restic.backup	starting backup for folder	{"foldername": "test123-claim"}
1.6584624098428476e+09	INFO	k8up.restic.restic.backup.command	restic command	{"path": "/usr/local/bin/restic", "args": ["backup", "--option", "", "--json", "--host", "test123", "/data/test123-claim"]}
1.6584624098428583e+09	INFO	k8up.restic.restic.backup.command	Defining RESTIC_PROGRESS_FPS	{"frequency": 0.016666666666666666}
1.6584624139119163e+09	INFO	k8up.restic.restic.backup.progress	progress of backup	{"percentage": "0.00%"}
1.658462414998352e+09	INFO	k8up.restic.restic.backup.progress	backup finished	{"new files": 0, "changed files": 1, "errors": 0}
1.6584624149984374e+09	INFO	k8up.restic.restic.backup.progress	stats	{"time": 4.297125158, "bytes added": 4319383, "bytes processed": 93481350}
1.6584624149986672e+09	INFO	k8up.restic.statsHandler.promStats	sending prometheus stats	{"url": "http://127.0.0.1/"}
1.6584624150027742e+09	ERROR	k8up.restic.restic	prometheus send failed	{"error": "Post \"http://127.0.0.1/metrics/job/restic_backup/instance/test123\": dial tcp 127.0.0.1:80: connect: connection refused"}
github.com/k8up-io/k8up/restic/logging.(*BackupOutputParser).out
	/home/runner/work/k8up/k8up/restic/logging/logging.go:162
github.com/k8up-io/k8up/restic/logging.writer.Write
	/home/runner/work/k8up/k8up/restic/logging/logging.go:103
io.copyBuffer
	/opt/hostedtoolcache/go/1.17.7/x64/src/io/io.go:425
io.Copy
	/opt/hostedtoolcache/go/1.17.7/x64/src/io/io.go:382
os/exec.(*Cmd).writerDescriptor.func1
	/opt/hostedtoolcache/go/1.17.7/x64/src/os/exec/exec.go:311
os/exec.(*Cmd).Start.func1
	/opt/hostedtoolcache/go/1.17.7/x64/src/os/exec/exec.go:441
1.6584624150521045e+09	INFO	k8up.restic.restic.backup	backup finished, sending snapshot list
1.658462415052187e+09	INFO	k8up.restic.restic.snapshots	getting list of snapshots
1.658462415052233e+09	INFO	k8up.restic.restic.snapshots.command	restic command	{"path": "/usr/local/bin/restic", "args": ["snapshots", "--option", "", "--json"]}
1.658462415052272e+09	INFO	k8up.restic.restic.snapshots.command	Defining RESTIC_PROGRESS_FPS	{"frequency": 0.016666666666666666}

vring0 avatar Jul 27 '22 06:07 vring0

Hi. I see you have enabled the Job tracking feature gate on the cluster (https://kubernetes.io/docs/concepts/workloads/controllers/job/#job-tracking-with-finalizers) This is a beta feature of Kubernetes 1.23 (Alpha in K8s 1.22) and we have already the same issue discussed here: https://github.com/k8up-io/k8up/issues/554

In short, I believe this is an issue on Kubernetes' side, not in K8up. K8up doesn't add this finalizer.

ccremer avatar Jul 27 '22 08:07 ccremer

@vring0

Do you still encounter this issue with Kubernetes 1.23+? According to https://github.com/k8up-io/k8up/issues/554 it seems to work with 1.23.

Kidswiss avatar Sep 08 '22 08:09 Kidswiss

@vring0

Вы все еще сталкиваетесь с этой проблемой в Kubernetes 1.23+? Судя по #554 , вроде работает с 1.23.

Not yet, thx!

vring0 avatar Nov 11 '22 07:11 vring0