Failed CronJob runs get re-raised until cleaned up (and don't have a message)

daaain opened this issue on Oct 16 '20 · 10 comments

I found an odd issue: it seems that if there's a Failed run of a CronJob, k8s-sentry keeps re-raising it as a Sentry issue approximately 8 times an hour until the failed pod is removed.

The only workaround is to delete the failed pods manually:

kubectl delete pods --field-selector status.phase=Failed --all-namespaces
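
A possible alternative to manual deletion, assuming the cluster has the TTL-after-finished feature available (it was still alpha around Kubernetes 1.16, so it may not be enabled everywhere), is to let Kubernetes garbage-collect finished Jobs and their pods by setting ttlSecondsAfterFinished on the job template. A minimal sketch with illustrative names:

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: example-cronjob              # illustrative name
spec:
  schedule: "*/5 * * * *"
  jobTemplate:
    spec:
      # Delete the Job (and its pods) an hour after it finishes, whether it
      # succeeded or failed, so failed pods don't linger and keep being reported.
      ttlSecondsAfterFinished: 3600
      backoffLimit: 1
      template:
        spec:
          containers:
          - name: example
            image: busybox
            command: ["sh", "-c", "exit 1"]   # illustrative failing command
          restartPolicy: Never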

It also seems that the error message / reason is missing; the actual issue was the container exiting with exit code 1.

[screenshot]

daaain · Oct 16 '20

Having this issue also with a regular job... it completely blew out my dev Sentry quota o.o

[screenshot]

williscool · Oct 17 '20

That definitely sounds like a real bug. I'll try to reproduce this.

wichert · Oct 19 '20

@daaain Isn't the behaviour correct for you? I would expect a Sentry event every time the created Pod failed. For example with this Job:

apiVersion: batch/v1
kind: Job
metadata:
  name: failure
spec:
  template:
    spec:
      containers:
      - name: failure
        image: busybox
        command: ["sh",  "-c", "echo Failing now ; /bin/false"]
      restartPolicy: Never
  backoffLimit: 4

Kubernetes will try to create and run a pod four times, resulting in four failure events. It sounds like your CronJob is set up to make 8 attempts per hour?

wichert · Oct 19 '20

@williscool Can you show me what your Job resource looks like? In my test k8s-sentry does not report more errors than the number of times Kubernetes tries to run the job, so I'm wondering how you get 1500 events for a single Job.

wichert · Oct 19 '20

@wichert it's a single run of the CronJob that was repeated in Sentry; the runs afterwards were successful.

It's running in a cluster with 1.16.13-gke.401

This is the template from GKE with some irrelevant bits removed:

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: etl-sync-worker-bike-public-module-state
spec:
  concurrencyPolicy: Forbid
  failedJobsHistoryLimit: 1
  jobTemplate:
    metadata:
      labels:
        app: etl-sync-worker-bike-public-module-state
    spec:
      backoffLimit: 1
      template:
        metadata:
          creationTimestamp: null
        spec:
          containers:
          - command: ...
            image: ...
            name: etl-sync-worker-bike-public-module-state
          restartPolicy: Never
          terminationGracePeriodSeconds: 30
  schedule: 0/2 * * * *
  startingDeadlineSeconds: 60
  successfulJobsHistoryLimit: 3

daaain · Oct 19 '20

Oh actually, I just realised this might be relevant: I have a postStart lifecycle command, which might be the one that failed the run. That would explain why there wasn't any output.
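
For reference, a postStart hook sits on the container spec roughly like the sketch below (names here are illustrative, not the real manifest). If the hook command exits non-zero, Kubernetes kills the container and records a FailedPostStartHook event, so there is no application output, which could explain the empty message:

containers:
- name: etl-sync-worker                   # illustrative
  image: example/etl-worker:latest        # illustrative
  lifecycle:
    postStart:
      exec:
        # If this command exits non-zero the container is killed; the failure
        # shows up as a FailedPostStartHook event rather than container output.
        command: ["sh", "-c", "run-some-setup-step"]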

daaain · Oct 19 '20

So my job is written in Pulumi TypeScript for Kubernetes, but it's pretty easy to follow if you know what a normal YAML file looks like.

the main event is

yarn install --production=false --no-progress && yarn jest

So far I think this is what the issue was... I had a test that was failing in a way that kept a database connection open, which hung the Jest process so it kept returning exit 1... somehow every couple of minutes k8s-sentry was observing that failure and sending it to Sentry.

const testJob = new k8s.batch.v1.Job(`${projectName}-test-job`, {
    spec: {
        backoffLimit: 0, // only run once
        template: {
            metadata: {
                generateName: `${projectName}-test-job-`,
            },
            spec: {
                containers: [
                    { 
                        name: `${projectName}-test`,
                        image: watcherImage,
                        command: ["/bin/sh"],
                        args: ["-c", "yarn install --production=false --no-progress && yarn jest"],
                        env: [
                            { name: "DD_ENV", value: pulumi.getStack() },
                            { name: "SENTRY_DSN", value: notificationServiceSentryDsn },
                            { name: "PG_DATABASE_URL", value: dburl },
                            // ... more env vars and such ...
                        ]
                    },
                ],
                restartPolicy: "Never",
            }
        },
    },
}, { provider: cluster.provider });
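
For the hanging-Jest side of this (separate from the k8s-sentry behaviour), Jest's standard CLI flags can at least surface and work around open handles such as a leaked database connection, for example something like:

yarn jest --detectOpenHandles --forceExit

--detectOpenHandles reports what is keeping the process alive, and --forceExit makes Jest exit once the test run completes even if a handle is still open.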

Had to set a filter:

[screenshot]

And by the looks of things

[screenshot]

it's still an issue even though I have this test passing now.

williscool · Oct 19 '20

Some new info: I had a CronJob fail overnight a few times by running out of memory and getting OOMKilled with exit code 137. That didn't trigger the repetition, but it also didn't have any error message after the Pod name, so at least one part of the problem seems to be easier to reproduce.

daaain · Nov 02 '20

We've been hit by this bug a few times as well, most recently over the last ~22 hours.

We have a cron job that runs every 5 minutes in Kubernetes. It had 1 failed invocation yesterday which was retried 3 times, and we only noticed today that the 4 resulting failed pods had caused 10k errors to be reported to Sentry. Deleting the failed pods stopped the errors from being reported.

The number of errors reported approximately matches 4 pods each getting an error reported every 30 seconds for 22 hours (4 pods × 2 errors per minute × 60 minutes × 22 hours ≈ 10,500 errors).

Looking through the source code, it stands out to me that the isNewTermination check is only used when it's not the entire pod that has failed. If that check could also be applied to fully failed pods, we could at least avoid getting the errors reported to Sentry repeatedly, chewing away at the quota.

Would that be a change that would make sense?

Tenzer · Sep 27 '21

I can see https://github.com/wichert/k8s-sentry/pull/14 already makes that change as part of a bigger change in how failed pods are handled, and I believe getting it merged would help solve this problem.

Tenzer · Sep 27 '21