worker Inform hub of un-requeueable job

Inform hub of un-requeueable job

Open joepvd opened this issue 7 years ago • 0 comments

A GCE job could not be started, and the attempt to requeue it failed as well.

Oct 23 20:31:57 production-1-worker-com-c-5-gce level=error msg="couldn't start instance" err="context deadline exceeded"
Oct 23 20:31:57 production-1-worker-com-c-5-gce level=info msg="requeueing job"
Oct 23 20:31:57 production-1-worker-com-c-5-gce level=error msg="couldn't requeue job" err="context deadline exceeded"

As a result, hub was never informed that the job had failed. Only after quite some time did hub clean it up:

travis-com-hub-production
Erroring stale job: id=123 state=received updated_at=2017-10-23 18:26:41 UTC.

Number of occurrences of "couldn't requeue job" in the last 7 hours, grouped by hour (CEST):

[Figure: GCE .org]

[Figure: GCE .com]

This means that the stale job has been consuming concurrency for quite a while. It would be good if worker made more attempts to inform hub in this error scenario.

Some extra details in this support ticket.

Oct 24 '17 13:10 joepvd
