
Inform hub of un-requeueable job

Open joepvd opened this issue 7 years ago • 0 comments

A GCE job could not get started, and the requeue ended up in error.

Oct 23 20:31:57 production-1-worker-com-c-5-gce level=error msg="couldn't start instance" err="context deadline exceeded"
Oct 23 20:31:57 production-1-worker-com-c-5-gce level=info msg="requeueing job"
Oct 23 20:31:57 production-1-worker-com-c-5-gce level=error msg="couldn't requeue job" err="context deadline exceeded" 

As a result, hub was never informed that the job had failed. Only after quite some time did hub clean it up:

travis-com-hub-production
Erroring stale job: id=123 state=received updated_at=2017-10-23 18:26:41 UTC. 

Number of occurrences of "couldn't requeue job" in the last 7 hours, grouped by hour (CEST):

GCE .org:

06 321
07 58
08 1
09 3
10 2
11 2
13 2

GCE .com:

06 232
07 47
09 9
10 1
11 59
12 5
13 3

This means that the stale job kept consuming concurrency for quite a while. It would be good if more attempts were made to inform hub in this error scenario.

Some extra details in this support ticket.

joepvd avatar Oct 24 '17 13:10 joepvd