monitor failing jobs

Open majamassarini opened this issue 9 months ago • 1 comments

A failed job is not always marked as a failed Celery task.

To detect this kind of situation automatically, we should create new variables (successful builds/tests, failed builds/tests), then collect them and send them to the pushgateway (as we do for the queued and started builds/tests). We should then raise an alert when the share of failures stays near 100% over a broad time frame (10 minutes?). Or something similar.
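A rough sketch of the check described above, only to make the arithmetic concrete. In practice a rule like this would more likely live in the monitoring stack's alerting configuration than in application code. The class name `FailureWindow`, the 95% threshold and the minimum sample count are all assumptions for illustration; only the 10-minute window comes from the proposal.

```python
import time
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Optional, Tuple


@dataclass
class FailureWindow:
    """Tracks job outcomes over a sliding time window (hypothetical helper)."""

    window_seconds: float = 600.0  # the "10 minutes?" from the proposal
    threshold: float = 0.95        # assumed meaning of "near 100%"
    min_samples: int = 5           # assumed guard against alerting on 1 or 2 jobs
    _events: Deque[Tuple[float, bool]] = field(default_factory=deque)

    def _evict(self, now: float) -> None:
        # Drop outcomes that are older than the window.
        cutoff = now - self.window_seconds
        while self._events and self._events[0][0] < cutoff:
            self._events.popleft()

    def record(self, failed: bool, now: Optional[float] = None) -> None:
        """Store one finished build/test outcome."""
        now = time.monotonic() if now is None else now
        self._events.append((now, failed))
        self._evict(now)

    def failure_ratio(self, now: Optional[float] = None) -> Optional[float]:
        """Return failed / total inside the window, or None if it is empty."""
        now = time.monotonic() if now is None else now
        self._evict(now)
        if not self._events:
            return None
        failures = sum(1 for _, failed in self._events if failed)
        return failures / len(self._events)

    def should_alert(self, now: Optional[float] = None) -> bool:
        """True when there are enough samples and nearly all of them failed."""
        ratio = self.failure_ratio(now)
        if ratio is None or len(self._events) < self.min_samples:
            return False
        return ratio >= self.threshold
```

The minimum sample count is there so that a single failed job in a quiet period does not count as "100% failures".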

Mar 05 '25 09:03 majamassarini

> Not always a failed job is marked as a failed celery task.

I think this is correct behaviour, because the task successfully finishes.

> To automatically detect this kind of situations we should create new variables (successful builds/tests, failed builds/tests), collect and send them to the pushgateway (as we do for the queued and started builds/tests).

We could maybe just adjust the existing metrics, e.g. copr_builds_finished and test_runs_finished, so that they carry a status label.
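For illustration, a minimal sketch of what that could look like with the `prometheus_client` library. This is not the project's actual monitoring code: the label values `success`/`failure`, the helper names, the job name and the gateway address passed to `push` are assumptions. Only the two metric names come from the comment above.

```python
from prometheus_client import CollectorRegistry, Counter, push_to_gateway

# Separate registry so that only these metrics are pushed.
registry = CollectorRegistry()

# Assumed re-declaration of the existing metrics with a "status" label.
copr_builds_finished = Counter(
    "copr_builds_finished",
    "Number of Copr builds finished, by outcome",
    ["status"],
    registry=registry,
)
test_runs_finished = Counter(
    "test_runs_finished",
    "Number of test runs finished, by outcome",
    ["status"],
    registry=registry,
)


def _status(success: bool) -> str:
    # Hypothetical label values; the real ones may differ.
    return "success" if success else "failure"


def record_copr_build(success: bool) -> None:
    """Increment the build counter for the given outcome."""
    copr_builds_finished.labels(status=_status(success)).inc()


def record_test_run(success: bool) -> None:
    """Increment the test-run counter for the given outcome."""
    test_runs_finished.labels(status=_status(success)).inc()


def push(gateway_address: str, job: str = "packit-service") -> None:
    """Send all metrics in the registry to the pushgateway (job name assumed)."""
    push_to_gateway(gateway_address, job=job, registry=registry)
```

With a label like this, the failure share can be computed on the monitoring side from a single metric, without introducing separate "successful" and "failed" variables.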

Mar 05 '25 13:03 lbarcziova