Monitor failing jobs
A failed job is not always marked as a failed Celery task.
To detect this kind of situation automatically, we should create new variables (successful builds/tests, failed builds/tests), collect them, and send them to the pushgateway (as we do for the queued and started builds/tests). We should then raise an alert when the share of failures stays near 100% over a broad time frame (10 minutes?), or something similar.
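To make the alert condition concrete, here is a rough Python sketch of "failures near 100% over a broad time frame". Everything in it is an assumption for illustration: the class name `FailureRateWindow`, the 0.95 threshold, and the `min_jobs` guard are hypothetical, and in practice the check would more likely live in an alerting rule on the monitoring side than in application code.

```python
from collections import deque


class FailureRateWindow:
    """Track job outcomes in a sliding time window (hypothetical helper)."""

    def __init__(self, window_seconds=600.0, threshold=0.95, min_jobs=5):
        # 600 s matches the "10 minutes?" idea; threshold and min_jobs
        # are assumed values, not taken from any existing configuration.
        self.window_seconds = window_seconds
        self.threshold = threshold
        self.min_jobs = min_jobs
        self._events = deque()  # (timestamp, failed) pairs, oldest first

    def record(self, timestamp, failed):
        """Remember one finished job and whether it failed."""
        self._events.append((timestamp, failed))

    def should_alert(self, now):
        """Return True when the failure share in the window is too high."""
        cutoff = now - self.window_seconds
        # Drop events that have fallen out of the window.
        while self._events and self._events[0][0] < cutoff:
            self._events.popleft()
        total = len(self._events)
        if total < self.min_jobs:
            # Too few jobs to say anything meaningful.
            return False
        failures = sum(1 for _, failed in self._events if failed)
        return failures / total >= self.threshold
```

The `min_jobs` guard is there so that a single failed job during a quiet period does not count as a 100% failure rate.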
A failed job is not always marked as a failed Celery task.
I think this is the correct behaviour, because the task itself finishes successfully.
To detect this kind of situation automatically, we should create new variables (successful builds/tests, failed builds/tests), collect them, and send them to the pushgateway (as we do for the queued and started builds/tests).
We could maybe just adjust the existing metrics, e.g. copr_builds_finished/test_runs_finished, to carry status labels.
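A minimal sketch of what status labels could look like, assuming the metrics are produced with the Python `prometheus_client` library and that a counter is the right metric type here (neither is confirmed in this thread). The label values `success`/`failure`, the helper names, the gateway address, and the job name are all hypothetical.

```python
from prometheus_client import CollectorRegistry, Counter, push_to_gateway

# Dedicated registry so only these metrics are pushed.
registry = CollectorRegistry()

# Same metric names as today, but with a "status" label (assumed values:
# "success" and "failure").
BUILDS_FINISHED = Counter(
    "copr_builds_finished",
    "Finished Copr builds, by final status",
    ["status"],
    registry=registry,
)
TEST_RUNS_FINISHED = Counter(
    "test_runs_finished",
    "Finished test runs, by final status",
    ["status"],
    registry=registry,
)


def _status(succeeded):
    """Map a boolean outcome to the assumed label value."""
    return "success" if succeeded else "failure"


def record_build_finished(succeeded):
    """Count one finished build under the matching status label."""
    BUILDS_FINISHED.labels(status=_status(succeeded)).inc()


def record_test_run_finished(succeeded):
    """Count one finished test run under the matching status label."""
    TEST_RUNS_FINISHED.labels(status=_status(succeeded)).inc()


def push_metrics(gateway="localhost:9091", job="hypothetical-job-name"):
    """Push the registry to a pushgateway (address and job are placeholders)."""
    push_to_gateway(gateway, job=job, registry=registry)
```

With `prometheus_client`, a counter declared as `copr_builds_finished` is exposed as `copr_builds_finished_total`. An alert could then compare the samples labelled with the failure status against the sum over all statuses, without introducing separate success and failure metrics.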