osbuild-composer
prometheus: heartbeat failures are not counted as failures
When a job fails due to the heartbeat timing out, it is not counted as a failed job. This means we overestimate our SLI.
Other metrics are also not captured about the job in this case.
cc @kingsleyzissou @croissanne @ondrejbudai
Turns out I can't read. Ignore me.
Still worth an investigation. I had multiple timeouts today and none of them showed up on the dashboard.
I looked into this a little bit last week. When testing locally, the errors did show up in prometheus...
Unfortunately, our prometheus logs only go as far back as December. We don't log when we're removing an unresponsive job anymore, so it's hard to say when the errors are happening.
I could add this line back in and then try to monitor (but open to other suggestions): https://github.com/osbuild/osbuild-composer/commit/626530818d4b39839a49b6477b16ffb9e2c0d6fb#diff-e9b0577660bfdf5be86ff316154728805fbbc038187e047e9e85f8290da3c473L114
Yea let's log it at least.
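For reference, here is a rough sketch of what "log it and count it" could look like. This is not the actual osbuild-composer code: `heartbeatTracker`, `reap`, the `heartbeat_timeout` label and the log wording are all made up for illustration, and a plain map stands in for the real Prometheus counter.

```go
package main

import (
	"fmt"
	"log"
	"sort"
)

// heartbeatTracker is a hypothetical stand-in for the worker server's
// job bookkeeping. It is not the real osbuild-composer type.
type heartbeatTracker struct {
	lastSeen map[string]int64 // job ID -> Unix time of the last heartbeat
	failed   map[string]int   // failure reason -> count (stand-in for a Prometheus counter)
}

func newHeartbeatTracker() *heartbeatTracker {
	return &heartbeatTracker{
		lastSeen: make(map[string]int64),
		failed:   make(map[string]int),
	}
}

// heartbeat records that a job was alive at the given Unix time.
func (h *heartbeatTracker) heartbeat(jobID string, now int64) {
	h.lastSeen[jobID] = now
}

// reap removes every job whose last heartbeat is more than timeout seconds
// old. Each removal is logged and counted as a failed job, so it shows up
// in both the logs and the failure metric. It returns the removed job IDs
// in sorted order.
func (h *heartbeatTracker) reap(now, timeout int64) []string {
	var removed []string
	for id, seen := range h.lastSeen {
		if now-seen > timeout {
			removed = append(removed, id)
		}
	}
	sort.Strings(removed)

	for _, id := range removed {
		delete(h.lastSeen, id)
		h.failed["heartbeat_timeout"]++
		// Hypothetical log wording; the line removed in the linked commit may differ.
		log.Printf("removing unresponsive job %s (heartbeat timed out)", id)
	}
	return removed
}

func main() {
	h := newHeartbeatTracker()
	h.heartbeat("job-a", 0)  // stale by the time we reap
	h.heartbeat("job-b", 90) // still fresh

	removed := h.reap(100, 60)
	fmt.Println(removed, h.failed["heartbeat_timeout"])
}
```

Logging in the same place that bumps the counter means the two can be compared later: a timeout that appears in the logs but not on the dashboard would point at the metrics path rather than at the reaper.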