roadrunner
[💡 FEATURE REQUEST]: Metric about the number of failed jobs due to job timeout
Plugin
JOBS
I have an idea!
Version 2.9 offers a "Failed job" metric, which helps us know the total number of failed jobs. That's great. I'd like to add my 2 cents to improve it further: we can categorize "failed" into 2 types:
- failed due to logic, app or system
- failed due to resource shortage (configured number of jobs + max timeout; one failure I can think of in this type is a job failing due to job timeout: the job is put into the queue, but the system is too busy and throws it away after the timeout)
We focus more on the second type, since it's a metric that tells us about the saturation of our service. What do you think about adding a new metric, "failed due to timeout"?
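To make the idea concrete, here is a minimal sketch of a failed-jobs counter split by reason. It is plain Python with no metrics library and is not RoadRunner's implementation; `FailureReason`, `FailedJobsMetric`, and the reason names are hypothetical and only illustrate the two categories above.

```python
from collections import Counter
from enum import Enum


class FailureReason(Enum):
    """Hypothetical failure categories, mirroring the two types above."""
    ERROR = "error"      # failed due to logic, app or system
    TIMEOUT = "timeout"  # failed because the job timed out


class FailedJobsMetric:
    """Hypothetical in-memory counter of failed jobs, keyed by reason."""

    def __init__(self):
        self._counts = Counter()

    def record(self, reason):
        # Increment the counter for a single failed job.
        self._counts[reason] += 1

    def total(self):
        # Equivalent to the existing overall "failed jobs" number.
        return sum(self._counts.values())

    def by_reason(self, reason):
        # The proposed breakdown, e.g. "failed due to timeout".
        return self._counts[reason]


metric = FailedJobsMetric()
metric.record(FailureReason.ERROR)
metric.record(FailureReason.TIMEOUT)
metric.record(FailureReason.TIMEOUT)
```

With this shape, the overall total stays available while the timeout count can be watched on its own as a saturation signal.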
Thanks.
Hey @lanphan 👋🏻. After a brief internal discussion, we decided to postpone this ticket. The reason is that information about the particular failure reason can be found in the logs. I guess it would be better to send the logs to Loki, or directly to a Grafana dashboard, to see all the job failure reasons.
Let's wait for more feedback on that. Thanks for the contribution 👍🏻
The community added new metrics for JOBS: latency and RPS (requests per second). The metrics don't carry information about particular error messages, so they can't distinguish between different types of errors. You can get this information from the logs.
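As a rough illustration of getting the breakdown from logs, the sketch below counts failed jobs by reason from JSON log lines. The log shape (`level` and `error` fields), the timeout markers, and the function names are all assumptions made for this example, not RoadRunner's actual log format; adjust them to whatever your logs really contain.

```python
import json
from collections import Counter

# Assumed substrings that mark a timeout in the error text (hypothetical).
TIMEOUT_MARKERS = ("timeout", "deadline exceeded")


def classify(error_text):
    """Return "timeout" if the error text looks like a timeout, else "other"."""
    lowered = error_text.lower()
    if any(marker in lowered for marker in TIMEOUT_MARKERS):
        return "timeout"
    return "other"


def count_failures(lines):
    """Count error-level log entries by failure reason.

    Lines that are not JSON objects, or are not error-level, are skipped.
    """
    counts = Counter()
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(entry, dict) or entry.get("level") != "error":
            continue
        counts[classify(str(entry.get("error", "")))] += 1
    return counts


# Made-up sample lines in the assumed format.
sample = [
    '{"level": "error", "error": "job execution timeout"}',
    '{"level": "error", "error": "handler panicked"}',
    '{"level": "info", "msg": "job processed"}',
    "not a json line",
]
result = count_failures(sample)
```

The same classification could be expressed as a query in a log aggregation tool; the point is only that the reason is derived from the log message rather than from the metric itself.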