roadrunner [💡 FEATURE REQUEST]: Metric about the number of failed jobs due to job timeout

Open lanphan opened this issue 3 years ago • 2 comments

Plugin

JOBS

I have an idea!

Version 2.9 offers a "Failed job" metric, which helps us know the total number of failed jobs. It's great. I'd like to add my 2 cents to improve it further: we can distinguish two types of "failed" here:

1. Failed due to logic, app, or system errors.
2. Failed due to resource shortage (the configured number of jobs plus the max timeout). One failure of this type I can think of is a job failing due to job timeout: the job is put into the queue, but the system is too busy and throws it away after the timeout.

We care more about #2, because it tells us about the saturation of our service. What do you think about adding a new metric, "failed due to timeout"?
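To make the idea concrete, here is a minimal sketch of a failed-jobs counter split by reason. It is only an illustration under my own assumptions, not RoadRunner's actual metrics code: the names `FailureReason`, `ReasonApp`, `ReasonTimeout`, and `FailedJobs` are all hypothetical, and it uses only the Go standard library rather than a real metrics client.

```go
package main

import (
	"fmt"
	"sync"
)

// FailureReason is a hypothetical label separating the two failure classes
// described in this issue. It is not part of RoadRunner.
type FailureReason string

const (
	ReasonApp     FailureReason = "app"     // failed due to logic, app, or system
	ReasonTimeout FailureReason = "timeout" // discarded after the job timeout
)

// FailedJobs counts failed jobs per reason. It is safe for concurrent use.
type FailedJobs struct {
	mu     sync.Mutex
	counts map[FailureReason]uint64
}

// NewFailedJobs returns an empty counter set.
func NewFailedJobs() *FailedJobs {
	return &FailedJobs{counts: make(map[FailureReason]uint64)}
}

// Inc records one failed job for the given reason.
func (f *FailedJobs) Inc(r FailureReason) {
	f.mu.Lock()
	defer f.mu.Unlock()
	f.counts[r]++
}

// Get returns the number of failures recorded for one reason.
func (f *FailedJobs) Get(r FailureReason) uint64 {
	f.mu.Lock()
	defer f.mu.Unlock()
	return f.counts[r]
}

// Total returns the sum over all reasons, which corresponds to the
// existing overall "Failed job" number.
func (f *FailedJobs) Total() uint64 {
	f.mu.Lock()
	defer f.mu.Unlock()
	var sum uint64
	for _, c := range f.counts {
		sum += c
	}
	return sum
}

func main() {
	m := NewFailedJobs()
	m.Inc(ReasonApp)
	m.Inc(ReasonTimeout)
	m.Inc(ReasonTimeout)
	fmt.Printf("failed total=%d timeout=%d\n", m.Total(), m.Get(ReasonTimeout))
}
```

With a split like this, the existing total stays available as the sum, while the timeout count alone could serve as the saturation signal.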

Thanks.

Apr 08 '22 08:04 lanphan

Hey @lanphan 👋🏻. After a brief internal discussion, we decided to postpone this ticket. The reason is that the information about the particular failure reason can be found in the logs. I guess it would be better to send the logs to Loki or directly to a Grafana dashboard to see all the reasons jobs failed.

May 10 '22 12:05 rustatian

Let's wait for more feedback on that. Thanks for the contribution 👍🏻

May 10 '22 12:05 rustatian

The community added new metrics for JOBS: latency and RPS (requests per second). The metrics don't carry information about particular error messages, so they can't distinguish between different types of errors. You can get this information from the logs.

Mar 24 '23 00:03 rustatian
