[💡FEATURE REQUEST]: Prometheus metrics for queued and rejected requests (load monitoring)
Problem description
It is currently not possible to detect/monitor the situations when 1) RoadRunner has started queuing requests because the number of simultaneous requests exceeds num_workers, and 2) the "no free workers in the pool" condition has happened. It is actually possible to see the "no free workers in the pool" condition in the log, but only post factum.
The solution I'd like
I'd like to have additional Prometheus metrics for load monitoring.
- The metric for Situation 1 should show the current queue size (0 if no request queuing is happening).
- The metric for Situation 2 is more difficult to define. Ideally, it should not only count the number of "no free workers in the pool" occurrences that have already happened, but also make it possible to predict the situation by calculating "required seconds in the backend" or something similar.
But even just exposing the number of "no free workers in the pool" occurrences to Prometheus could be useful.
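To illustrate what I mean, here is a rough sketch of how such metrics could be consumed once they exist. The metric names are placeholders I made up for this example, not actual RoadRunner metrics:

```yaml
# Hypothetical Prometheus alerting rules; the metric names below are
# placeholders for whatever RoadRunner would actually export.
groups:
  - name: roadrunner-load
    rules:
      # Situation 1: requests are being queued because all workers are busy.
      - alert: RoadRunnerRequestsQueued
        expr: rr_http_requests_queue > 0
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "RoadRunner is queuing HTTP requests (all workers are busy)"
      # Situation 2: "no free workers in the pool" rejections are occurring.
      - alert: RoadRunnerNoFreeWorkers
        expr: increase(rr_http_no_free_workers_total[5m]) > 0
        labels:
          severity: critical
        annotations:
          summary: "Requests were rejected because no free workers were available"
```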
Describe alternatives you've considered
- There is no viable alternative for Situation 1.
- You can grep the logs for "no free workers in the pool" and use it as a hint to increase num_workers or allocation timeout, but this solution has no predictive power.
- The metric for Situation 1 should show the current queue size (0 if no request queuing is happening).
It's possible. I moved the HTTP metrics to the middleware, so any HTTP metrics will be available.
- "no free workers in the pool" condition has happened.
This is not an HTTP metric. This can be seen in the response status (non-200). Logs will show the actual reason. But I'll try to see what can be done here.
"no free workers in the pool" condition has happened.
This is not an HTTP metric. This can be seen in the response status (non-200). Logs will show the actual reason. But I'll try to see what can be done here.
The idea behind this feature request is that when there is a "no free workers" message in the log, it's already too late: we have lost requests, and customers are unhappy. It's better to know ahead of time that there are "too few free workers" and the DoS condition is imminent. Can this "number of free workers" be calculated from some internal data and exposed to Prometheus?
DDoS doesn't relate to this in any way. This is a task for the LB to filter requests, apply a circuit breaker to prevent application DDoS.
"too few free workers"
This is not something unexpected for you. You already know your target load as well as allocation_timeout and other system parameters. So, you already know the maximum RPS per pod, for example. You only need to scale your system based on your parameters.
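As a rough, purely illustrative sketch of that calculation (the numbers below are assumptions, not recommendations): with 16 workers and an average handler time of about 100 ms, one pod can sustain roughly 16 / 0.1 ≈ 160 RPS before requests start waiting for a free worker.

```yaml
# .rr.yaml fragment with illustrative values only
http:
  address: 0.0.0.0:8080
  pool:
    num_workers: 16          # concurrent PHP workers per pod
    allocate_timeout: 60s    # how long a request may wait for a free worker
    destroy_timeout: 60s
# Back-of-envelope capacity: 16 workers / 0.1 s average handler time ≈ 160 RPS per pod.
# Above that rate, requests queue up and eventually hit "no free workers in the pool".
```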
Important note - I'm not against your proposal, the general idea was proposed by other people as well and I agree to do that. But at the moment we have a lot of high-priority tasks, so, this particular metric will not appear soon.
DDoS doesn't relate to this in any way. This is a task for the LB to filter requests, apply a circuit breaker to prevent application DDoS.
I never mentioned a DDoS. I said "DoS condition" which simply means "denial of service" and does not have to be the result of a deliberate attack. It can happen when your app is overwhelmed with legitimate requests because you have become famous and rich overnight.
"too few free workers"
This is not something unexpected for you. You already know your target load as well as allocation_timeout and other system parameters. So, you already know the maximum RPS per pod, for example. You only need to scale your system based on your parameters.
What about autoscaling?
Important note - I'm not against your proposal, the general idea was proposed by other people as well and I agree to do that. But at the moment we have a lot of high-priority tasks, so, this particular metric will not appear soon.
Great, it is not an urgent issue for me either. The bug with static headers (and the ability to have multiple static dirs) is much more important, for example. I'm always glad to see rr improve.
What about autoscaling?
This is also not an RR issue. RR can't autoscale, and this is a DevOps task.
I never mentioned a DDoS. I said "DoS condition" which simply means "denial of service" and does not have to be the result of a deliberate attack. It can happen when your app is overwhelmed with legitimate requests because you have become famous and rich overnight.
Yes, sure. But this is an infra misconfiguration problem. The application endpoints should never be exposed to the open internet and should be protected by the infra.
Great, it is not an urgent issue for me either. The bug with static headers (and the ability to have multiple static dirs) is much more important, for example. I'm always glad to see rr improve.
Yes, the bug with the static files (as well as the new static files config) will be resolved in v2.6.0.
This is also not an RR issue. RR can't autoscale, and this is a DevOps task.
I agree, but to set up autoscaling, a DevOps engineer needs feedback from the application server, preferably in the form of resource metrics (though indirect indicators like CPU load may also be used, for lack of better metrics).
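To make that concrete, what I have in mind is something like the purely illustrative HPA below. It would need a metrics adapter such as prometheus-adapter, and the metric name rr_http_requests_queue is my own placeholder for an RR-exported queue metric, not something that exists today:

```yaml
# Hypothetical HorizontalPodAutoscaler scaling on a (not yet existing) RR queue metric.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: app-rr
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: app-rr
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Pods
      pods:
        metric:
          name: rr_http_requests_queue   # placeholder metric, exposed via a metrics adapter
        target:
          type: AverageValue
          averageValue: "1"              # add pods as soon as requests start queuing
```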
I never mentioned a DDoS. I said "DoS condition" which simply means "denial of service" and does not have to be the result of a deliberate attack. It can happen when your app is overwhelmed with legitimate requests because you have become famous and rich overnight.
Yes, sure. But this is an infra misconfiguration problem. The application endpoints should never be exposed to the open internet and should be protected by the infra.
What difference will it make to the customer if he is denied by RR congestion or by the infrastructure? He will be unhappy either way. The infrastructure should not deny legitimate requests IMHO.
Great, it is not an urgent issue for me either. The bug with static headers (and the ability to have multiple static dirs) is much more important, for example. I'm always glad to see rr improve.
Yes, the bug with the static files (as well as the new static files config) will be resolved in v2.6.0.
Great news!
I agree, but to set up autoscaling, a DevOps engineer needs feedback from the application server, preferably in the form of resource metrics (though indirect indicators like CPU load may also be used, for lack of better metrics).
You don't need such rich feedback, because most of these metrics are constant OR have known upper limits. For example, if your load exceeds 100 RPS, based on the calculations, you should add 1 extra pod. You don't need some extra metric from RR to calculate and apply this. All these stats can be exposed from the LB.
What difference will it make to the customer if he is denied by RR congestion or by the infrastructure? He will be unhappy either way. The infrastructure should not deny legitimate requests IMHO.
The difference is that every concern should be handled in its proper domain. You should not mix the application and the infra.
You don't need such rich feedback, because most of these metrics are constant OR have known upper limits. For example, if your load exceeds 100 RPS, based on the calculations, you should add 1 extra pod. You don't need some extra metric from RR to calculate and apply this. All these stats can be exposed from the LB.
Well, the RPS can be received directly from RR, from rate(rr_http_request_total), and this is great and very convenient.
However, when RR is congested, it logs the "no free workers in the pool" message. Is there a gauge somewhere internally in RR with a number of free workers? Or a number of free workers / num_workers? Or busy_workers/free_workers?
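For completeness, this is the kind of rule I mean for the RPS side. Only rr_http_request_total is taken from the actual metrics mentioned above; the rule name and window are arbitrary:

```yaml
# Prometheus recording rule sketch: per-instance request rate derived from the
# rr_http_request_total counter (rule name and window size chosen arbitrarily).
groups:
  - name: roadrunner-rps
    rules:
      - record: instance:rr_http_requests:rate1m
        expr: rate(rr_http_request_total[1m])
```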
What difference will it make to the customer if he is denied by RR congestion or by the infrastructure? He will be unhappy either way. The infrastructure should not deny legitimate requests IMHO.
The difference is that every concern should be handled in its proper domain. You should not mix the application and the infra.
Indeed, it was not me who introduced the question of infra into the discussion. Agreed, let's drop it and concentrate on RR.
Is there a gauge somewhere internally in RR with a number of free workers? Or a number of free workers / num_workers? Or busy_workers/free_workers?
There are no particular numbers, but you can use the status plugin to obtain statuses. There are readiness (for k8s) and health checks.
Is there a gauge somewhere internally in RR with a number of free workers? Or a number of free workers / num_workers? Or busy_workers/free_workers?
There are no particular numbers, but you can use the status plugin to obtain statuses. There are readiness (for k8s) and health checks.
Thanks for pointing it out, I'll use it for a readinessProbe. I'm already using it for a livenessProbe.
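For anyone finding this later, here is a minimal sketch of the probes I ended up with. The status plugin address and the ?plugin=http query string come from my own setup and may differ in yours:

```yaml
# RoadRunner side (.rr.yaml fragment): the address is an example value
status:
  address: 127.0.0.1:2114
---
# Kubernetes side (container spec fragment) hitting the status plugin endpoints
livenessProbe:
  httpGet:
    path: /health?plugin=http
    port: 2114
  initialDelaySeconds: 5
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /ready?plugin=http
    port: 2114
  periodSeconds: 5
```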
+1 for exposing the number of free/busy workers. It would allow scaling in advance if there is some I/O blocking in the workers and they all become busy.
+1 for exposing the number of free/busy workers. It would allow scaling in advance if there is some I/O blocking in the workers and they all become busy.
No +1 comments please; put a thumbs-up on the first message instead. Thanks.
@victor-sudakov Have you tried the new grafana dashboard: https://github.com/roadrunner-server/roadrunner/tree/master/grafana?
- Done: https://github.com/roadrunner-server/prometheus/blob/master/plugin.go#L52 (have a look at the stable branch for the Grafana dashboards)
- Since we provide the latency (https://github.com/roadrunner-server/prometheus/blob/master/plugin.go#L71) and the queue_size, this metric can easily be calculated on demand. Thanks.
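As a sketch of such an on-demand calculation based on the latency and queue-size metrics mentioned above: the metric names below, other than rr_http_request_total, are assumptions and may not match the exact names exported by your RoadRunner version:

```yaml
# Derived Prometheus rules sketch. Metric names other than rr_http_request_total
# are assumptions; check your /metrics endpoint for the exact exported names.
groups:
  - name: roadrunner-capacity
    rules:
      # Little's law: the rate of summed request durations approximates the
      # average number of busy workers (arrival rate x mean latency).
      - record: instance:rr_http_busy_workers:estimate
        expr: rate(rr_http_request_duration_seconds_sum[5m])
      # Fire while requests are waiting in the queue for a free worker.
      - alert: RoadRunnerQueueNotEmpty
        expr: rr_http_requests_queue > 0
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "RoadRunner request queue is non-empty; consider scaling out"
```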