frankenphp
Metrics to evaluate number of workers
Hi. First of all: great project! In only two days I managed to switch from PHP-PM to FrankenPHP. I hope this project will be maintained for a long time ;) As I'm not familiar with Caddy, I have no idea how to monitor the usage of the Caddy workers: it would be nice to have an easy way to figure out what percentage of the time the workers are busy, so I can detect or even predict an overload of the workers. Maybe this is already possible, in which case a pointer in the documentation would be nice.
Hi and thanks for using FrankenPHP!
Caddy exposes various Prometheus metrics, and we could add more if necessary, but I'm not sure how we can compute the usage of workers.
@dunglas Is there currently any way to monitor deployments for worker starvation?
Simple metrics such as "workers_total" and "workers_busy" (or "workers_available") could work.
This is possible for our nginx, apache2, and swoole deployments, and it is a crucial metric for adjusting the worker count/replicas.
@LauJosefsen currently, there are no such metrics, but this should be easy to add. I'll try to work on it (but feel free to open a PR if you feel comfortable adding this feature).
+1 on this, as it would then be easy to set up HPA on k8s based on these metrics, e.g. if 80% of workers are busy (workers_working / workers_total), spin up a new pod. Example of RR metrics that would be useful: https://docs.roadrunner.dev/docs/logging-and-observability/metrics @dunglas
Is there any update here?
What metrics would be useful to add?
- num threads configured
- num non-worker threads
- num worker threads (foreach configured worker)
maybe as a start?
> What metrics would be useful to add?
> - num threads configured
> - num non-worker threads
> - num worker threads (foreach configured worker)
>
> maybe as a start?
For sure the metric of worker utilisation is useful.
Num of idle workers, num of busy workers. Very useful for auto scaling instances and alerting.
I'm not too familiar with the inner workings of frankenphp, but if all workers are busy, are new requests enqueued or dropped? If enqueued, then number of enqueued requests is useful as well.
Currently I'm having performance issues, and I would like to know whether the slow response times are due to 1. wait time before processing or 2. processing time, and how these two are evolving. That way I can assess whether the workload is just too much, or whether there's something like thermal throttling going on. Now I just see the frankenphp process pegging the CPU.
> For sure the metric of worker utilisation is useful.
One of the first changes I make when testing is a timer that outputs channel length (after making the channel buffered) every couple of seconds, so I can tell when my test has saturated the buffers. I can see how this might be a general metric worth reporting. If that buffer gets full, latency shoots through the roof.
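The timer described above could look roughly like the following. This is only a sketch and not FrankenPHP's actual code: the `queue` channel, the `saturation` and `watch` helpers, and the intervals are all made up for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// saturation reports how full a buffered channel is, from 0.0 (empty) to 1.0 (full).
func saturation(queue chan int) float64 {
	if cap(queue) == 0 {
		return 0 // unbuffered channel: nothing can be queued
	}
	return float64(len(queue)) / float64(cap(queue))
}

// watch prints the queue depth on every tick until stop is closed.
func watch(queue chan int, interval time.Duration, stop <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			fmt.Printf("queue depth: %d/%d (%.0f%% full)\n",
				len(queue), cap(queue), saturation(queue)*100)
		case <-stop:
			return
		}
	}
}

func main() {
	queue := make(chan int, 4) // hypothetical request queue
	stop := make(chan struct{})
	done := make(chan struct{})
	go func() {
		watch(queue, 10*time.Millisecond, stop)
		close(done)
	}()

	for i := 0; i < 3; i++ {
		queue <- i // nobody is consuming, so the buffer fills up
	}
	time.Sleep(50 * time.Millisecond)
	close(stop)
	<-done
}
```

A real setup would use an interval of a few seconds and report the value as a gauge rather than printing it.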
> Now I just see the frankenphp process pegging the CPU.
If you see that, then traffic is too high and you need to scale up. CPU usage should be hanging around 30% for the lowest latency (at least on my workload). Above 60-70% means you will need more CPUs soon.
Also, make sure you don't have too many threads. Context switching can eat up all your CPU time, for no real benefit. If you see a high "steal" time in top, you have too many threads for your machine.
Well, usually I'm well below 20% and steal is zero. For an unchanged request rate, the CPU occasionally goes through the roof, and so does the CPU temperature. I'd think it would be a good idea to analyze it before throwing more hardware at it. Anyway, I'm not here to discuss my problem, only the fact that additional metrics are useful in order to investigate such problems.
> For an unchanged request rate, the CPU occasionally goes through the roof, and so does the CPU temperature.
Can you open a separate issue for this? That doesn't sound right, and it might be a bug or something.
I've been looking at Caddy source, and correct me if I'm mistaken, but it looks like Caddy does not support modules adding custom metrics?
If I'm not mistaken, then I guess there are two solutions: either add support for a metrics module in Caddy, or make a custom route for FrankenPHP metrics?
It is definitely possible to add custom metrics to the existing endpoint. I've done it for Mercure (http://mercure.rocks): https://github.com/dunglas/mercure/blob/main/caddy/caddy.go#L30
> What metrics would be useful to add?
> - num threads configured
> - num non-worker threads
> - num worker threads (foreach configured worker)
>
> maybe as a start?
In my case, I need it for HPA: for example, if 80% of workers are busy or invalid, spin up a new pod. Ideally, the metrics would be split into non-worker threads and threads for each worker, but to begin with, the total number of threads and the number of busy or inactive threads would be enough.
Busy over what time period or should it be instantaneous?
How is this not solved via the caddy_http_requests_in_flight metric?
So you mean that if we have, for example, num_thread = 100 and caddy_http_requests_in_flight is at 80, we can treat it as 80% busy workers? Maybe that can work, but we would also need a metric for the total number of threads. Is that already exposed?
Currently, with FPM we are using https://github.com/hipages/php-fpm_exporter like this:
max((100 / phpfpm_total_processes_gauge) * phpfpm_active_processes_gauge) by (<<.GroupBy>>)
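For readers who don't use PromQL, here is the same arithmetic as a small Go sketch, combined with the 80% scale-up threshold mentioned earlier in the thread. The function names are made up for illustration, and no equivalent FrankenPHP metrics exist at this point in the discussion.

```go
package main

import "fmt"

// busyPercent mirrors (100 / total) * active from the expression above.
// It returns 0 when total is not positive, to avoid dividing by zero.
func busyPercent(active, total int) float64 {
	if total <= 0 {
		return 0
	}
	return 100 * float64(active) / float64(total)
}

// shouldScaleUp reports whether utilisation has reached the threshold (in percent).
func shouldScaleUp(active, total int, thresholdPercent float64) bool {
	return busyPercent(active, total) >= thresholdPercent
}

func main() {
	fmt.Printf("%.1f%% busy, scale up: %v\n",
		busyPercent(8, 10), shouldScaleUp(8, 10, 80))
}
```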
I definitely agree that we should have these new metrics; I'm just wondering how they would be different from caddy_http_requests_in_flight.
From some experimentation, caddy_http_requests_in_flight will be greater than the number of workers (meaning requests are queued up waiting for workers by caddy). So, I think these metrics can be quite useful.
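As a rough illustration of that observation, the gap between the two numbers can be read as a queue estimate. The helper below is hypothetical, and it assumes every in-flight request is waiting for PHP, which may not hold if Caddy is also serving static files or other handlers, so treat the result as an upper bound.

```go
package main

import "fmt"

// estimateQueued returns how many in-flight requests exceed the worker count.
// It is an upper bound: in-flight requests may include ones that never reach PHP.
func estimateQueued(inFlight, workers int) int {
	if inFlight <= workers {
		return 0
	}
	return inFlight - workers
}

func main() {
	fmt.Println(estimateQueued(130, 100)) // more requests than workers
	fmt.Println(estimateQueued(40, 100))  // workers to spare
}
```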
Here's what I'm thinking:
- frankenphp_[worker_name]_total_workers: for each worker script defined, the total number of workers.
- frankenphp_total_php_threads: the total number of PHP threads (this minus the above metric gives the number of threads available to handle cgi-mode requests).
- frankenphp_busy_threads: the total number of threads executing PHP code. Note: for workers, these will always count the worker script. So a dead worker will not be counted here.
- frankenphp_[worker_name]_busy_workers: the total number of workers executing a request.
This should cover worker-only, cgi-mode, and mixed workloads. I think. Am I missing anything?
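To make the proposal a bit more concrete, here is a standard-library-only sketch of how two of these gauges could be tracked and rendered in the Prometheus text format. It is not how FrankenPHP implements anything: the `threadStats` type and its methods are invented for illustration, and a real Caddy module would more likely register collectors with the Prometheus client library.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// threadStats is a hypothetical holder for two of the gauges proposed above.
type threadStats struct {
	total atomic.Int64 // frankenphp_total_php_threads
	busy  atomic.Int64 // frankenphp_busy_threads
}

// run marks one thread as busy for the duration of handle.
func (s *threadStats) run(handle func()) {
	s.busy.Add(1)
	defer s.busy.Add(-1)
	handle()
}

// render returns both gauges in the Prometheus text exposition format.
func (s *threadStats) render() string {
	return fmt.Sprintf(
		"# TYPE frankenphp_total_php_threads gauge\nfrankenphp_total_php_threads %d\n"+
			"# TYPE frankenphp_busy_threads gauge\nfrankenphp_busy_threads %d\n",
		s.total.Load(), s.busy.Load())
}

func main() {
	stats := &threadStats{}
	stats.total.Store(4)
	// Render while one thread is busy handling a (fake) request.
	stats.run(func() { fmt.Print(stats.render()) })
}
```

The per-worker metrics would follow the same pattern with one counter pair per worker script.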
Oh, one thing that may be of interest:
frankenphp_[worker_name]_non_request_time: the time spent between requests doing worker script stuff (GCing, resetting containers, etc). However, I'm not sure how to provide that as a measure. Maybe cumulative time?
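One way the cumulative option could work is an ever-growing total per worker, which a monitoring system can then turn into a rate. The sketch below only illustrates the bookkeeping; `nonRequestClock` and its methods are hypothetical names, and timestamps are passed in explicitly to keep the example deterministic.

```go
package main

import (
	"fmt"
	"time"
)

// nonRequestClock accumulates the time a worker spends between requests.
// It is not safe for concurrent use; one instance per worker thread is assumed.
type nonRequestClock struct {
	lastFinished time.Time     // when the previous request ended
	total        time.Duration // cumulative time spent outside requests
}

// requestStarted adds the gap since the previous request ended, if there was one.
func (c *nonRequestClock) requestStarted(now time.Time) {
	if !c.lastFinished.IsZero() {
		c.total += now.Sub(c.lastFinished)
	}
}

// requestFinished records when the worker went back to non-request work.
func (c *nonRequestClock) requestFinished(now time.Time) {
	c.lastFinished = now
}

func main() {
	var clock nonRequestClock
	start := time.Now()
	clock.requestStarted(start)
	clock.requestFinished(start.Add(20 * time.Millisecond))
	clock.requestStarted(start.Add(25 * time.Millisecond)) // 5ms gap between requests
	fmt.Println("non-request time so far:", clock.total)
}
```

Exposed as a counter, the rate of this total over a time window gives the fraction of time a worker spends outside request handling.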
@withinboredom is there any update on this? thank you.
The last half of my months are usually dedicated to paid work, and the first half to open source projects. So, expect more movement next week.