frankenphp Metrics to evaluate number of workers

Hi. First of all: great project! In only two days I managed to switch from PHP-PM to FrankenPHP. I hope this project will be maintained for a long time ;) As I'm not familiar with Caddy, I have no idea how to monitor the usage of the workers: it would be nice to have an easy way to figure out what percentage of the time the workers are busy, so I can detect or even predict an overload of the workers. Maybe this is already possible, in which case a pointer in the documentation would be nice.

Oct 23 '23 14:10 mathieudz

Hi and thanks for using FrankenPHP!

Caddy exposes various Prometheus metrics, and we could add more if necessary, but I'm not sure how we can compute the usage of workers.

Oct 23 '23 15:10 dunglas

@dunglas Is there currently any way to monitor deployments for worker starvation?

Simple metrics such as "workers_total" and "workers_busy" (or "workers_available") could work.

This is possible for our nginx, apache2, and swoole deployments, and it is a crucial metric for adjusting the worker count/replicas.

Jul 08 '24 12:07 LauJosefsen

@LauJosefsen currently, there are no such metrics, but this should be easy to add. I'll try to work on it (but feel free to open a PR if you feel comfortable adding this feature).

Jul 08 '24 13:07 dunglas

+1 on this, as it would then be easy to set up HPA on k8s based on these metrics, e.g. if 80% of workers are busy (workers_working / workers_total), spin up a new pod. Example of RR metrics that would be useful: https://docs.roadrunner.dev/docs/logging-and-observability/metrics @dunglas
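To make the scaling rule concrete, here is a minimal Go sketch of the arithmetic. The function names, the worker counts and the 80% threshold are only assumptions for illustration; FrankenPHP does not expose such counters at this point in the thread.

```go
package main

import "fmt"

// busyPercent returns the share of busy workers in percent.
// It returns 0 when total is not positive, to avoid dividing by zero.
func busyPercent(busy, total int) float64 {
	if total <= 0 {
		return 0
	}
	return 100 * float64(busy) / float64(total)
}

// shouldScaleUp reports whether utilisation has reached the threshold (in percent).
// The threshold is a hypothetical HPA target, not a FrankenPHP setting.
func shouldScaleUp(busy, total int, thresholdPercent float64) bool {
	return busyPercent(busy, total) >= thresholdPercent
}

func main() {
	// 16 busy workers out of 20 is exactly 80% utilisation.
	fmt.Println(busyPercent(16, 20))
	fmt.Println(shouldScaleUp(16, 20, 80))
}
```

Note that the ratio is busy over total; dividing the other way around gives a number that grows as the workers become idle.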

Jul 11 '24 14:07 AlexOstrovsky

Is there any update here?

Jul 26 '24 13:07 bogdandubyk

What metrics would be useful to add?

num threads configured
num non-worker threads
num worker threads (foreach configured worker)

maybe as a start?

Jul 26 '24 22:07 withinboredom

What metrics would be useful to add?

num threads configured

num non-worker threads

num worker threads (foreach configured worker)

maybe as a start?

For sure the metric of worker utilisation is useful.

Num of idle workers, num of busy workers. Very useful for auto scaling instances and alerting.

I'm not too familiar with the inner workings of frankenphp, but if all workers are busy, are new requests enqueued or dropped? If enqueued, then number of enqueued requests is useful as well.

Jul 29 '24 06:07 LauJosefsen

Currently I'm having performance issues, and I would like to know whether the slow response times are due to 1. wait time before processing or 2. processing time, and how these two are evolving. That way I can assess whether the workload is just too much, or whether there's something like thermal throttling going on. Now I just see the frankenphp process pegging the CPU.
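A rough Go sketch of the split I have in mind, assuming three timestamps can be recorded per request (enqueued, started, finished). The type and function names are hypothetical and not part of FrankenPHP.

```go
package main

import (
	"fmt"
	"time"
)

// timing splits the latency of one request into queue wait and processing time.
type timing struct {
	Wait       time.Duration // time spent waiting for a free worker
	Processing time.Duration // time spent executing the request
}

// split derives both durations from three timestamps.
func split(enqueued, started, finished time.Time) timing {
	return timing{
		Wait:       started.Sub(enqueued),
		Processing: finished.Sub(started),
	}
}

// sampleTiming builds a fixed example: 30ms in the queue, 120ms of processing.
func sampleTiming() timing {
	enqueued := time.Unix(0, 0)
	started := enqueued.Add(30 * time.Millisecond)
	finished := started.Add(120 * time.Millisecond)
	return split(enqueued, started, finished)
}

func main() {
	got := sampleTiming()
	fmt.Println("wait:", got.Wait, "processing:", got.Processing)
}
```

Tracking the two values separately over time would show whether requests are slow because they queue or because the work itself got slower.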

Jul 29 '24 09:07 mathieudz

For sure the metric of worker utilisation is useful.

One of the first changes I make when testing is a timer that outputs channel length (after making the channel buffered) every couple of seconds, so I can tell when my test has saturated the buffers. I can see how this might be a general metric worth reporting. If that buffer gets full, latency shoots through the roof.
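Roughly what that timer looks like, as a standalone Go sketch. The channel here is a hypothetical stand-in for the request queue, not the actual FrankenPHP code, and the interval is shortened so the example finishes quickly.

```go
package main

import (
	"fmt"
	"time"
)

// fillRatio reports how full a buffered channel is, from 0 (empty) to 1 (full).
// An unbuffered channel has capacity 0 and is reported as 0.
func fillRatio(ch chan int) float64 {
	if cap(ch) == 0 {
		return 0
	}
	return float64(len(ch)) / float64(cap(ch))
}

// watch prints the channel length at every tick until done is closed.
func watch(ch chan int, interval time.Duration, done <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			fmt.Printf("queued=%d capacity=%d fill=%.0f%%\n", len(ch), cap(ch), 100*fillRatio(ch))
		case <-done:
			return
		}
	}
}

func main() {
	requests := make(chan int, 4) // hypothetical buffered request queue
	done := make(chan struct{})
	go watch(requests, 10*time.Millisecond, done)

	for i := 0; i < 3; i++ {
		requests <- i // nothing consumes these, so the buffer fills up
	}
	time.Sleep(35 * time.Millisecond)
	close(done)
}
```

When the fill ratio stays at 100%, senders block and latency climbs, which is the saturation signal described above.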

Now I just see the frankenphp process pegging the CPU.

If you see that, then traffic is too high and you need to scale up. CPU usage should hover around 30% for the lowest latency (at least on my workload). Above 60-70% means you will need more CPUs soon.

Also, make sure you don't have too many threads. Context switching can eat up all your cpu time, for no real benefit. If you see a high "steal" time in top, you have too many threads for your machine.

Jul 29 '24 17:07 withinboredom

Well, usually I'm well below 20% and steal is zero. For an unchanged request rate, the CPU occasionally goes through the roof, and so does the CPU temperature. I'd think it would be a good idea to analyze it before throwing more hardware at it. Anyway, I'm not here to discuss my problem, only the fact that additional metrics are useful in order to investigate such problems.

Jul 29 '24 19:07 mathieudz

For an unchanged request rate, the CPU occasionally goes through the roof, and so does the CPU temperature.

Can you open a separate issue for this? That doesn't sound right, and it might be a bug or something.

Jul 29 '24 20:07 withinboredom

I've been looking at Caddy source, and correct me if I'm mistaken, but it looks like Caddy does not support modules adding custom metrics?

If I'm not mistaken, then I guess there are two solutions: either add support for a metrics module in Caddy, or make a custom route for FrankenPHP metrics?

Aug 08 '24 19:08 LauJosefsen

It is definitely possible to add custom metrics to the existing endpoint. I've done it for http://mercure.rocks: https://github.com/dunglas/mercure/blob/main/caddy/caddy.go#L30

Aug 08 '24 20:08 dunglas

What metrics would be useful to add?

num threads configured

num non-worker threads

num worker threads (foreach configured worker)

maybe as a start?

In my case, I need it for HPA: for example, if 80% of workers are busy or invalid, spin up a new pod. Ideally, we could split the numbers into non-worker threads and threads for each worker, but to begin with, the total number of threads and the number of busy or inactive threads would be enough.

Aug 10 '24 21:08 bogdandubyk

Busy over what time period or should it be instantaneous?

How is this not solved via the caddy_http_requests_in_flight metric?

Aug 10 '24 22:08 withinboredom

So you mean that if we have, for example, num_thread = 100 and caddy_http_requests_in_flight reads 80, we can treat it as 80% of workers being busy? Maybe that can work, but we would also need a metric for the total number of threads. Is that already exposed?

Currently, with FPM we are using https://github.com/hipages/php-fpm_exporter like this:

max((100 / phpfpm_total_processes_gauge) * phpfpm_active_processes_gauge) by (<<.GroupBy>>)

Aug 10 '24 22:08 bogdandubyk

I definitely agree that we should have these new metrics; I'm just wondering how they would differ from caddy_http_requests_in_flight.

From some experimentation, caddy_http_requests_in_flight will be greater than the number of workers (meaning requests are queued up waiting for workers by caddy). So, I think these metrics can be quite useful.

Here's what I'm thinking:

frankenphp_[worker_name]_total_workers: for-each worker script defined, the total workers will be given.
frankenphp_total_php_threads: the total number of php workers (this minus the above metric gives the number of threads available to handle cgi-mode requests).
frankenphp_busy_threads: the total number of threads executing php code. Note: for workers, these will always count the worker script. So a dead worker will not be counted here.
frankenphp_[worker_name]_busy_workers: the total number of workers executing a request

This should cover worker-only, cgi-mode, and mixed workloads. I think. Am I missing anything?
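As a rough sketch of how the per-worker pair of gauges could be tracked, here is a standard-library-only Go example that counts busy workers with an atomic counter and prints them in the Prometheus text exposition format. The type, the worker name "index" and the rendering by hand are assumptions for illustration; a real implementation would more likely register collectors with the Prometheus client library that Caddy already uses.

```go
package main

import (
	"fmt"
	"strings"
	"sync/atomic"
)

// workerStats holds the proposed counters for one worker script.
// All names here are hypothetical.
type workerStats struct {
	name  string       // worker name embedded in the metric name
	total int64        // configured number of workers
	busy  atomic.Int64 // workers currently executing a request
}

// handle marks one worker as busy while fn runs.
func (w *workerStats) handle(fn func()) {
	w.busy.Add(1)
	defer w.busy.Add(-1)
	fn()
}

// render returns both gauges in the Prometheus text exposition format.
func (w *workerStats) render() string {
	var b strings.Builder
	fmt.Fprintf(&b, "# TYPE frankenphp_%s_total_workers gauge\n", w.name)
	fmt.Fprintf(&b, "frankenphp_%s_total_workers %d\n", w.name, w.total)
	fmt.Fprintf(&b, "# TYPE frankenphp_%s_busy_workers gauge\n", w.name)
	fmt.Fprintf(&b, "frankenphp_%s_busy_workers %d\n", w.name, w.busy.Load())
	return b.String()
}

func main() {
	w := &workerStats{name: "index", total: 4}
	w.handle(func() {
		fmt.Print(w.render()) // busy_workers is 1 while the request runs
	})
	fmt.Print(w.render()) // busy_workers is back to 0 afterwards
}
```

Putting the worker name into the metric name follows the proposal above; a label such as worker="index" on a single metric would be the other common option.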

Aug 10 '24 23:08 withinboredom

Oh, one thing that may be of interest:

frankenphp_[worker_name]_non_request_time: the time spent between requests doing worker script stuff (GCing, resetting containers, etc). However, I'm not sure how to provide that as a measure. Maybe cumulative time?
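A sketch of the cumulative variant in Go, assuming the time between two requests can be measured per worker. The type and the metric name printed in main are hypothetical.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// nonRequestTime accumulates the time spent between requests, in nanoseconds.
type nonRequestTime struct {
	nanos atomic.Int64
}

// add records one interval of between-request work (GC, resetting containers, ...).
func (n *nonRequestTime) add(d time.Duration) {
	n.nanos.Add(int64(d))
}

// seconds returns the running total in seconds.
func (n *nonRequestTime) seconds() float64 {
	return time.Duration(n.nanos.Load()).Seconds()
}

func main() {
	var n nonRequestTime
	n.add(250 * time.Millisecond)
	n.add(750 * time.Millisecond)
	// Hypothetical metric name, printed as a monotonically increasing counter.
	fmt.Printf("frankenphp_index_non_request_seconds_total %.3f\n", n.seconds())
}
```

A counter that only ever increases fits Prometheus well: applying rate() to it gives the seconds of non-request work per second, so no averaging window has to be chosen on the server side.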

Aug 10 '24 23:08 withinboredom

@withinboredom is there any update on this? thank you.

Aug 26 '24 15:08 AlexOstrovsky

The last half of my months are usually dedicated to paid work, and the first half to open source projects. So, expect more movement next week.

Aug 28 '24 22:08 withinboredom
