
Ability to expose how loaded waitress is

Open robertlagrant opened this issue 6 years ago • 25 comments

Is it possible to grab Waitress metrics (how many channels currently active, how many threads active, how many TCP connections open, for example)?

Is this currently possible? If not, I think it'd be a good feature :-)

(My plan is to shove them in prometheus_client, which can compute averages, monitor for high watermark breaches and trigger Kubernetes pod scaling, etc.)

robertlagrant avatar Jan 15 '18 09:01 robertlagrant

No, waitress at this point in time doesn't make any of that sort of information available.

digitalresistor avatar Jan 16 '18 02:01 digitalresistor

As a Prometheus and Waitress user this would be great!

Obviously waitress should not depend on prometheus_client, so I believe that the way to integrate metrics into waitress without being too invasive is by using signals.

NicolasLM avatar Jan 19 '18 14:01 NicolasLM

Waitress will currently log when there are more requests in the queue than there are threads available to process them. It is logged under waitress.queue. Additional information, Prometheus support, or things of that nature require someone to actively investigate how to implement that in waitress and the performance impact thereof.
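
For anyone who wants to observe those warnings programmatically today, a minimal sketch; the handler class here is just an illustration, not part of waitress, and relies only on the "waitress.queue" logger name mentioned above:

import logging

# Count the thread-pool depletion warnings that waitress emits on the
# "waitress.queue" logger; a crude but dependency-free signal of saturation.
class QueueWarningCounter(logging.Handler):
    def __init__(self):
        super().__init__()
        self.count = 0

    def emit(self, record):
        self.count += 1


queue_warning_counter = QueueWarningCounter()
logging.getLogger("waitress.queue").addHandler(queue_warning_counter)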

digitalresistor avatar Apr 23 '19 03:04 digitalresistor

It might not be a terrible idea to allow people to interrogate for this info via a signal handler and stdout/stderr and let people build on that.

mcdonc avatar Apr 23 '19 03:04 mcdonc

It would be a great idea. I would also suggest adding a traceback dump for all threads.
Back in the Old World you could send signal USR1 to ZServer and it would dump a traceback to stdout; adding the other metrics would be a nice bonus.
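
Something in that spirit can already be wired up from application code; a minimal sketch (POSIX only, not a waitress feature):

import signal
import sys
import threading
import traceback


def dump_thread_tracebacks(signum, frame):
    # Print a stack trace for every running thread, keyed by thread name.
    names = {t.ident: t.name for t in threading.enumerate()}
    for ident, stack in sys._current_frames().items():
        print(f"--- thread {names.get(ident, ident)} ---")
        traceback.print_stack(stack)


signal.signal(signal.SIGUSR1, dump_thread_tracebacks)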

runyaga avatar Dec 19 '19 17:12 runyaga

How about making prometheus_client an optional dependency and serving the (configurable) /metrics endpoint (or maybe on a different port) if the package is available?

nsballmann avatar Oct 05 '21 15:10 nsballmann

Is it safe to check len(server.task_dispatcher.queue) to monitor the queue length? If so, it is possible to add the metric:

from prometheus_client import Gauge  # `server` is the waitress server instance

d = Gauge('waitress_task_queue_length', 'Waitress task queue length')
d.set_function(lambda: len(server.task_dispatcher.queue))

How can I get the number of connections? The Gauge.set_function method can be very useful for monitoring waitress.

robsonpeixoto avatar Nov 23 '21 23:11 robsonpeixoto

Using Prometheus to monitor the queue is not a viable solution, because the /metrics request has to wait for all tasks ahead of it in the queue to be processed before it is handled itself.

Maybe using statsd is the better solution.

https://statsd.readthedocs.io/en/v3.3/types.html#gauges

robsonpeixoto avatar Nov 24 '21 03:11 robsonpeixoto

It doesn't seem like there is consensus on the technology to use to report metrics. One way to tackle the problem could be to not pick any technology but instead provide a set of signals users could use. By signals I mean something like blinker signals, not unix signals.

Waitress would only need to define a set of signals like "new connection", "started task", "finished task"... and users would handle these signals if they want to derive metrics useful to them. Popular integrations like statsd or Prometheus could be created in a contrib part of waitress or even in a separate repository, because signals would act as an API between core waitress and the integration.

This is the approach used by Flask, Django, Celery and many more to provide developers a way to integrate and extend the frameworks.

For a very down to earth example on how this could be used to provide metrics, see how I used signals in a project of mine to create an integration with Datadog.
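
To sketch the idea, here is roughly what that could look like with blinker; the signal names and payloads below are made up for illustration and are not an existing waitress API:

from blinker import Namespace

# Hypothetical signals that waitress could emit.
_signals = Namespace()
task_queued = _signals.signal("waitress.task_queued")
task_finished = _signals.signal("waitress.task_finished")

# A subscriber derives whatever metrics it cares about, e.g. queue depth.
queue_depth = 0


@task_queued.connect
def on_task_queued(sender, **extra):
    global queue_depth
    queue_depth += 1


@task_finished.connect
def on_task_finished(sender, **extra):
    global queue_depth
    queue_depth -= 1

# Inside waitress, emitting would be a single line at the relevant spot,
# e.g. task_queued.send(dispatcher).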

NicolasLM avatar Nov 24 '21 09:11 NicolasLM

Here is a simple monkey patch solution I have used. https://gist.github.com/pauls-username/56744f5bd43b531a348bb59c36541cd7

There are probably more metrics that you could add using this method. The key is getting the server instance. Maybe adding some sort of a hook to get the server instance (without using monkey patching) would be a good official solution?

pauls-username avatar Dec 10 '21 20:12 pauls-username

We really should have an optional export of open metrics for Waitress.

That said, the question of whether to serve the /metrics handler on the same HTTP port and using the same Waitress threads... is tricky. If your Waitress is overloaded — kind of the moment when you need metrics the most — your Prometheus instance won't be able to scrape anything, because its scrape requests share the same inbound queue as everything else. That's not good.

I like the monkeypatch code sample a lot, but — seconding what the previous commenter said — the code should also start a separate HTTP server using the Prometheus routine to serve /metrics on a different port.

Without metrics, Waitress is unobservable, and that does not bode well for production.
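
That could look roughly like the following sketch, with prometheus_client's start_http_server listening on its own port next to waitress; create_server is used instead of serve() so the sketch can keep a reference to the server instance, and task_dispatcher.queue is the same internal attribute mentioned earlier in the thread:

from prometheus_client import Gauge, start_http_server
from waitress.server import create_server


def app(environ, start_response):
    # Placeholder WSGI app for the sketch.
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello"]


server = create_server(app, host="0.0.0.0", port=8080)

# /metrics is scraped on port 9100, independent of the main request queue.
start_http_server(9100)
queue_gauge = Gauge("waitress_task_queue_length", "Waitress task queue length")
queue_gauge.set_function(lambda: len(server.task_dispatcher.queue))

server.run()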

Rudd-O avatar Dec 28 '21 00:12 Rudd-O

Practically, this is not a problem. Monitoring is not for the point in time when the server no longer serves at all; by then you already know something is wrong, which is why you are looking at the monitoring data in the first place. Monitoring data is an indicator of what the cause may be and how quickly you got there.

Yes, the /metrics handler will consume resources (CPU, memory, queue entries) while serving, but this just means the server becomes overloaded slightly earlier. The trade-off is that by spending those resources you gain insight into which resources are consumed most and most urgently need to be added (if possible).

In most implementations it doesn't matter whether metrics are served by the same HTTP server or a different one; they consume the same resources. The overhead of a separate process is probably higher, and when serving from the same process, Python's GIL may play a role.

So my personal assessment is: metrics consuming a small part of the resources is better than having no metrics at all.

nsballmann avatar Jan 14 '22 08:01 nsballmann

Agreed on the assessment — note that serving on a different port allows metrics to still be collected if the main server is overloaded in the specific sense that the main port has an enormous queue.

Rudd-O avatar Jan 31 '22 14:01 Rudd-O

Using Prometheus to monitor the queue is not a viable solution, because the /metrics request has to wait for all tasks ahead of it in the queue to be processed before it is handled itself.

Maybe using statsd is the better solution.

https://statsd.readthedocs.io/en/v3.3/types.html#gauges

I don't quite understand why this is the case. Can't Prometheus expose a format like that? I'm thinking of https://prometheus.io/docs/tutorials/understanding_metric_types/#gauge

robertlagrant avatar Oct 27 '22 13:10 robertlagrant

As a Prometheus and Waitress user this would be great!

Obviously waitress should not depend on prometheus_client, so I believe that the way to integrate metrics into waitress without being too invasive is by using signals.

FYI you can create and serve Prometheus metrics without depending on any library. They're simple lines of text your program serves on a port.
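
For instance, a dependency-free sketch using only the standard library; the metric value is a hard-coded placeholder here:

from http.server import BaseHTTPRequestHandler, HTTPServer


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        # Prometheus text exposition format: plain lines of text.
        body = (
            "# HELP waitress_task_queue_length Tasks waiting in the queue\n"
            "# TYPE waitress_task_queue_length gauge\n"
            "waitress_task_queue_length 3\n"
        ).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 9100), MetricsHandler).serve_forever()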

Using Prometheus to monitor the queue is not a viable solution, because the /metrics request has to wait for all tasks ahead of it in the queue to be processed before it is handled itself.

It is viable, you just have to serve them on a different port, and possibly with a priority queue, or maybe on a separate thread entirely.

For the record, this is standard practice at Google. You don't serve your production traffic on the same port as your metrics traffic, because otherwise you don't get any metrics when you are swamped in production. Different buses for different purposes.

statsd

That's push, and therefore it's pretty hard to get right (there are many corner cases with push-based solutions, especially on dynamic service stacks). You sometimes miss metrics, sometimes you don't even know that you are supposed to be getting metrics, et cetera. With systems like Prometheus, if the service isn't giving you metrics, you will get a metric that fires an alert telling you the service is down, absolutely no exceptions.

Rudd-O avatar Oct 28 '22 00:10 Rudd-O

FYI I am using Prometheus today to monitor Plone (waitress-based) servers, and this works quite well for obtaining Zope metrics (of course, until the server is swamped, but then I also have Varnish and Nginx metrics to tell me the other aspects of the story).

Rudd-O avatar Oct 28 '22 00:10 Rudd-O

waitress is interested in exposing a hook feature that folks could tie into prometheus or whatever library/system they want but the issue has been (in order of priority highest to lowest):

  • what exact metrics to compute and expose: this one is really critical because quite a lot of metrics should just be computed in middleware... the only metrics that really belong in waitress would be around the connections and queue but I think we need a concrete proposal of what stuff waitress would be responsible for computing
  • how to allow apps to register the hooks (an api to serve() and cli args)
  • avoid adding new dependencies to waitress

mmerickel avatar Oct 28 '22 01:10 mmerickel

I don't have a particular opinion on issues 2 and 3, but on issue 1 I think we need at least:

  1. Bytes sent / received by sockets.
  2. Requests per second by method.
  3. Listen queue depth count.
  4. Active in-flight request count.

if not more, and in histogram form.
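
To make that concrete, this is roughly what the wishlist would look like as prometheus_client declarations; the metric names and label choices are purely illustrative, waitress does not expose any of these today:

from prometheus_client import Counter, Gauge, Histogram

# Illustrative declarations only.
bytes_sent = Counter("waitress_bytes_sent_total", "Bytes written to client sockets")
bytes_received = Counter("waitress_bytes_received_total", "Bytes read from client sockets")
requests = Counter("waitress_requests_total", "Requests received", ["method"])
queue_depth = Gauge("waitress_listen_queue_depth", "Tasks waiting in the dispatcher queue")
in_flight = Gauge("waitress_requests_in_flight", "Requests currently being processed")
request_duration = Histogram(
    "waitress_request_duration_seconds", "Request duration in seconds", ["method"]
)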

On the Prometheus side this is usually accomplished by directly embedding calls to the respective metrics accumulators in opportune places within the code. With Python being a dynamic language, it should be possible to conditionally import and set up e.g. the Prometheus library if it successfully imported.

I understand that waitress the project does not want to gain this dependency, though. But the amount of hooks that would need to be added for implementers of waitress-based services could be daunting as well, and it's not necessarily going to lead to more code clarity. Hooks may just be too low-level. Perhaps there is a way to implement this all using Zope-style adapters / interfaces: have waitress call the right methods of such code at the right times, and then the implementer can register the right adapter class (Prometheus / statsd / collectd), which in turn concretely implements listening on a socket and serving /metrics (or pushing data somewhere) and maintaining the counter / gauge data.
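
As a rough illustration of that shape, using a plain abstract base class instead of zope.interface; all names here are hypothetical, not an existing waitress API:

from abc import ABC, abstractmethod

from prometheus_client import Gauge, start_http_server


class ServerMonitor(ABC):
    # Hypothetical interface waitress would call at the right times.
    @abstractmethod
    def connection_opened(self): ...

    @abstractmethod
    def connection_closed(self): ...

    @abstractmethod
    def task_queued(self): ...

    @abstractmethod
    def task_finished(self): ...


class PrometheusMonitor(ServerMonitor):
    # Concrete adapter: keeps gauges and serves /metrics on a separate port.
    def __init__(self, port=9100):
        self.connections = Gauge("waitress_open_connections", "Open connections")
        self.queued = Gauge("waitress_queued_tasks", "Tasks waiting in the queue")
        start_http_server(port)

    def connection_opened(self):
        self.connections.inc()

    def connection_closed(self):
        self.connections.dec()

    def task_queued(self):
        self.queued.inc()

    def task_finished(self):
        self.queued.dec()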

Rudd-O avatar Nov 01 '22 10:11 Rudd-O

I really think hooks/signals is a great way to solve the problem at hand.

With Python being a dynamic language, it should be possible to conditionally import and set up e.g. the Prometheus library if it successfully imported.

But there are dozens of monitoring libraries and products. I get that Prometheus is popular right now, but so are Datadog and statsd. Obviously waitress cannot support all of that in the long run. So it is better to expose low-level hooks and let others use them to create extensions for each monitoring product.

But the amount of hooks that need to be added for implementers of waitress-based services can be daunting as well, and it's not necessarily going to lead to more code clarity.

I don't think it is that hard of a task, as @mmerickel said, most of the HTTP metrics should really be computed by WSGI middlewares. Waitress should really focus on the low-level stuff that is not accessible any place else.

What exact metrics to compute and expose: this one is really critical because quite a lot of metrics should just be computed in middleware... the only metrics that really belong in waitress would be around the connections and queue but I think we need a concrete proposal of what stuff waitress would be responsible for computing

With hooks waitress doesn't really have to compute any metrics. Instead it lets the subscriber know that something happened (e.g. new connection in queue, connection out of queue...). The subscriber is then responsible for computing the metrics it wants from that information.

How to allow apps to register the hooks (an api to serve() and cli args)

I don't want to beat a dead horse but signals, a la blinker, solve this problem quite elegantly by decoupling publisher from subscriber. Moreover it is a common pattern used by tons of high profile libraries (Flask, Django, Celery...) that most Python developers are already familiar with.

Avoid adding dependencies added to waitress

I get that, for signals there are two solutions:

  • Conditional import of the signal library, this is the strategy used by Flask. It requires a pip install blinker to use some extensions. Waitress could do the same: when the signal library is not installed, signals are not sent (see the sketch after this list).
  • Vendorize the signal library, this is the strategy used by Django and Celery. Signal libraries are really a few hundred lines of code, some of them exist with permissive licenses.
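
A minimal sketch of the first option, mirroring what Flask does; everything here is hypothetical, not existing waitress code:

# Hypothetical module inside waitress: signals degrade to no-ops when blinker
# is not installed, so waitress gains no hard dependency.
try:
    from blinker import Namespace

    _signals = Namespace()

    def _signal(name):
        return _signals.signal(name)

except ImportError:

    class _NoopSignal:
        def send(self, *args, **kwargs):
            pass

        def connect(self, *args, **kwargs):
            raise RuntimeError("pip install blinker to subscribe to waitress signals")

    def _signal(name):
        return _NoopSignal()


connection_opened = _signal("waitress.connection_opened")
task_queued = _signal("waitress.task_queued")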

NicolasLM avatar Nov 01 '22 10:11 NicolasLM

I would favor that approach too.

Rudd-O avatar Nov 01 '22 10:11 Rudd-O

Maintainers, if you are willing to entertain the idea of solving this problem via signals, I can get started on a PR to show what it would look like.

NicolasLM avatar Nov 01 '22 11:11 NicolasLM

@NicolasLM I'd love it if you got into this. The conversation above is correct that it impacts quite a lot of the codebase... if you wanted to propose the general API first it would be great for buy-in before you have to get too deep into things.

My primary concerns are:

  • running multiple wsgi servers on different ports, and how to disambiguate events between them
  • what parts need to be global versus per-server
  • how to configure it from serve() invocation
  • does it have any features when launched from the CLI? I would lean toward not, and if you want metrics you should wrap the serve() api and start up your prometheus endpoint next to it.

Basically all of those points are related to the API and what's global vs per-server and limiting the global part. :-)

Maybe it's as simple as defining an interface and then passing an adapter into serve() such as serve(listen='*', monitor=MyCoolPrometheusMonitor(prom_server)).

I'm not sure what prior art exists here in other wsgi servers, I think aiohttp has some when I looked a while ago.

mmerickel avatar Nov 01 '22 19:11 mmerickel

It just occurred to me that I would also like to have the connection count metric (as well as the queue count) simply because that lets me do clean blue/green deployment — I won't take down a waitress-based backend (which has already been removed from the frontend proxy) until the connection count drops to 1.

Rudd-O avatar Nov 05 '22 12:11 Rudd-O