
Prometheus metrics are redundant and slow

Open renbou opened this issue 3 years ago • 43 comments

While digging into Caddy's source code I've noticed that every (!) route handler is wrapped in a metricsInstrumentedHandler which updates Prometheus metrics during request execution. While it is a great feature and should definitely be enabled by default in Caddy, it currently uses up way too much CPU time and the metrics provided are quite redundant.

Since Caddy tries every route handler in order until the request is answered, metric instrumentation runs for each handler, even if it didn't actually take part in resolving the request (which is also quite difficult to define in this context). Handler-specific metrics are therefore constantly updated with unrelated data, and as a result pretty much all of them are meaningless, only usable for tracking server-wide stats.

As an example, here are the metrics for a simple Caddy server with 2 hosts, 1 of which only has a reverse_proxy handler, while the other has 2 respond handlers, 1 reverse_proxy handler, and 1 file_server handler. The metrics were taken after running an HTTP load-testing tool on the endpoints: prometheus-metrics.txt. As seen in the example, all of the handler-specific metrics are pretty much the same for all handlers, even though in reality only the respond handler was requested.

The handler metrics provide even less use if the web server hosts multiple domains, since requests from all domains get mixed up in the metrics. However, questionable metrics wouldn't be much of an issue if they were as fast as providing server-wide metrics, but they, of course, aren't, since they are getting updated multiple times until the request is finally answered.

I ran pprof while generating load with h2load against one of the simple respond handlers, and it turned out that 73% of the time spent in caddyhttp.Server.primaryHandlerChain.ServeHTTP was in the metricsInstrumentedHandler (only 30% of the time was spent in the actual caddyhttp.StaticResponse.ServeHTTP). Here's the profile: profile.zip

I really think that metrics such as these should be server-wide where they make more sense and are quicker. https://github.com/nginxinc/nginx-prometheus-exporter could be seen as an example of similar Prometheus metrics commonly used for nginx.

renbou avatar Mar 18 '22 05:03 renbou

I've looked through the other issues regarding Prometheus metrics, and if host-based metrics are some day implemented, they shouldn't affect performance that much. Adding support for customizable metrics, on the other hand, seems like it would be an even bigger problem performance-wise.

renbou avatar Mar 18 '22 05:03 renbou

Thanks for filing this @renbou - I'll try to find some time this coming week to dig into it. My gut feeling is that some of the issue is due to the way middleware is handled and nested by Caddy, so it might be a bit more tricky to fully fix.

On the surface this is alarming:

[...] that 73% of the time spent in caddyhttp.Server.primaryHandlerChain.ServeHTTP was in the metricsInstrumentedHandler (only 30% of the time was spent in the actual caddyhttp.StaticResponse.ServeHTTP).

However percentages can sometimes be misleading, so it would be good to get some actual numbers... I haven't looked at the profile yet (and I won't this weekend - recovering from jet lag).

Caddy's handling of static responses in particular is very fast, and so it's not actually that surprising that the instrumentation surrounding it would account for more CPU time. As with all performance tuning, the most useful question isn't "how fast can we make it go?" but instead "how fast does it need to be?" In SRE terms that might translate to "we have a latency SLO of 100ms 99% of the time, but we can't reach that reliably if the metrics instrumentation takes 80ms"

hairyhenderson avatar Mar 19 '22 14:03 hairyhenderson

I might have figured out how this and the other Prometheus issues/requests could be solved. As discussed previously in #4016, a new metrics component would need to be added to the server structure of the http app, where the wanted labels could be described somewhat like this in the Caddyfile:

metrics {
	label host {http.request.host}
	label content_type {http.response.header.Content-Type}
	label static_label "same value for everyone!"
}

The problem which surfaced previously is that metrics would supposedly have to be lost on reloads. To solve this, we can globally store the currently existing prometheus.Collectors without labels (counters, GoCollector, etc) and prometheus.ObserverVecs (histograms), which are the entities performing the actual metric collection.

When provisioning a server, we'll need to setup a new prometheus.Registry with the required metrics, which are:

  1. Global metrics which have not been removed
  2. New metrics, which we also add to the global store (for the next reloads). This way we aren't losing any metrics which haven't been changed, and the changed metrics get lost anyway if we change labels, since there isn't a clear way to migrate them.

When removing the old Metrics component during a reload, all we need to do is remove the "deprecated" metrics from the global store. To do so we can store a view of the metrics which were added to the current server's prometheus.Registry, which would then allow us to find the deprecated metrics by "xoring" the old server's metrics and the new global metrics.
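
Here's a minimal sketch of that idea (the package, helpers, and names are all hypothetical, not Caddy's actual code), assuming collectors live in a process-global map keyed by metric name and get re-registered into each server's own prometheus.Registry:

package metricsstore

import (
	"sync"

	"github.com/prometheus/client_golang/prometheus"
)

var (
	globalMu         sync.Mutex
	globalCollectors = map[string]prometheus.Collector{} // survives config reloads
)

// collectorFor returns the globally stored collector for name, creating it with
// newFn on first use, and registers it into the per-server registry. Because the
// collector object itself is reused across reloads, its accumulated values survive.
func collectorFor(reg *prometheus.Registry, name string, newFn func() prometheus.Collector) prometheus.Collector {
	globalMu.Lock()
	defer globalMu.Unlock()

	c, ok := globalCollectors[name]
	if !ok {
		c = newFn()
		globalCollectors[name] = c
	}
	reg.MustRegister(c)
	return c
}

// dropDeprecated removes metrics that the new config no longer declares: the
// "xor" of the old server's metric names and the new global ones.
func dropDeprecated(oldNames, newNames map[string]bool) {
	globalMu.Lock()
	defer globalMu.Unlock()

	for name := range oldNames {
		if !newNames[name] {
			delete(globalCollectors, name)
		}
	}
}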

To solve another set of issues, including performance and the need for information from the matchers (for example, the host matcher), we can simply leverage the already-present replacer. Since, as mentioned in the beginning, labels should be configurable with placeholders, we could simply add the necessary values to the request replacer at any point during the request. For example, a host matcher could set the value for "http.server.host", the handler instrumentation (which is what currently affects performance by running the same metric instrumentation code for each route) could set the value for "http.server.handler", etc. All of the keys accumulated during the request could then be used during metric collection, which would happen one time and be one of the first handlers in the middleware chain (thus giving it access to ALL of the gathered placeholders), by simply getting them from the replacer.
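
For illustration, a rough sketch of what that could look like; the placeholder key http.server.handler and the single observation point are part of the proposal (not an existing API), while the replacer usage follows Caddy's existing *caddy.Replacer methods:

package metricslabels

import (
	"net/http"
	"time"

	"github.com/caddyserver/caddy/v2"
	"github.com/prometheus/client_golang/prometheus"
)

// recordHandlerName is what the thin per-handler wrapper would call once a
// handler actually produces a response, instead of doing any metric work itself.
func recordHandlerName(r *http.Request, moduleName string) {
	repl := r.Context().Value(caddy.ReplacerCtxKey).(*caddy.Replacer)
	repl.Set("http.server.handler", moduleName) // proposed placeholder key
}

// observeOnce runs a single time per request, at the front of the middleware
// chain, after the rest of the chain has filled the replacer with values.
func observeOnce(r *http.Request, start time.Time, hist *prometheus.HistogramVec) {
	repl := r.Context().Value(caddy.ReplacerCtxKey).(*caddy.Replacer)
	handler, _ := repl.GetString("http.server.handler") // proposed placeholder
	host, _ := repl.GetString("http.request.host")      // existing placeholder
	hist.WithLabelValues(host, handler).Observe(time.Since(start).Seconds())
}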

The way of gathering metrics I've described here will also leave us with actually meaningful handler metrics (since the http.server.handler key will be set only once, after the handler which actually gave a response returns), and the same goes for everything else, since metrics are gathered once per request instead of multiple times like they are now (which is why the current metrics are pretty much useless).

renbou avatar Mar 25 '22 01:03 renbou

Worth noting that metrics is already a handler directive to serve the actual metrics exposition endpoint (see https://caddyserver.com/docs/caddyfile/directives/metrics), so unfortunately I think we'll need a different config for it. We could have a labels directive maybe:

labels [<field> <value>] {
	<field> <value>
}

I'm thinking we might want to support some kind of setup where you could say when to evaluate the labels, i.e. if you want to add a label at the start of the request vs at the end (after all the next middlewares ran). Could be done with > and < prefixes on the field, so like >host and <content-type or something to that effect (as in, > for "on the way in" and < for "on the way out"). But idk if this makes sense?

How would your described changes behave with existing metrics collected? My understanding was that the histograms etc may need to be reset on config reload or something if their config changed? I'm not sure I totally follow. @hairyhenderson should definitely chime in here.

francislavoie avatar Mar 25 '22 02:03 francislavoie

Currently metrics is a handler field in the JSON structure and the Caddyfile; however, I'm talking about adding another metrics option to http.server in the JSON structure (it would be global in the Caddyfile). I'm not sure if this will be OK with how the Caddyfile is parsed, though.

As for when the metrics are collected, what I meant was that the metrics handler would now be initialized only once, as the first middleware in the chain, meaning that it'd start collecting metrics just before the request is handled and finish collecting them after, as well as use the replacer afterwards, since it's filled with all of the needed placeholders at that point.

Existing metrics are reset only if their labels change, pretty much. If the labels change, then there isn't a proper way to "migrate" metrics from the old label space to the new one, because they act like totally different metrics. For example, when adding a host label, it wouldn't make sense to "continue using the old metric" and just append a new label to it - the results wouldn't make sense (and this also isn't possible with the prometheus library anyway).

Actually, since the Caddy metrics to which we'd be adding label customization (pretty much the request-specific histogram metrics) all currently depend on the same set of labels, we'd only need to keep the labels which were used to provision the current server plus the current labels as a global value, instead of having to maintain multiple lists (Collectors, Observers...).

renbou avatar Mar 25 '22 06:03 renbou

http.servers JSON options would go in the Caddyfile's servers global options: https://caddyserver.com/docs/caddyfile/options#server-options

I'm still not sure I totally follow, but that's probably just my fault for not having looked into it enough to wrap my head around it. But it sounds like more of a plan than we've had before, so that's exciting!

francislavoie avatar Mar 25 '22 06:03 francislavoie

http.servers JSON options would go in the Caddyfile's servers global options: https://caddyserver.com/docs/caddyfile/options#server-options

Hmm, not all of them do - take http_port and https_port for example. Metrics can also be made global in the Caddyfile because it'd make more sense, I think.

This shouldn't be that hard to implement, but I'd like to contribute a few optimization-related changes first, since they should clean up the request-handling code a bit overall. But basically, the way we avoid losing metrics is by using a new prometheus.Registry for each server (which actually collects the metrics and bundles them up for the http endpoint) while reusing the actual metric collectors between server runs (if their labels haven't changed, meaning that critical metrics would never be lost because they don't depend on that set of labels).
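
To illustrate just that mechanism with a standalone example (not Caddy code): the counter's value lives in the collector object, not in the registry, so re-registering the same collector into a brand-new registry after a "reload" keeps its count.

package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	requests := prometheus.NewCounter(prometheus.CounterOpts{
		Name: "demo_requests_total",
		Help: "Requests handled (demo).",
	})

	oldReg := prometheus.NewRegistry()
	oldReg.MustRegister(requests)
	requests.Add(42) // traffic handled before the "reload"

	// "Reload": a fresh per-server registry re-registers the same collector.
	newReg := prometheus.NewRegistry()
	newReg.MustRegister(requests)

	mfs, _ := newReg.Gather()
	fmt.Println(mfs[0].GetMetric()[0].GetCounter().GetValue()) // prints 42
}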

renbou avatar Mar 25 '22 06:03 renbou

Hmm, not all of them do - take http_port and https_port for example

Those two options are on the http app, not on http.servers.

Metrics can also be made global in the Caddyfile because it'd make more sense, I think.

Well, it depends whether it's config per server or config on the http app, and it sounds like you're suggesting per-server. Wouldn't it be possible/make sense to have different label config per server?

francislavoie avatar Mar 25 '22 07:03 francislavoie

Do you have an idea how #4140 could fit in maybe?

francislavoie avatar Mar 26 '22 02:03 francislavoie

You're right, those options are on the http app. And the metrics config should definitely be a per-server thing (http.servers in the JSON), but in a Caddyfile we have no way of specifying multiple servers, which is why it should probably be a part of the global options.

As for the reverse proxy upstreams - I really do not see the point in adding latency and status code metrics for upstreams, since that is already available through the present metrics. A health status metric can surely be added though, and per-path configuration should be possible with what I've described here: we could add a "metrics" structure to the routes (http.servers.routes[i]) which could be a metric inclusion/exclusion list for the route.

Considering the fact that a lot of metrics might be added over the time, we can also make the server-wide metrics config look like this:

metrics {
  labels {
    host {http.server.host}
  }
  include {
    reverse_proxy {
      upstream {http.handlers.reverse_proxy.upstream}
    }
  }
  exclude {
    request_size_bytes
  }
}

which would allow enabling/disabling various metrics globally. Though we'd definitely have to think this design through a bit more, lol, since this doesn't look very nice.

renbou avatar Mar 28 '22 06:03 renbou

but in a Caddyfile we have no way of specifying multiple servers, which is why it should probably be a part of the global options

We do though; take a look again at how the servers global option works, it maps directly to the servers in the JSON, by applying to the listeners of each server as the pivot https://caddyserver.com/docs/caddyfile/options#server-options

In case it wasn't clear, I was talking about whether metrics should go inside the servers global option, or whether it should be its own global option (beside servers). I think inside servers is probably best because you get some nice properties from that for free (like servers without any argument applying to all servers regardless of port being used)

Considering the fact that a lot of metrics might be added over the time, we can also make the server-wide metrics config look like this:

Hmm, interesting. Yeah, we'll probably need a proof-of-concept implementation to see how that feels and whether it actually works 👍

Thanks for looking into this!

francislavoie avatar Mar 28 '22 06:03 francislavoie

Ah, okay. I wonder why some server options, such as automatic HTTPS, are added as global ones in the Caddyfile then...

renbou avatar Mar 28 '22 06:03 renbou

Because servers was added later, and some of those don't really make sense to configure per server (we couldn't think of a valid use case where you'd want to configure different auto-HTTPS options per server).

francislavoie avatar Mar 28 '22 07:03 francislavoie

@renbou is on to something.

I was just sent a goroutine dump from a customer, and noticed that the metrics middleware is wrapping the response for every middleware that is chained in:

#	0x6a1875	net/http.(*http2serverConn).writeHeaders+0x175								GOROOT/src/net/http/h2_bundle.go:5883
#	0x6a2abb	net/http.(*http2responseWriterState).writeChunk+0x77b							GOROOT/src/net/http/h2_bundle.go:6149
#	0x6a209d	net/http.http2chunkWriter.Write+0x1d									GOROOT/src/net/http/h2_bundle.go:6063
#	0x4b7601	bufio.(*Writer).Flush+0x61										GOROOT/src/bufio/bufio.go:607
#	0x6a2f2e	net/http.(*http2responseWriter).Flush+0x2e								GOROOT/src/net/http/h2_bundle.go:6262
#	0xf7ba94	github.com/caddyserver/caddy/v2/modules/caddyhttp.(*ResponseWriterWrapper).Flush+0x34			external/com_github_caddyserver_caddy_v2/modules/caddyhttp/responsewriter.go:51
#	0xf7ba94	github.com/caddyserver/caddy/v2/modules/caddyhttp.(*ResponseWriterWrapper).Flush+0x34			external/com_github_caddyserver_caddy_v2/modules/caddyhttp/responsewriter.go:51
#	0xf7ba94	github.com/caddyserver/caddy/v2/modules/caddyhttp.(*ResponseWriterWrapper).Flush+0x34			external/com_github_caddyserver_caddy_v2/modules/caddyhttp/responsewriter.go:51
#	0xf7ba94	github.com/caddyserver/caddy/v2/modules/caddyhttp.(*ResponseWriterWrapper).Flush+0x34			external/com_github_caddyserver_caddy_v2/modules/caddyhttp/responsewriter.go:51
#	0xf7ba94	github.com/caddyserver/caddy/v2/modules/caddyhttp.(*ResponseWriterWrapper).Flush+0x34			external/com_github_caddyserver_caddy_v2/modules/caddyhttp/responsewriter.go:51
#	0xf7ba94	github.com/caddyserver/caddy/v2/modules/caddyhttp.(*ResponseWriterWrapper).Flush+0x34			external/com_github_caddyserver_caddy_v2/modules/caddyhttp/responsewriter.go:51
#	0xf7ba94	github.com/caddyserver/caddy/v2/modules/caddyhttp.(*ResponseWriterWrapper).Flush+0x34			external/com_github_caddyserver_caddy_v2/modules/caddyhttp/responsewriter.go:51
#	0xf7ba94	github.com/caddyserver/caddy/v2/modules/caddyhttp.(*ResponseWriterWrapper).Flush+0x34			external/com_github_caddyserver_caddy_v2/modules/caddyhttp/responsewriter.go:51
#	0xf7ba94	github.com/caddyserver/caddy/v2/modules/caddyhttp.(*ResponseWriterWrapper).Flush+0x34			external/com_github_caddyserver_caddy_v2/modules/caddyhttp/responsewriter.go:51
#	0xf7ba94	github.com/caddyserver/caddy/v2/modules/caddyhttp.(*ResponseWriterWrapper).Flush+0x34			external/com_github_caddyserver_caddy_v2/modules/caddyhttp/responsewriter.go:51
#	0xf7ba94	github.com/caddyserver/caddy/v2/modules/caddyhttp.(*ResponseWriterWrapper).Flush+0x34			external/com_github_caddyserver_caddy_v2/modules/caddyhttp/responsewriter.go:51
#	0xf7ba94	github.com/caddyserver/caddy/v2/modules/caddyhttp.(*ResponseWriterWrapper).Flush+0x34			external/com_github_caddyserver_caddy_v2/modules/caddyhttp/responsewriter.go:51
#	0xf7ba94	github.com/caddyserver/caddy/v2/modules/caddyhttp.(*ResponseWriterWrapper).Flush+0x34			external/com_github_caddyserver_caddy_v2/modules/caddyhttp/responsewriter.go:51
#	0xf7ba94	github.com/caddyserver/caddy/v2/modules/caddyhttp.(*ResponseWriterWrapper).Flush+0x34			external/com_github_caddyserver_caddy_v2/modules/caddyhttp/responsewriter.go:51
#	0x12d5a3b	github.com/caddyserver/caddy/v2/modules/caddyhttp/reverseproxy.(*maxLatencyWriter).Write+0x1db		external/com_github_caddyserver_caddy_v2/modules/caddyhttp/reverseproxy/streaming.go:226
#	0x12d56fe	github.com/caddyserver/caddy/v2/modules/caddyhttp/reverseproxy.Handler.copyBuffer+0x31e			external/com_github_caddyserver_caddy_v2/modules/caddyhttp/reverseproxy/streaming.go:187
#	0x12d51fa	github.com/caddyserver/caddy/v2/modules/caddyhttp/reverseproxy.Handler.copyResponse+0x2fa		external/com_github_caddyserver_caddy_v2/modules/caddyhttp/reverseproxy/streaming.go:163
#	0x12d01fb	github.com/caddyserver/caddy/v2/modules/caddyhttp/reverseproxy.(*Handler).reverseProxy+0xfbb		external/com_github_caddyserver_caddy_v2/modules/caddyhttp/reverseproxy/reverseproxy.go:723
#	0x12ce494	github.com/caddyserver/caddy/v2/modules/caddyhttp/reverseproxy.(*Handler).ServeHTTP+0xe94		external/com_github_caddyserver_caddy_v2/modules/caddyhttp/reverseproxy/reverseproxy.go:464
#	0xf7725e	github.com/caddyserver/caddy/v2/modules/caddyhttp.(*metricsInstrumentedHandler).ServeHTTP+0x51e		external/com_github_caddyserver_caddy_v2/modules/caddyhttp/metrics.go:133
#	0xf7d09a	github.com/caddyserver/caddy/v2/modules/caddyhttp.wrapMiddleware.func1.1+0x3a				external/com_github_caddyserver_caddy_v2/modules/caddyhttp/routes.go:272
#	0xf6d82e	github.com/caddyserver/caddy/v2/modules/caddyhttp.HandlerFunc.ServeHTTP+0x2e				external/com_github_caddyserver_caddy_v2/modules/caddyhttp/caddyhttp.go:57

The offending code is routes.go, line 260:

func wrapMiddleware(ctx caddy.Context, mh MiddlewareHandler) Middleware {
	// wrap the middleware with metrics instrumentation
	metricsHandler := newMetricsInstrumentedHandler(caddy.GetModuleName(mh), mh)

This will result in excessive allocations and a relatively serious performance penalty. @hairyhenderson I know you're busy, but do you think we can escalate this? We will need to fix this or remove this before the 2.5 release to ensure stability and reliability.

(It's my bad for not catching this in code review, but it was a large change, and I didn't realize the extent to which it was creating new response recorders.)

mholt avatar Mar 29 '22 18:03 mholt

@mholt I've described a way to fix this, as well as add various requested metrics-related features, in the discussion above, and I've begun thinking about how to actually implement it. I was planning to contribute a few other optimization-related PRs first, but if 2.5 is coming up soon (I haven't really seen a deadline), I can start working on this instead (probably closer to the weekend, though).

renbou avatar Mar 29 '22 18:03 renbou

I am a little behind on things. Sorry I haven't gotten around to reading your latest posts in detail yet!

I suspect that moving the newMetricsInstrumentedHandler out of the middleware-chaining loop would be a good solution in terms of fixing the redundancy & performance problems. If that breaks the desired metrics functionality, then the initialization of metrics instrumentation may need to be split into two parts, so that the response is only "recorded" (yes I know it's not buffering the response, but still) once.

We don't have a firm deadline for 2.5, but soon-ish would be nice. :) Was kind of my hope to get it out at the beginning of April.

mholt avatar Mar 29 '22 18:03 mholt

The proposed solution would instrument the handler chain with the metrics middleware only once, globally, while handler-specific information is preserved by wrapping each handler with a middleware which simply sets info about that handler (i.e. the first time a handler returns a response, its wrapper would set values in the request context identifying which handler answered).
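
A rough sketch of that wiring, using plain net/http types for brevity since Caddy's internal middleware signatures differ (this is illustrative only, not the eventual implementation):

package metricschain

import (
	"context"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

type handlerNameKey struct{}

// rememberHandler is the cheap per-handler wrapper: it does no metric work,
// it only records the name of the handler that answered the request.
func rememberHandler(name string, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		next.ServeHTTP(w, r)
		if slot, ok := r.Context().Value(handlerNameKey{}).(*string); ok && *slot == "" {
			*slot = name // the innermost handler that returned sets the name
		}
	})
}

// instrumentOnce sits at the front of the chain and makes exactly one
// observation per request, using whatever name the wrappers recorded.
func instrumentOnce(hist *prometheus.HistogramVec, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		var handlerName string
		r = r.WithContext(context.WithValue(r.Context(), handlerNameKey{}, &handlerName))

		start := time.Now()
		next.ServeHTTP(w, r)
		hist.WithLabelValues(handlerName).Observe(time.Since(start).Seconds())
	})
}

The point of this shape is that the per-handler wrapper allocates nothing and touches no Prometheus state, which is the part that is currently expensive.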

renbou avatar Mar 29 '22 18:03 renbou

This still needs to be fixed soon, but I don't think it will be done by the 2.5.1 release; bumping back to 2.5.2.

mholt avatar May 06 '22 14:05 mholt

@renbou Is this still something you'd be interested in working on?

@hairyhenderson Any chance we could get your help on this one? Otherwise the fix may be less than optimal for Prometheus/Grafana users.

mholt avatar Sep 15 '22 20:09 mholt

Here's a simple comparison of serving a static file with and then without metrics:

[benchmark screenshot]

(config is literally just :1234 \n file_server and index2.html is the Caddy website homepage)

Without metrics as they are currently implemented, Caddy can be about 12-15% faster. (For this test, I simply disabled the wrapping of each handler with the metrics handler.)

mholt avatar Sep 15 '22 21:09 mholt

FWIW, we serve metrics on a different port than we use for proxying/file serving, so we hope that use case can still be supported however this is implemented in the future. I.e., if you have to use the metrics directive to opt in to metrics collection, it should also be possible to not require a path, so that those metrics can be served from a different matcher, for example metrics /metrics.

ReillyBrogan avatar Oct 04 '22 18:10 ReillyBrogan

FWIW we serve metrics out of a different port than we use for proxying/file server, so we hope that that usecase can still be supported with however this is implemented in the future.

It definitely will, we have no plans to change that :+1:

The metrics directive is just about serving a metrics endpoint. If we introduce a directive for enabling metrics, it would probably be a different one (though the naming is unfortunate).

francislavoie avatar Oct 04 '22 20:10 francislavoie

Any updates on this, or maybe an ETA?

asychev avatar Jan 18 '23 14:01 asychev

We haven't heard from @renbou in quite some time. I feel like the last we heard of them roughly coincides with the war in Ukraine (their profile mentions Saint-Petersburg, Russia) around March/April 2022. :thinking:

francislavoie avatar Jan 19 '23 02:01 francislavoie

@renbou any chance you'll continue the work on this?

francislavoie avatar Apr 13 '23 14:04 francislavoie

Hi, guys! I'm really sorry about my awful response time and the "abandoned" state I've left this issue in. Turns out that working a full-time job and studying for a bachelor's doesn't leave you with much free time (lol). I'd be really glad to take on this issue, as well as some of the other optimizations I've implemented in a different PR, and refactor/complete them.

I think we need to discuss, however, what is actually expected to be done with regard to the metrics. The last time I tried working on this, it wasn't really clear how we want metrics to work and look in the config. Also, I tried to bite off more than I could chew in a single sprint of changes (implement reloadable metrics, configurable metrics, etc), so I think we need to decide what should be prioritized right now.

The main issue is performance, right? I'll check out the changes which have occurred in the codebase and see what can be optimized. Then we can probably move on to the more feature-related metrics tasks which need doing.

renbou avatar May 04 '23 11:05 renbou

Hi @renbou!!! Great to hear from you!

Hope your work and school is going well. I definitely know how that goes. What are you studying?

Would you like an invite to our developers Slack? That might make some back-and-forth discussion easier :+1: Edit: You're already in our Slack :sweat_smile: Want to pop in and say hi? We can discuss things there if you prefer, or just here. Either way.

mholt avatar May 04 '23 14:05 mholt

Hello!
Has there been any progress on this matter? Since you mentioned continuing the discussion over on a Slack channel, I'm afraid nothing has been shared here yet.

Tmpod avatar Jul 28 '23 22:07 Tmpod

Is this completely abandoned?

asychev avatar Feb 06 '24 10:02 asychev

@asychev The core Caddy team doesn't have any experience with metrics, so we're not equipped to make any progress. We need help.

francislavoie avatar Feb 06 '24 10:02 francislavoie