go-grpc-middleware icon indicating copy to clipboard operation
go-grpc-middleware copied to clipboard

Adding `grpc_code` label to prometheus `grpc_server_handling_seconds` metric

Open jon-whit opened this issue 2 years ago • 8 comments

I'm using the prometheus middleware to measure the RPC endpoint latency for my gRPC service. I've enabled the ServerHandlingTimeHistogram.

I noticed that the grpc_code label is not included in the observed _bucket metrics (see this line of code). Consequently, I cannot effectively measure my p99 request latency for requests served successfully (e.g. where grpc_code="OK"). I'd like to set some SLOs based on percentiles of successful requests, but this middleware doesn't currently allow me to do that as far as I can tell.. Was this design choice intentional, and if so why?

Would you be open to me submitting a PR to add it?

jon-whit avatar Jul 28 '23 19:07 jon-whit

Related to https://github.com/grpc-ecosystem/go-grpc-prometheus/issues/75, but reviving since that repository is now deprecated.

jon-whit avatar Jul 28 '23 19:07 jon-whit

Sounds like the consensus in the other thread is that adding grpc_code would be too much, but that having a success/failure signal would be nice. I don't know enough about the prometheus library to say if this is possible, perhaps @bwplotka can weigh in on here too.

johanbrandhorst avatar Jul 28 '23 23:07 johanbrandhorst

Yea. I think with native histograms it would be easier decision. Wonder if it makes sense to be early adopter of native histograms here for this reason cc @beorn7

Otherwise we could add opt-in grpc_code, that's also fine.

bwplotka avatar Jul 31 '23 09:07 bwplotka

Sounds like the consensus in the other thread is that adding grpc_code would be too much, but that having a success/failure signal would be nice. I don't know enough about the prometheus library to say if this is possible, perhaps @bwplotka can weigh in on here too.

@johanbrandhorst I'm not sure I agree with that conclusion. I wouldn't say any consensus was yet achieved. There are still community members requesting it and the "consensus" was merely the response of the maintainers. But, in any case, why is it not possible to allow the client to choose whether to include it or not? For me the increase in cardinality is well worth the value I get from it. I should be able to enable this label if I want and understand that there is a tradeoff. For example, I should be able to

var serverMetrics *grpc_prometheus.ServerMetrics

serverMetrics.EnableHandlingTimeHistogram(
    ...,
    WithStatusCodeLabel(),
)  

If the WithStatusCodeLabel option is provided to the handling time histogram, then that label should be emitted and a value provided for it in each observation. The default can be to omit the grpc_code label, which maintains equivalent behavior with what is standard today, but if you want that label and are willing to increase cardinality, then that should be the developer's choice.

This work ☝️ can be in addition to switching to a native histogram so as to further reduce the impact/burden if you want to enable the grpc_code label.

If I submit a PR, will you guys consider merging it? If so, I'd be happy to submit one this week.

jon-whit avatar Aug 02 '23 16:08 jon-whit

It would be great to offer native histogram support, but I would recommend to also support classic histograms in parallel so that scraper that aren't ready for native histograms yet will still get a classic histogram.

beorn7 avatar Aug 08 '23 09:08 beorn7

There still seems to be interest from the community to add this (see #709), I'm not sure there was a conclusion from this discussion, I will defer to you on this @bwplotka.

johanbrandhorst avatar May 01 '24 02:05 johanbrandhorst

Hey! I'd also find this metrics useful because we're trying to measure our SLO burn rate and would like to set a success criteria based on success/failure and latency. If we use two separate counters, we'd have to double count messages that are both late, and errors.

I think it would be best to leave it up to the end users to decide whether they want to keep this label by dropping the labels when they scrape them.

Tatskaari avatar Jul 16 '24 15:07 Tatskaari