go-grpc-middleware
go-grpc-middleware copied to clipboard
Adding `grpc_code` label to prometheus `grpc_server_handling_seconds` metric
I'm using the prometheus middleware to measure the RPC endpoint latency for my gRPC service. I've enabled the ServerHandlingTimeHistogram.
I noticed that the grpc_code label is not included in the observed _bucket metrics (see this line of code). Consequently, I cannot effectively measure my p99 request latency for requests served successfully (e.g. where grpc_code="OK"). I'd like to set some SLOs based on percentiles of successful requests, but this middleware doesn't currently allow me to do that as far as I can tell.. Was this design choice intentional, and if so why?
Would you be open to me submitting a PR to add it?
Related to https://github.com/grpc-ecosystem/go-grpc-prometheus/issues/75, but reviving since that repository is now deprecated.
Sounds like the consensus in the other thread is that adding grpc_code would be too much, but that having a success/failure signal would be nice. I don't know enough about the prometheus library to say if this is possible, perhaps @bwplotka can weigh in on here too.
Yea. I think with native histograms it would be easier decision. Wonder if it makes sense to be early adopter of native histograms here for this reason cc @beorn7
Otherwise we could add opt-in grpc_code, that's also fine.
Sounds like the consensus in the other thread is that adding
grpc_codewould be too much, but that having a success/failure signal would be nice. I don't know enough about the prometheus library to say if this is possible, perhaps @bwplotka can weigh in on here too.
@johanbrandhorst I'm not sure I agree with that conclusion. I wouldn't say any consensus was yet achieved. There are still community members requesting it and the "consensus" was merely the response of the maintainers. But, in any case, why is it not possible to allow the client to choose whether to include it or not? For me the increase in cardinality is well worth the value I get from it. I should be able to enable this label if I want and understand that there is a tradeoff. For example, I should be able to
var serverMetrics *grpc_prometheus.ServerMetrics
serverMetrics.EnableHandlingTimeHistogram(
...,
WithStatusCodeLabel(),
)
If the WithStatusCodeLabel option is provided to the handling time histogram, then that label should be emitted and a value provided for it in each observation. The default can be to omit the grpc_code label, which maintains equivalent behavior with what is standard today, but if you want that label and are willing to increase cardinality, then that should be the developer's choice.
This work ☝️ can be in addition to switching to a native histogram so as to further reduce the impact/burden if you want to enable the grpc_code label.
If I submit a PR, will you guys consider merging it? If so, I'd be happy to submit one this week.
It would be great to offer native histogram support, but I would recommend to also support classic histograms in parallel so that scraper that aren't ready for native histograms yet will still get a classic histogram.
There still seems to be interest from the community to add this (see #709), I'm not sure there was a conclusion from this discussion, I will defer to you on this @bwplotka.
Hey! I'd also find this metrics useful because we're trying to measure our SLO burn rate and would like to set a success criteria based on success/failure and latency. If we use two separate counters, we'd have to double count messages that are both late, and errors.
I think it would be best to leave it up to the end users to decide whether they want to keep this label by dropping the labels when they scrape them.