go-grpc-prometheus
docs inconsistency: Histograms don't include grpc_code
Hi,
Unlike what the documentation in the README file states, it appears that grpc_code is not included as a label in the histogram.
grpc_server_handling_seconds_bucket - contains the counts of RPCs by status and method in respective handling-time buckets. These buckets can be used by Prometheus to estimate SLAs (see here)
// EnableHandlingTimeHistogram enables histograms being registered when
// registering the ServerMetrics on a Prometheus registry. Histograms can be
// expensive on Prometheus servers. It takes options to configure histogram
// options such as the defined buckets.
func (m *ServerMetrics) EnableHandlingTimeHistogram(opts ...HistogramOption) {
	for _, o := range opts {
		o(&m.serverHandledHistogramOpts)
	}
	if !m.serverHandledHistogramEnabled {
		m.serverHandledHistogram = prom.NewHistogramVec(
			m.serverHandledHistogramOpts,
			[]string{"grpc_type", "grpc_service", "grpc_method"},
		)
	}
	m.serverHandledHistogramEnabled = true
}
The grpc_code label is not among the predefined labels.
Is there a way to add grpc_code, or would it be better to correct the README?
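For reference, one possible workaround is a custom interceptor using plain client_golang alongside this library. This is only a rough sketch; the metric name, label set, and function names below are illustrative and not part of go-grpc-prometheus:

// Hypothetical workaround, not part of go-grpc-prometheus: time each unary RPC
// in a custom interceptor and record it under the resulting status code.
package main

import (
	"context"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"google.golang.org/grpc"
	"google.golang.org/grpc/status"
)

// handlingSecondsByCode is a custom metric; its name is illustrative.
var handlingSecondsByCode = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "myapp_grpc_server_handling_seconds",
		Help:    "Server-side RPC handling time, labeled by method and status code.",
		Buckets: prometheus.DefBuckets,
	},
	[]string{"grpc_method", "grpc_code"},
)

// codeTimingInterceptor measures the handler duration and observes it with the
// gRPC status code of the result as a label.
func codeTimingInterceptor(ctx context.Context, req interface{}, info *grpc.UnaryServerInfo, handler grpc.UnaryHandler) (interface{}, error) {
	start := time.Now()
	resp, err := handler(ctx, req)
	handlingSecondsByCode.
		WithLabelValues(info.FullMethod, status.Code(err).String()).
		Observe(time.Since(start).Seconds())
	return resp, err
}

func main() {
	prometheus.MustRegister(handlingSecondsByCode)
	srv := grpc.NewServer(grpc.UnaryInterceptor(codeTimingInterceptor))
	_ = srv // register services and serve as usual
}

A streaming variant would need an analogous wrapper registered via grpc.StreamInterceptor.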
Thanks
It'd be nice to hear @bwplotka's opinion as well, but I feel having the grpc_code dimension here would be too high cardinality for too little gain.
Mixed feelings as well, as cardinality might be a problem for almost no gain. On the other hand, IMO knowing the latency of error requests is useful, but I'm not sure whether the exact timing difference between 202 and 200, or between 504 and 500, matters... It probably matters in some specific cases (:
Also, with grpc_code here it would duplicate things like grpc_server_handled_total, as the histogram would already give you those counts.
Do you need this @NBR41, or is the mismatch between docs and implementation what you want to fix? I think a PR to fix the docs is a must-have anyway.
Agreed.
No, I don't have this particular need for now. I was interested in testing it since it's in the doc. My point was just to correct the doc to avoid potential new issues on this subject in the future :)
Cool, let's correct the docs for now, thank you for finding this! Marking as a docs bug.
I think it's not unreasonable to expect successful and failed calls to behave quite differently; in other words, failed calls are often outliers, latency-wise. For example, when a deadline is exceeded somewhere, failed calls will tend to be vastly slower than regular calls. In other cases, errors are detected early (e.g. bad requests, failed preconditions), artificially skewing the distribution towards a fat head.
A more specific example: if I see that the 95th percentile duration for a call is 1s, does this mean that it's sometimes slow but serving OK, or that those slow requests are the ones that hit some deadline-exceeded errors downstream?
Looking at the error ratio with grpc_server_handled_total can give me a hint, but it alone cannot fully answer that, since the method could perform multiple downstream calls, some of which could have a very short deadline and thus cause the outer method to be counted in a small bucket with grpc_code="deadline_exceeded".
EDIT: I'm not very concerned about the cardinality increase; after all, the number of possible gRPC status codes is bounded (unlike the method names, for which we already have a label). But knowing whether it's "OK" vs "other" would already be useful.
IMO it should have a label with SUCCESS/FAILURE values, at least on grpc_server_handling_seconds_bucket. That would allow tracking the latency of successful calls.
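A rough sketch of how such a two-valued label could be derived from a call's outcome (the helper and label names below are hypothetical, not something the library provides):

package metrics

import (
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// successLabel collapses the gRPC status of a finished call into the two
// proposed label values.
func successLabel(err error) string {
	if status.Code(err) == codes.OK {
		return "SUCCESS"
	}
	return "FAILURE"
}

The histogram would then be registered with something like []string{"grpc_type", "grpc_service", "grpc_method", "grpc_success"}, keeping the added cardinality at exactly two.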
I'm also a bit surprised by this issue; for the golden signals it is usually recommended to analyze latency for OK and error requests separately:
It’s important to distinguish between the latency of successful requests and the latency of failed requests.
The cardinality argument doesn't seem very strong given that there are only a small number of status codes. I would appreciate it if this could be fixed.