go-grpc-prometheus
docs inconsistency: Histograms don't include grpc_code
Hi,
Unlike what the documentation in the README file states, it appears that grpc_code is not included as a label in the histogram.
grpc_server_handling_seconds_bucket - contains the counts of RPCs by status and method in respective handling-time buckets. These buckets can be used by Prometheus to estimate SLAs (see here)
// EnableHandlingTimeHistogram enables histograms being registered when
// registering the ServerMetrics on a Prometheus registry. Histograms can be
// expensive on Prometheus servers. It takes options to configure histogram
// options such as the defined buckets.
func (m *ServerMetrics) EnableHandlingTimeHistogram(opts ...HistogramOption) {
	for _, o := range opts {
		o(&m.serverHandledHistogramOpts)
	}
	if !m.serverHandledHistogramEnabled {
		m.serverHandledHistogram = prom.NewHistogramVec(
			m.serverHandledHistogramOpts,
			[]string{"grpc_type", "grpc_service", "grpc_method"},
		)
	}
	m.serverHandledHistogramEnabled = true
}
The grpc_code label is not among the predefined labels.
Is there a way to add grpc_code, or would it be better to correct the README?
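For reference, one possible workaround is a custom interceptor using plain client_golang alongside this library. This is only a rough sketch; the metric name, label set, and function names below are illustrative and not part of go-grpc-prometheus:

// Hypothetical workaround, not part of go-grpc-prometheus: time each unary RPC
// in a custom interceptor and record it under the resulting status code.
package main

import (
	"context"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"google.golang.org/grpc"
	"google.golang.org/grpc/status"
)

// handlingSecondsByCode is a custom metric; its name is illustrative.
var handlingSecondsByCode = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "myapp_grpc_server_handling_seconds",
		Help:    "Server-side RPC handling time, labeled by method and status code.",
		Buckets: prometheus.DefBuckets,
	},
	[]string{"grpc_method", "grpc_code"},
)

// codeTimingInterceptor measures the handler duration and observes it with the
// gRPC status code of the result as a label.
func codeTimingInterceptor(ctx context.Context, req interface{}, info *grpc.UnaryServerInfo, handler grpc.UnaryHandler) (interface{}, error) {
	start := time.Now()
	resp, err := handler(ctx, req)
	handlingSecondsByCode.
		WithLabelValues(info.FullMethod, status.Code(err).String()).
		Observe(time.Since(start).Seconds())
	return resp, err
}

func main() {
	prometheus.MustRegister(handlingSecondsByCode)
	srv := grpc.NewServer(grpc.UnaryInterceptor(codeTimingInterceptor))
	_ = srv // register services and serve as usual
}

A streaming variant would need an analogous wrapper registered via grpc.StreamInterceptor.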
Thanks
It'd be nice to hear @bwplotka's opinion as well, but I feel having the grpc_code dimension here would be too high cardinality for too little gain.
Mixed feelings as well, as cardinality might be a problem for almost no gain. On the other hand, IMO knowing the latency of error requests is useful, but I'm not sure whether the exact timing difference between 202 and 200, or between 504 and 500, matters... It probably matters in some specific cases (:
Also, with grpc_code here it would duplicate things like grpc_server_handled_total, as the histogram would already give you those counts.
Do you need this @NBR41, or is the mismatch between docs and implementation what you want to fix? I think a PR to fix the docs is a must-have anyway.
Agreed.
No, I don't have this particular need for now. I was interested in testing it since it's in the doc. My point was just to correct the doc to avoid potential new issues on this subject in the future :)
Cool, let's correct the docs for now, thank you for finding this! Marking as a docs bug.
I think it's not unreasonable to expect successful and failed calls to behave quite differently; in other words, failed calls are often outliers, latency-wise. For example, when a deadline is exceeded somewhere, failed calls will tend to be vastly slower than regular calls. In other cases, errors are detected early (e.g. bad requests, failed preconditions), artificially skewing the distribution towards a fat head.
A more specific example: if I see that the 95th percentile duration for a call is 1s, does this mean that it's sometimes slow but serving OK, or that those slow requests are the ones that hit some deadline-exceeded errors downstream?
Looking at the error ratio with grpc_server_handled_total can give me a hint, but it alone cannot fully answer that, since the method could perform multiple downstream calls, some of which could have a very short deadline and thus cause the outer method to be counted in a small bucket with grpc_code="deadline_exceeded".
EDIT: I'm not very concerned about the cardinality increase; after all, the number of possible gRPC status codes is bounded (unlike the method names, for which we already have a label). But knowing whether it's "OK" vs "other" would already be useful.
IMO it should have a label with SUCCESS/FAILURE values, at least on grpc_server_handling_seconds_bucket. That would allow tracking the latency of successful calls.
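A rough sketch of how such a two-valued label could be derived from a call's outcome (the helper and label names below are hypothetical, not something the library provides):

package metrics

import (
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// successLabel collapses the gRPC status of a finished call into the two
// proposed label values.
func successLabel(err error) string {
	if status.Code(err) == codes.OK {
		return "SUCCESS"
	}
	return "FAILURE"
}

The histogram would then be registered with something like []string{"grpc_type", "grpc_service", "grpc_method", "grpc_success"}, keeping the added cardinality at exactly two.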
I'm also a bit surprised by this issue; for the golden signals it is usually recommended to analyze latency for OK and error requests separately:
It’s important to distinguish between the latency of successful requests and the latency of failed requests.
The cardinality argument doesn't seem very strong given that there are only a small number of status codes. I would appreciate it if this could be fixed.