common icon indicating copy to clipboard operation
common copied to clipboard

Potential bug in text parser

Open vikramraman opened this issue 5 years ago • 2 comments

We use the text parser within our application to scrape prometheus metrics and are facing an issue related to one described here: https://github.com/uber/cadence/issues/3120

cadence exposes two histogram metrics named signal_info and signal_info_count.

When parsing the HELP line for the signal_info_count the parser appears to strip the _count suffix and sets the p.currentMF property: https://github.com/prometheus/common/blob/f39dfa2b000545b3a763b844012958c275a62266/expfmt/text_parse.go#L688

This ends up setting the property to the earlier signal_info histogram leading to the parser error here: https://github.com/prometheus/common/blob/f39dfa2b000545b3a763b844012958c275a62266/expfmt/text_parse.go#L485

vikramraman avatar Aug 27 '20 23:08 vikramraman

I'm not sure we ever formally defined this, however overlapping metric names like this would not be recommended. As this is the official parser, if it rejects it then this is a true positive.

In this case I'd suggest generally improving the metrics. Info as part of the name could be confused with the info metric pattern, and the count could be confused with a summary/histogram itself - which is why you're getting an error. There's also no indication in the name exactly what is being counted (entries in some struct?) nor what the unit of the size is (bytes?).

brian-brazil avatar Aug 28 '20 07:08 brian-brazil

This is indeed intended behavior, at least "kind of"…

The code normalizes names of Summaries and Histograms (and their components) by pruning any of the "magic" suffixes. Then it reports the collision. On the part of the parser, the error message is confusing because the normalization is not reported.

However, if you use prometheus/client_golang to expose the metrics, the exposition will already fail. I assume the code in uber/cadence is rendering the exposition by its own means.

I totally understand your confusion because your two histograms actually don't create any direct name collisions in the text exposition. The reason to ban them nevertheless is that it would create a collision if you switched the type from Histogram to Summary (because the pre-calculated quantiles in the signal_info_count Summary would be called signal_info_count, colliding with the count component of the signal_info Summary (or even Histogram, FWIW).

On a more general note, I'm not a fan of the "magic" suffixes because handling collisions is as confusing and hairy as we see it right here. At least, the exact collision handling behavior needed to be spelled out. I guess/hope OpenMetrics will do exactly that. I would prefer to solve the problem for good, but apparently, that's not going to happen.

beorn7 avatar Sep 07 '20 20:09 beorn7