Go metrics go_memstats_alloc_bytes and go_memstats_alloc_bytes_total collide in OpenMetrics and are still both exposed

Open hairyhenderson opened this issue 4 years ago • 12 comments

I received a bug report https://github.com/caddyserver/caddy/issues/3940 which seems to be caused by 2 metrics exposed by the Go collector: go_memstats_alloc_bytes and go_memstats_alloc_bytes_total - in effect, these names seem to be ambiguous with respect to the OpenMetrics "magic" _total suffix.

This seems to be related to https://github.com/prometheus/common/issues/253#issuecomment-688499208, though the conclusion in that issue seems to be "it's a bad idea, but won't break anything".

The metrics in question are defined here: https://github.com/prometheus/client_golang/blob/6007b2b5cae01203111de55f753e76d8dac1f529/prometheus/go_collector.go#L88..L104

I'm not super certain how to resolve this without removing or renaming one of the metrics, given that the python client seems to be behaving correctly with respect to the OpenMetrics spec (specifically https://github.com/OpenObservability/OpenMetrics/blob/master/specification/OpenMetrics.md#suffixes).

At the very least there seems to be an inconsistency between the Python and Go implementations...

LMK if there's anything I'm missing - on vacation so my brain isn't running in high-gear 😉

hairyhenderson avatar Dec 30 '20 13:12 hairyhenderson

Your issue here is that you tried to parse Prometheus text format with the OpenMetrics parser, so there is no bug here.

However, one of these will need a rename when client_golang gains full OpenMetrics support in a future version.

brian-brazil avatar Dec 30 '20 14:12 brian-brazil

Your issue here is that you tried to parse Prometheus text format with the OpenMetrics parser, so there is no bug here.

🤔 My impression was that setting the Accept: application/openmetrics-text; version=0.0.1 header when the promhttp.HandlerOpts option EnableOpenMetrics is set to true would produce output in the OpenMetrics text format (the Content-Type response header certainly implies it).

Looks like the python client isn't relevant to this issue - seems the Go parser behaves the same way:

Here's some test code that demonstrates the issue:

package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
	"github.com/prometheus/common/expfmt"
)

func main() {
	// Serve the default registry with OpenMetrics support enabled.
	h := promhttp.InstrumentMetricHandler(prometheus.DefaultRegisterer,
		promhttp.HandlerFor(prometheus.DefaultGatherer, promhttp.HandlerOpts{EnableOpenMetrics: true}),
	)
	srv := httptest.NewServer(h)
	defer srv.Close()

	// Negotiate the OpenMetrics text format.
	req, _ := http.NewRequest(http.MethodGet, srv.URL, nil)
	req.Header.Set("Accept", "application/openmetrics-text; version=0.0.1")

	resp, _ := srv.Client().Do(req)

	fmt.Printf("%s %d\n", resp.Proto, resp.StatusCode)
	fmt.Printf("Header:\n")
	for k, v := range resp.Header {
		fmt.Printf("\t%s: %v\n", k, v)
	}

	// Parsing the body panics on the duplicate HELP line.
	parser := &expfmt.TextParser{}
	_, err := parser.TextToMetricFamilies(resp.Body)
	if err != nil {
		panic(err)
	}
}

When I run it, I see:

$ go run .
HTTP/1.1 200
Header:
        Content-Type: [application/openmetrics-text; version=0.0.1; charset=utf-8]
        Date: [Wed, 30 Dec 2020 16:26:08 GMT]
panic: text format parsing error in line 19: second HELP line for metric name "go_memstats_alloc_bytes"

goroutine 1 [running]:
main.main()
        /Users/hairyhenderson/go/src/github.com/hairyhenderson/temp/main.go:34 +0x5d1
exit status 2

However, one of these will need a rename when client_golang gains full OpenMetrics support in a future version.

My impression based on the above is that even the partial OpenMetrics support present now is incompatible with the output that the Go collector produces... I'm confused now about how this works when Prometheus scrapes this output - though perhaps it doesn't use common/expfmt for parsing?

hairyhenderson avatar Dec 30 '20 16:12 hairyhenderson

That code example is using the Prometheus text format parser with OpenMetrics input, which can also run into issues. This still doesn't indicate a bug, as you're conflating the two formats.
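
For contrast, a minimal sketch against the same default registry: if no OpenMetrics Accept header is sent, the handler falls back to the Prometheus text format, and TextParser gets through it without error.

package main

import (
	"fmt"
	"net/http/httptest"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
	"github.com/prometheus/common/expfmt"
)

func main() {
	h := promhttp.HandlerFor(prometheus.DefaultGatherer, promhttp.HandlerOpts{EnableOpenMetrics: true})
	srv := httptest.NewServer(h)
	defer srv.Close()

	// No OpenMetrics Accept header, so the handler falls back to the
	// Prometheus text format - the format TextParser actually speaks.
	resp, err := srv.Client().Get(srv.URL)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	parser := &expfmt.TextParser{}
	mfs, err := parser.TextToMetricFamilies(resp.Body)
	if err != nil {
		panic(err) // not reached: format and parser match
	}
	fmt.Println(len(mfs), "metric families parsed")
}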

brian-brazil avatar Dec 30 '20 17:12 brian-brazil

While your examples don't directly demonstrate it, there is an issue there in the OpenMetrics output:

# HELP go_memstats_alloc_bytes Number of bytes allocated and still in use.
# TYPE go_memstats_alloc_bytes gauge
go_memstats_alloc_bytes 647440.0
# HELP go_memstats_alloc_bytes Total number of bytes allocated, even if freed.
# TYPE go_memstats_alloc_bytes counter
go_memstats_alloc_bytes_total 647440.0

as there's a repeated metric family name.

brian-brazil avatar Dec 30 '20 17:12 brian-brazil

That code example is using the Prometheus text format parser

Ah - I think I follow... common/expfmt's TextParser.TextToMetricFamilies is parsing for Prometheus format explicitly? My impression was that it would parse according to the OpenMetrics rules given OpenMetrics negotiation.

Is there a Go equivalent for the python prometheus_client.openmetrics.parser? I presume that is an OpenMetrics-format parser?

Also, to answer my earlier confusion about how Prometheus itself works - it has its own parser which doesn't seem to care about duplicate names.
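
Something like this, I believe (a sketch only - it assumes Prometheus's pkg/textparse package and its API as I understand it):

package main

import (
	"fmt"
	"io"

	"github.com/prometheus/prometheus/pkg/textparse"
)

func main() {
	// The colliding pair, as OpenMetrics input.
	input := []byte(`# TYPE go_memstats_alloc_bytes gauge
go_memstats_alloc_bytes 647440.0
# TYPE go_memstats_alloc_bytes counter
go_memstats_alloc_bytes_total 647440.0
# EOF
`)

	// The streaming parser walks entries one by one and never checks
	// whether a metric family name was already used.
	p := textparse.NewOpenMetricsParser(input)
	for {
		entry, err := p.Next()
		if err == io.EOF {
			break
		}
		if err != nil {
			panic(err)
		}
		if entry == textparse.EntrySeries {
			series, _, v := p.Series()
			fmt.Printf("%s = %v\n", series, v)
		}
	}
}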

While your examples don't directly demonstrate it, there is an issue there in the OpenMetrics output:

Isn't that why the error text format parsing error in line 19: second HELP line for metric name "go_memstats_alloc_bytes" is given?

That seems to be caused by https://github.com/prometheus/common/blob/master/expfmt/openmetrics_create.go#L92, where the _total suffix is stripped for the HELP line.
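
In other words (a simplified sketch of the effect, not the actual encoder code):

package main

import (
	"fmt"
	"strings"
)

// familyName mimics what the OpenMetrics encoder effectively does when
// writing HELP/TYPE lines: a counter's "_total" suffix is stripped to
// form the metric family name.
func familyName(sampleName string, isCounter bool) string {
	if isCounter {
		return strings.TrimSuffix(sampleName, "_total")
	}
	return sampleName
}

func main() {
	fmt.Println(familyName("go_memstats_alloc_bytes", false))      // go_memstats_alloc_bytes
	fmt.Println(familyName("go_memstats_alloc_bytes_total", true)) // go_memstats_alloc_bytes (collision!)
}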

hairyhenderson avatar Dec 30 '20 20:12 hairyhenderson

Ah - I think I follow... common/expfmt's TextParser.TextToMetricFamilies is parsing for Prometheus format explicitly?

Yes. A full Go OpenMetrics parser has yet to be written.

Isn't that why the error text format parsing error in line 19: second HELP line for metric name "go_memstats_alloc_bytes" is given?

Kinda - trying to parse one format as another isn't exactly wise, and something else may break instead. The original report would have failed for other reasons anyway, as typical Prometheus format output is not valid OpenMetrics output.
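
For one concrete example: OpenMetrics requires the exposition to end with a # EOF line, which the Prometheus text format doesn't produce. So even a trivial Prometheus-format exposition like

# TYPE go_goroutines gauge
go_goroutines 6

is invalid OpenMetrics on that ground alone.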

brian-brazil avatar Dec 30 '20 20:12 brian-brazil

Parser shenanigans aside, the actual issue here is that the standard metrics exposed by the Go collector are fine with the old Prometheus formats, but run into a known issue with OpenMetrics: since OpenMetrics shaves the _total off a counter's name to form the metric family name, there is a collision if there is also a gauge whose name equals the counter's name without the _total. This was always well understood, but only now do I realize that we have an instance of it in the metrics almost every instrumented Go binary exposes. That changes the problem from an edge case to a highly relevant one.

For full OpenMetrics support, we need to change the metric name. But what to do right now? The current preliminary OpenMetrics implementation serves invalid OpenMetrics format. Interestingly, Prometheus is "fine" with it: It ingests both metrics. For the metadata API, it drops the gauge and keeps the counter (which I guess is just because the counter comes later in the exposition and overrides the gauge).

We could just leave things as is, accepting that the OpenMetrics format exposed is technically invalid (while Prometheus currently handles it in a way that at least doesn't break anyone when activating OpenMetrics).

Or we could fix the format by either not exposing one of the metrics or by renaming one of the metrics whenever OpenMetrics is negotiated. However, this would break everyone that relies on the metrics as they are now.

beorn7 avatar Dec 31 '20 21:12 beorn7

We could for now always expose counters as unknown for OM, or be smarter about when we expose them as counters rather than unknown.
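
Since unknown-typed families keep their full sample name (no _total stripping), the colliding pair would become valid again. A sketch of what the exposition could then look like:

# HELP go_memstats_alloc_bytes Number of bytes allocated and still in use.
# TYPE go_memstats_alloc_bytes gauge
go_memstats_alloc_bytes 647440.0
# HELP go_memstats_alloc_bytes_total Total number of bytes allocated, even if freed.
# TYPE go_memstats_alloc_bytes_total unknown
go_memstats_alloc_bytes_total 647440.0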

brian-brazil avatar Dec 31 '20 21:12 brian-brazil

It made me quite happy that tools like Grafana are finally starting to make use of metadata in tooltips etc. Obfuscating counters as unknown is not ideal for them.

beorn7 avatar Jan 02 '21 18:01 beorn7

There is the option of only switching to unknown if there is a collision.

brian-brazil avatar Jan 04 '21 09:01 brian-brazil

Yes, that's what I had in mind. It's still not ideal… 😞

beorn7 avatar Jan 04 '21 10:01 beorn7

Yeah, but I think it's the best we can do currently.

We might also want to decide on a new name for one of the metrics now and implement it as a duplicate, as that'll make the future transition easier. alloc->allocated maybe?
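
In the default Prometheus text format, the transition period could then look like this (a sketch, using that hypothetical new name for the counter):

# HELP go_memstats_alloc_bytes_total Total number of bytes allocated, even if freed.
# TYPE go_memstats_alloc_bytes_total counter
go_memstats_alloc_bytes_total 647440.0
# HELP go_memstats_allocated_bytes_total Total number of bytes allocated, even if freed.
# TYPE go_memstats_allocated_bytes_total counter
go_memstats_allocated_bytes_total 647440.0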

brian-brazil avatar Jan 04 '21 10:01 brian-brazil