Mismatching value type between monitored project descriptors

grzzboot opened this issue on Dec 17, 2021

Hi! We started using GMP as soon as it came into public preview, and it feels like a good fit for our needs. Thank you for your work!

We have now encountered a strange error that occurs with some of our service metrics. What happens is that the query for the metrics in question completely stops working and we get this error:

invalid parameter "query": substitute queries failed: convert vector selector failed: mismatching value type between monitored project descriptors: DISTRIBUTION for  vs DOUBLE for 

The failing metrics are the standard Spring Boot/Micrometer http_server_requests_seconds_count and http_server_requests_seconds_sum. Other metrics work fine from what we have observed.

We tried disabling scraping of all our services, even the ones that had previously worked, but that had no effect: it was still impossible to query those two metrics. After some desperate hours of searching without finding any hints, we decided to simply delete the metricDescriptors for these metrics using the Monitoring API. Then, instantly, the query started working again.
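
(Roughly, this kind of cleanup can be scripted against the Cloud Monitoring API; the sketch below uses the google-cloud-monitoring Python client, and the project ID and descriptor names are only illustrative placeholders, not necessarily what GMP registered in our project.)

from google.cloud import monitoring_v3

client = monitoring_v3.MetricServiceClient()

# List descriptors first to find the exact names of the conflicting metrics
# (the prometheus.googleapis.com/ prefix is an assumption about how GMP names them).
request = monitoring_v3.ListMetricDescriptorsRequest(
    name="projects/MY_PROJECT",
    filter='metric.type = starts_with("prometheus.googleapis.com/http_server_requests_seconds")',
)
for descriptor in client.list_metric_descriptors(request=request):
    print(descriptor.type)

# Delete a conflicting descriptor by its full resource name (illustrative value).
client.delete_metric_descriptor(
    name="projects/MY_PROJECT/metricDescriptors/"
         "prometheus.googleapis.com/http_server_requests_seconds_count/counter"
)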

We then started enabling scraping again, service by service, until the problem suddenly came back. That let us identify the services causing the issue, but we honestly don't have a clue why they cause a problem. It doesn't seem like something that should be an issue...

The only difference we can see between the working set of services and the non-working ones is that they produce slightly different metric sets for the http_server_requests_seconds* metrics.

Working:

# HELP http_server_requests_seconds  
# TYPE http_server_requests_seconds histogram
http_server_requests_seconds{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/manage/health",quantile="0.6",} 0.109051904
http_server_requests_seconds{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/manage/health",quantile="0.7",} 0.109051904
http_server_requests_seconds{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/manage/health",quantile="0.8",} 0.109051904
http_server_requests_seconds{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/manage/health",quantile="0.9",} 0.109051904
http_server_requests_seconds{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/manage/health",quantile="0.95",} 0.109051904
http_server_requests_seconds{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/manage/health",quantile="0.99",} 0.109051904
http_server_requests_seconds{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/manage/health",quantile="0.999",} 0.109051904
http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/manage/health",le="0.01",} 0.0
http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/manage/health",le="0.05",} 0.0
http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/manage/health",le="0.1",} 0.0
http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/manage/health",le="0.2",} 1.0
http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/manage/health",le="0.4",} 1.0
http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/manage/health",le="0.6",} 1.0
http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/manage/health",le="0.8",} 1.0
http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/manage/health",le="1.0",} 1.0
http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/manage/health",le="1.5",} 1.0
http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/manage/health",le="2.0",} 1.0
http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/manage/health",le="2.5",} 1.0
http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/manage/health",le="3.0",} 1.0
http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/manage/health",le="6.0",} 1.0
http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/manage/health",le="+Inf",} 1.0
http_server_requests_seconds_count{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/manage/health",} 1.0
http_server_requests_seconds_sum{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/manage/health",} 0.112172393
# HELP http_server_requests_seconds_max  
# TYPE http_server_requests_seconds_max gauge
http_server_requests_seconds_max{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/manage/health",} 0.112172393

Non-working:

# HELP http_server_requests_seconds  
# TYPE http_server_requests_seconds summary
http_server_requests_seconds_count{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/manage/health",} 1.0
http_server_requests_seconds_sum{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/manage/health",} 0.055793335
# HELP http_server_requests_seconds_max  
# TYPE http_server_requests_seconds_max gauge
http_server_requests_seconds_max{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/manage/health",} 0.055793335

When both variants are scraped, they seem to cause some sort of conflict.

We have also observed that we are getting a large number of error logs from the collector pods about histogram samples:

ts=2021-12-17T12:47:01.168Z caller=log.go:124 component=gcm_exporter level=debug msg="building sample failed" err="no sample consumed for histogram"

This happens all the time, even when only the working set of services/metrics is being scraped, and yet everything still works; data seems to be ingested. It might not be related, but we felt it was worth mentioning.

I hope the description above is understandable. Any input on this problem would be greatly appreciated; we're a bit stuck!

Thanks in advance!

grzzboot · Dec 17, 2021

Thanks for the detailed report and the thorough investigation.

I think there are two orthogonal issues at play here.

  1. In the first set of metrics you pasted, there is a histogram-typed metric http_server_requests_seconds declared, but it then starts listing a bunch of quantile series, which belong to summary metrics. That's invalid, and our ingestion skips these series, which produces the logged error. You won't be able to query these quantiles. As this is simply invalid /metrics format, the only thing we could do on our end is add a one-off hack to support this particular case – but fixing the client would be preferable. Overall this is an odd case, as an instrumented histogram shouldn't have pre-computed quantiles to expose to begin with. Is the code instrumented with Micrometer via some Prometheus adapter?

  2. The issue causing you to not be able to query is that your second service also has a metric http_server_requests_seconds, but this time declared as a summary type. (Oddly enough, this one does not have the precomputed quantiles a summary would typically have.) This causes a problem in our query backend, as http_server_requests_seconds_sum and http_server_requests_seconds_count are ambiguous – they could refer to either the histogram or the summary. As our backend is strongly typed (unlike Prometheus), this causes a lookup conflict. We should be able to make this work with some additional logic in our backend. Until then, the only recommendation I can offer is either changing the metric name in the instrumentation (if possible) or using metric relabeling to manually rename the series for one of the services, as in the sketch below. If you are using managed collection, metric relabeling is not yet surfaced through PodMonitoring, but it will be with the next release.
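
For illustration, with self-deployed collection such a rename could look roughly like this metric_relabel_configs sketch in the scrape config of the summary-typed service (the replacement metric name is just an illustrative placeholder):

# Applied only to the job that scrapes the service exposing the summary variant.
metric_relabel_configs:
  - source_labels: [__name__]
    regex: http_server_requests_seconds(_count|_sum)
    target_label: __name__
    replacement: http_server_requests_summary_seconds${1}
    action: replace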

fabxc · Dec 17, 2021

Thank you very much for your quick answers!

  1. The first set of metrics is generated by a Spring Boot service with a configuration to publish the following percentiles: 0.60, 0.70, 0.80, 0.90, 0.95, 0.99, 0.999.

These numbers are just examples, of course; they could be anything really.

Computed non-aggregable percentiles, together with a percentiles histogram, are standard Spring Boot metric configuration options that you can use to obtain a gauge over the distribution of the observations alongside a histogram:

  • https://docs.spring.io/spring-boot/docs/current/reference/html/application-properties.html#application-properties.actuator.management.metrics.web.server.request.autotime.percentiles
  • https://docs.spring.io/spring-boot/docs/current/reference/html/application-properties.html#application-properties.actuator.management.metrics.web.server.request.autotime.percentiles-histogram

So these entries are the computed non-aggregable percentiles from the first set of metrics above. They are interpreted as a Gauge by the Prometheus instance we use today (and want to migrate away from):

http_server_requests_seconds{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/manage/health",quantile="0.6",} 0.109051904
http_server_requests_seconds{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/manage/health",quantile="0.7",} 0.109051904
http_server_requests_seconds{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/manage/health",quantile="0.8",} 0.109051904
http_server_requests_seconds{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/manage/health",quantile="0.9",} 0.109051904
http_server_requests_seconds{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/manage/health",quantile="0.95",} 0.109051904
http_server_requests_seconds{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/manage/health",quantile="0.99",} 0.109051904
http_server_requests_seconds{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/manage/health",quantile="0.999",} 0.109051904

While these are the percentiles-histogram-specific entries:

http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/manage/health",le="0.01",} 0.0
http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/manage/health",le="0.05",} 0.0
http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/manage/health",le="0.1",} 0.0
http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/manage/health",le="0.2",} 1.0
http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/manage/health",le="0.4",} 1.0
http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/manage/health",le="0.6",} 1.0
http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/manage/health",le="0.8",} 1.0
http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/manage/health",le="1.0",} 1.0
http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/manage/health",le="1.5",} 1.0
http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/manage/health",le="2.0",} 1.0
http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/manage/health",le="2.5",} 1.0
http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/manage/health",le="3.0",} 1.0
http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/manage/health",le="6.0",} 1.0
http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/manage/health",le="+Inf",} 1.0

This behaviour can be reproduced by creating a Demo Spring App using the Spring initializr: https://start.spring.io/#!type=maven-project&language=java&platformVersion=2.6.1&packaging=jar&jvmVersion=11&groupId=com.example&artifactId=demo&name=demo&description=Demo%20project%20for%20Spring%20Boot&packageName=com.example.demo&dependencies=actuator,web,prometheus

And then adding the following properties in the application.properties config:

management.metrics.web.server.request.autotime.percentiles=0.60, 0.70, 0.80, 0.90, 0.95, 0.99, 0.999
management.metrics.web.server.request.autotime.percentiles-histogram=true
management.endpoints.web.exposure.include=prometheus

  2. The second set of metrics is simply the same Spring Boot app, but without the config for percentiles and percentiles-histogram. application.properties then looks like this:
management.endpoints.web.exposure.include=prometheus

I think we can get around the query problem, at least for now, by simply making sure that all our services emit these metrics with a histogram instead of a mix where some do and some do not. After all, I think we want histograms for all of them; it just wasn't configured that way.

grzzboot · Dec 20, 2021

This has been fixed - histogram-type metrics will now intelligently ignore or merge transient (incorrect) conflicting definitions of scalar or summary metrics.

lyanco · Sep 21, 2022