
Prometheus scraping errors in injected consul-sidecar container

liad5h opened this issue on Dec 21, 2022 · 5 comments

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request. Searching for pre-existing feature requests helps us consolidate datapoints for identical requirements into a single place, thank you!
  • Please do not leave "+1" or other comments that do not add relevant new information or questions, they generate extra noise for issue followers and do not help prioritize the request.
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment.

Overview of the Issue

When I configure the Consul sidecar to run a merged metrics server, I see the following logs in the consul-sidecar container:

2022-12-21T08:22:50.757Z [INFO]  Command configuration: enable-service-registration=false service-config="" consul-binary=consul sync-period=10s log-level=info enable-metrics-merging=true merged-metrics-port=20100 service-metrics-port=80 service-metrics-path=/stats/prometheus
2022-12-21T08:22:50.757Z [INFO]  Metrics is enabled, creating merged metrics server.
2022-12-21T08:22:50.757Z [INFO]  Running merged metrics server.
2022-12-21T08:23:15.657Z [ERROR] Received non-2xx status code scraping service metrics: code=404 response=""
2022-12-21T08:24:15.657Z [ERROR] Received non-2xx status code scraping service metrics: code=404 response=""
2022-12-21T08:25:15.583Z [ERROR] Received non-2xx status code scraping service metrics: code=404 response=""

The error is logged every minute, on every Prometheus scrape.

By default, connectInject.metrics.defaultPrometheusScrapePath is set to /metrics. I tried setting it to /stats/prometheus per https://developer.hashicorp.com/consul/docs/v1.12.x/k8s/connect/observability/metrics, but the behavior is the same.

I believe the error appears because the merged metrics server (also?) scrapes port 80, the application port, which serves neither a /metrics nor a /stats/prometheus route.

Reproduction Steps

Inject the sidecar container into a pod and check the logs of the consul-sidecar container.
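To make the setup concrete, here is a minimal sketch of the kind of workload involved. The name and image are placeholders, not taken from my actual manifests; the relevant parts are the connect-inject annotation (connectInject.default is false in the values below) and an application container that listens on port 80 without serving a /metrics or /stats/prometheus route:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app                  # placeholder name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
      annotations:
        # Opt this pod into sidecar injection (connectInject.default is false)
        consul.hashicorp.com/connect-inject: "true"
    spec:
      containers:
        - name: app
          image: nginx:1.23          # placeholder image; plain HTTP on port 80, no metrics endpoint
          ports:
            - containerPort: 80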

Logs

Included in the issue description above.

Expected behavior

This error should not occur in this configuration, or at the very least the message should include more detail about what was scraped and why it failed.

Environment details

  • Chart version: 0.45.0 (Consul 1.12.2)
  • Kubernetes: AWS EKS 1.21

Helm Values
global:
  name: consul
  datacenter: us-east-1-qa
  enabled: false
  gossipEncryption:
    secretName: consul-gossip
    secretKey: key
  acls:
    manageSystemACLs: true
    bootstrapToken:
      secretName: consul-bootstrap-acl
      secretKey: token
  metrics:
    enabled: true
    enableAgentMetrics: true
    agentMetricsRetentionTime: "1m"
    defaultPrometheusScrapePath: "/metrics"
    enableGatewayMetrics: false
client:
  enabled: true
  grpc: true
  exposeGossipPorts: true
  resources:
    requests:
      memory: "100Mi"
      cpu: "10m"
    limits:
      memory: "250Mi"
      cpu: "100m"
  extraConfig: |
    {
      "telemetry": {
        "disable_hostname": true
      }
    }

#

server:
  enabled: true
  replicas: 3
  exposeGossipAndRPCPorts: false
  connect: true
  extraConfig: |
    {
      "performance": {
        "raft_multiplier": 1
      },
      "telemetry": {
        "disable_hostname": true
      }
    }
#

  resources:
    requests:
      memory: "1Gi"
      cpu: "250m"
    limits:
      memory: "2Gi"
      cpu: "1000m"

#

prometheus:
  enabled: false

connectInject:
  enabled: true
  replicas: 2
  default: false
  metrics:
    defaultEnabled: true
    defaultEnableMerging: true
    defaultPrometheusScrapePath: "/stats/prometheus"
  resources:
    requests:
      memory: "50Mi"
      cpu: "50m"
    limits:
      memory: "250Mi"
      cpu: "300m"
controller:
  enabled: true
ui:
  enabled: true
  metrics:
    enabled: true
    provider: "prometheus"
    baseURL: http://prom-prometheus-server.prometheus.svc.cluster.local:443
  service:
    type: ClusterIP

Additional Context

I tried working around this by adding the following annotations to my deployment:

consul.hashicorp.com/merged-metrics-port: "20100"
consul.hashicorp.com/service-metrics-path: /stats/prometheus

I also tried setting consul.hashicorp.com/prometheus-scrape-port: "20100", but that failed due to a port binding error.
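For reference, this is roughly where those annotations sit in the Deployment's pod template (fragment only; the rest of the Deployment is unchanged, and the values are the ones from the attempts above):

spec:
  template:
    metadata:
      annotations:
        consul.hashicorp.com/connect-inject: "true"
        consul.hashicorp.com/merged-metrics-port: "20100"
        consul.hashicorp.com/service-metrics-path: /stats/prometheus
        # Also attempted, but it failed with a port binding error:
        # consul.hashicorp.com/prometheus-scrape-port: "20100"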

liad5h · Dec 21, 2022

I am seeing the same problem as well... Is there any solution or workaround?

itaytalmi · Jan 27, 2023

Same here with chart versions 0.49.2/0.49.4. It could be an upgrade failure, but I don't know how to debug it.

junjie-landing · Feb 14, 2023

I get a similar error on Consul 1.14.4:

2023-03-15T18:23:30.641Z [ERROR] consul-dataplane.metrics: failed to scrape metrics: url=http://127.0.0.1:80/metrics error="status code 404"
2023/03/15 18:23:30 http: superfluous response.WriteHeader call from github.com/hashicorp/consul-dataplane/pkg/consuldp.(*metricsConfig).scrapeError (metrics.go:273)

It seems to happen because the actual application service doesn't serve Prometheus metrics at that endpoint. I don't mind the error in the sidecar itself; however, the merged metrics Prometheus endpoint also surfaces the same error (the [ERROR] and superfluous response.WriteHeader lines above), which messes up Prometheus scraping of the metrics that are available.

As a workaround, the annotation consul.hashicorp.com/enable-metrics-merging: "false" can be added to workloads that do not expose Prometheus metrics, or one of the other metrics annotations can be used to point at where your Prometheus metrics actually live.
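As an illustration, a sketch of the pod template annotations for a workload with no Prometheus endpoint. The commented-out alternative assumes the app does expose metrics on some other port and path; the port value 8080 is purely hypothetical, and the annotation names follow the service-metrics-path annotation mentioned earlier in this thread:

spec:
  template:
    metadata:
      annotations:
        consul.hashicorp.com/connect-inject: "true"
        # This workload serves no Prometheus metrics, so skip merging entirely
        consul.hashicorp.com/enable-metrics-merging: "false"
        # Alternatively, if the app does expose metrics elsewhere, point the merge at them instead:
        # consul.hashicorp.com/service-metrics-port: "8080"   # hypothetical port
        # consul.hashicorp.com/service-metrics-path: /metrics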

I think it would be nice if the merged metrics endpoint didn't append anything that isn't valid Prometheus output, though.

bbgobie · Mar 16, 2023

Out of curiosity, is this still an issue on our latest release branches, Consul K8s 1.2.x and Consul 1.16.x?

david-yu · Aug 25, 2023

Hi @david-yu. I'm revisiting this now as of Consul 1.16.1/Consul K8s 1.2.1, and this issue definitely persists. I'm sharing some of my findings here.

  1. In the sidecar container logs, I'm still getting the following error messages:
2023-08-28T15:37:10.425Z [ERROR] consul-dataplane.metrics: failed to scrape metrics: url=http://127.0.0.1:9090/metrics error="Get \"http://127.0.0.1:9090/metrics\": dial tcp 127.0.0.1:9090: connect: connection refused"
2023/08/28 15:37:10 http: superfluous response.WriteHeader call from github.com/hashicorp/consul-dataplane/pkg/consuldp.(*metricsConfig).scrapeError (metrics.go:289)
  2. Attempting to manually retrieve the metrics from one of the pods with a simple curl/wget request to http://$POD_IP_ADDR:20200/metrics, I can see the metrics are there. For example:
envoy_server_initialization_time_ms_bucket{local_cluster="public-api",consul_source_service="public-api",consul_source_namespace="default",consul_source_partition="default",consul_source_datacenter="dc1",le="30000"} 1
envoy_server_initialization_time_ms_bucket{local_cluster="public-api",consul_source_service="public-api",consul_source_namespace="default",consul_source_partition="default",consul_source_datacenter="dc1",le="60000"} 1
envoy_server_initialization_time_ms_bucket{local_cluster="public-api",consul_source_service="public-api",consul_source_namespace="default",consul_source_partition="default",consul_source_datacenter="dc1",le="300000"} 1
envoy_server_initialization_time_ms_bucket{local_cluster="public-api",consul_source_service="public-api",consul_source_namespace="default",consul_source_partition="default",consul_source_datacenter="dc1",le="600000"} 1
envoy_server_initialization_time_ms_bucket{local_cluster="public-api",consul_source_service="public-api",consul_source_namespace="default",consul_source_partition="default",consul_source_datacenter="dc1",le="1800000"} 1
envoy_server_initialization_time_ms_bucket{local_cluster="public-api",consul_source_service="public-api",consul_source_namespace="default",consul_source_partition="default",consul_source_datacenter="dc1",le="3600000"} 1
envoy_server_initialization_time_ms_bucket{local_cluster="public-api",consul_source_service="public-api",consul_source_namespace="default",consul_source_partition="default",consul_source_datacenter="dc1",le="+Inf"} 1
envoy_server_initialization_time_ms_sum{local_cluster="public-api",consul_source_service="public-api",consul_source_namespace="default",consul_source_partition="default",consul_source_datacenter="dc1"} 345
envoy_server_initialization_time_ms_count{local_cluster="public-api",consul_source_service="public-api",consul_source_namespace="default",consul_source_partition="default",consul_source_datacenter="dc1"} 1

However, the end of the output also contains this entry:

failed to scrape metrics at url "http://127.0.0.1:8080/metrics"

Prometheus is unable to parse that line. Prometheus does discover and list the targets, but the target status is strconv.ParseFloat: parsing "to": invalid syntax while parsing: "failed to", which makes sense given the entry above.

The above only occurs if the merged metrics feature is enabled.

itaytalmi · Aug 28, 2023