consul-k8s
Prometheus scraping errors in injected consul-sidecar container
Community Note
- Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request. Searching for pre-existing feature requests helps us consolidate datapoints for identical requirements into a single place, thank you!
- Please do not leave "+1" or other comments that do not add relevant new information or questions, they generate extra noise for issue followers and do not help prioritize the request.
- If you are interested in working on this issue or have submitted a pull request, please leave a comment.
Overview of the Issue
When I configure the Consul sidecar to run a merged metrics server, I see the following logs in the consul-sidecar container:
2022-12-21T08:22:50.757Z [INFO] Command configuration: enable-service-registration=false service-config="" consul-binary=consul sync-period=10s log-level=info enable-metrics-merging=true merged-metrics-port=20100 service-metrics-port=80 service-metrics-path=/stats/prometheus
2022-12-21T08:22:50.757Z [INFO] Metrics is enabled, creating merged metrics server.
2022-12-21T08:22:50.757Z [INFO] Running merged metrics server.
2022-12-21T08:23:15.657Z [ERROR] Received non-2xx status code scraping service metrics: code=404 response=""
2022-12-21T08:24:15.657Z [ERROR] Received non-2xx status code scraping service metrics: code=404 response=""
2022-12-21T08:25:15.583Z [ERROR] Received non-2xx status code scraping service metrics: code=404 response=""
The error is logged every minute, on each Prometheus scrape.
By default, connectInject.metrics.defaultPrometheusScrapePath is set to /metrics. I tried setting it to /stats/prometheus according to https://developer.hashicorp.com/consul/docs/v1.12.x/k8s/connect/observability/metrics, but I still get the same behavior.
I believe this error appears because scraping is (also?) done on port 80 (the application port), which serves neither a /metrics nor a /stats/prometheus route.
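If it helps, here is a sketch of the per-pod annotations that, as far as I understand, should point the merged metrics server at the service's own metrics endpoint instead of the application port (the 9102 value is only a placeholder for whatever port the app actually serves metrics on):
# Sketch only: 9102 is a placeholder for the port the application really exposes metrics on.
spec:
  template:
    metadata:
      annotations:
        "consul.hashicorp.com/enable-metrics": "true"
        "consul.hashicorp.com/enable-metrics-merging": "true"
        "consul.hashicorp.com/service-metrics-port": "9102"
        "consul.hashicorp.com/service-metrics-path": "/stats/prometheus"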
Reproduction Steps
Inject the sidecar container into a pod and check the logs of the consul-sidecar container.
Logs
Included in the issue description above.
Expected behavior
This error should be addressed, and ideally the message should be more detailed.
Environment details
- Chart version: 0.45.0 (Consul 1.12.2)
- AWS EKS 1.21
Helm Values
global:
  name: consul
  datacenter: us-east-1-qa
  enabled: false
  gossipEncryption:
    secretName: consul-gossip
    secretKey: key
  acls:
    manageSystemACLs: true
    bootstrapToken:
      secretName: consul-bootstrap-acl
      secretKey: token
  metrics:
    enabled: true
    enableAgentMetrics: true
    agentMetricsRetentionTime: "1m"
    defaultPrometheusScrapePath: "/metrics"
    enableGatewayMetrics: false

client:
  enabled: true
  grpc: true
  exposeGossipPorts: true
  resources:
    requests:
      memory: "100Mi"
      cpu: "10m"
    limits:
      memory: "250Mi"
      cpu: "100m"
  extraConfig: |
    {
      "telemetry": {
        "disable_hostname": true
      }
    }

server:
  enabled: true
  replicas: 3
  exposeGossipAndRPCPorts: false
  connect: true
  extraConfig: |
    {
      "performance": {
        "raft_multiplier": 1
      },
      "telemetry": {
        "disable_hostname": true
      }
    }
  resources:
    requests:
      memory: "1Gi"
      cpu: "250m"
    limits:
      memory: "2Gi"
      cpu: "1000m"

prometheus:
  enabled: false

connectInject:
  enabled: true
  replicas: 2
  default: false
  metrics:
    defaultEnabled: true
    defaultEnableMerging: true
    defaultPrometheusScrapePath: "/stats/prometheus"
  resources:
    requests:
      memory: "50Mi"
      cpu: "50m"
    limits:
      memory: "250Mi"
      cpu: "300m"

controller:
  enabled: true

ui:
  enabled: true
  metrics:
    enabled: true
    provider: "prometheus"
    baseURL: http://prom-prometheus-server.prometheus.svc.cluster.local:443
  service:
    type: ClusterIP
Additional Context
I tried working around this by adding the following annotations to my deployment:
consul.hashicorp.com/merged-metrics-port: "20100"
consul.hashicorp.com/service-metrics-path: /stats/prometheus
I also tried setting consul.hashicorp.com/prometheus-scrape-port: "20100", but that failed due to a port binding error.
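For reference, a sketch of where those annotations sit in my deployment (the deployment name is a placeholder); they go on the pod template, not on the Deployment metadata:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app   # placeholder
spec:
  template:
    metadata:
      annotations:
        "consul.hashicorp.com/merged-metrics-port": "20100"
        "consul.hashicorp.com/service-metrics-path": "/stats/prometheus"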
I am seeing the same problem as well... Is there any solution or workaround?
Same here with 0.49.2/0.49.4. It could be some upgrade failure, but I don't know how to debug it.
I get a similar error on Consul 1.14.4
2023-03-15T18:23:30.641Z [ERROR] consul-dataplane.metrics: failed to scrape metrics: url=http://127.0.0.1:80/metrics error="status code 404"
2023/03/15 18:23:30 http: superfluous response.WriteHeader call from github.com/hashicorp/consul-dataplane/pkg/consuldp.(*metricsConfig).scrapeError (metrics.go:273)
It seems to be because the actual application service doesn't provide Prometheus metrics at that endpoint.
I don't mind the error in the sidecar; however, the merged metrics Prometheus endpoint throws up an error:
2023-03-15T18:23:30.641Z [ERROR] consul-dataplane.metrics: failed to scrape metrics: url=http://127.0.0.1:80/metrics error="status code 404"
2023/03/15 18:23:30 http: superfluous response.WriteHeader call from github.com/hashicorp/consul-dataplane/pkg/consuldp.(*metricsConfig).scrapeError (metrics.go:273)
This seems to mess up Prometheus scraping of the metrics that are available.
The annotation consul.hashicorp.com/enable-metrics-merging: "false" can be added to workloads that do not provide Prometheus metrics, or one of the other annotations can be used to point to where your Prometheus metrics actually are.
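Roughly, that looks like this on the pod template of the workload that has no Prometheus endpoint (just a sketch):
spec:
  template:
    metadata:
      annotations:
        "consul.hashicorp.com/enable-metrics-merging": "false"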
I think it would be nice if the merged metrics endpoint didn't append anything that isn't valid Prometheus output, though.
Out of curiosity, is this still an issue on our latest release branches, Consul K8s 1.2.x and Consul 1.16.x?
Hi @david-yu. I'm revisiting this now as of Consul 1.16.1/Consul K8s 1.2.1, and this issue definitely persists. I'm sharing some of my findings here.
- In the sidecar container logs, I'm still getting the following error messages:
2023-08-28T15:37:10.425Z [ERROR] consul-dataplane.metrics: failed to scrape metrics: url=http://127.0.0.1:9090/metrics error="Get \"http://127.0.0.1:9090/metrics\": dial tcp 127.0.0.1:9090: connect: connection refused"
2023/08/28 15:37:10 http: superfluous response.WriteHeader call from github.com/hashicorp/consul-dataplane/pkg/consuldp.(*metricsConfig).scrapeError (metrics.go:289)
- Manually retrieving the metrics from one of the pods with a simple curl/wget request to http://$POD_IP_ADDR:20200/metrics, I can see the metrics are there. For example:
envoy_server_initialization_time_ms_bucket{local_cluster="public-api",consul_source_service="public-api",consul_source_namespace="default",consul_source_partition="default",consul_source_datacenter="dc1",le="30000"} 1
envoy_server_initialization_time_ms_bucket{local_cluster="public-api",consul_source_service="public-api",consul_source_namespace="default",consul_source_partition="default",consul_source_datacenter="dc1",le="60000"} 1
envoy_server_initialization_time_ms_bucket{local_cluster="public-api",consul_source_service="public-api",consul_source_namespace="default",consul_source_partition="default",consul_source_datacenter="dc1",le="300000"} 1
envoy_server_initialization_time_ms_bucket{local_cluster="public-api",consul_source_service="public-api",consul_source_namespace="default",consul_source_partition="default",consul_source_datacenter="dc1",le="600000"} 1
envoy_server_initialization_time_ms_bucket{local_cluster="public-api",consul_source_service="public-api",consul_source_namespace="default",consul_source_partition="default",consul_source_datacenter="dc1",le="1800000"} 1
envoy_server_initialization_time_ms_bucket{local_cluster="public-api",consul_source_service="public-api",consul_source_namespace="default",consul_source_partition="default",consul_source_datacenter="dc1",le="3600000"} 1
envoy_server_initialization_time_ms_bucket{local_cluster="public-api",consul_source_service="public-api",consul_source_namespace="default",consul_source_partition="default",consul_source_datacenter="dc1",le="+Inf"} 1
envoy_server_initialization_time_ms_sum{local_cluster="public-api",consul_source_service="public-api",consul_source_namespace="default",consul_source_partition="default",consul_source_datacenter="dc1"} 345
envoy_server_initialization_time_ms_count{local_cluster="public-api",consul_source_service="public-api",consul_source_namespace="default",consul_source_partition="default",consul_source_datacenter="dc1"} 1
However, the end of the output also contains this entry:
failed to scrape metrics at url "http://127.0.0.1:8080/metrics"
This line Prometheus is unable to parse. Prometheus discovers and lists the targets, but the target status is strconv.ParseFloat: parsing "to": invalid syntax while parsing: "failed to", which makes sense given the above entry.
The above only occurs if the merged metrics feature is enabled.