consul-k8s

K8s Prometheus deployed inside the Consul service mesh gets connection refused on all outbound connections after a random amount of time

Open codex70 opened this issue 8 months ago • 0 comments

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request. Searching for pre-existing feature requests helps us consolidate datapoints for identical requirements into a single place, thank you!
  • Please do not leave "+1" or other comments that do not add relevant new information or questions, they generate extra noise for issue followers and do not help prioritize the request.
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment.

Overview of the Issue

I have deployed Prometheus and Grafana using the kube-prometheus-stack Helm chart. Consul is also deployed using its standard Helm chart. Connect inject is enabled but not set as the default. The Prometheus pod is deployed inside the service mesh with transparent proxy enabled. By default, Prometheus connects to a number of pods and services, some inside the service mesh and some outside.

When Prometheus starts, everything works as expected; then, after a random amount of time, all connections that go through the transparent proxy return a connection refused error.

Interestingly, if I exclude certain outbound ports, connections over those ports work correctly without issue. I also had to exclude certain inbound ports for Prometheus to work at all.

Reproduction Steps

  1. When running helm install with the following values.yml:
  global:
    name: consul
    metrics:
      enabled: true
    tls:
      enabled: true
      enableAutoEncrypt: true
      verify: true
    gossipEncryption:
      secretName: consul-gossip-encryption-key
      secretKey: key
    federation:
      enabled: true
      createFederationSecret: true
    acls:
      manageSystemACLs: true
      createReplicationToken: true

  server:
    replicas: 3
    securityContext:
      runAsNonRoot: false
      runAsUser: 0

  connectInject:
    enabled: true
    default: false
  meshGateway:
    enabled: true

  syncCatalog:
    enabled: true
    default: true
    toConsul: true
    toK8S: true
    syncClusterIPServices: false
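
  The values above would typically be applied with commands along these lines (a sketch; the release name, namespace, and values file name are assumptions, not taken from the report):

  ```shell
  # Assumed commands; release/namespace names are illustrative.
  helm repo add hashicorp https://helm.releases.hashicorp.com
  helm install consul hashicorp/consul \
    --namespace consul --create-namespace \
    --values values.yml
  ```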
  2. This is the relevant section of the values.yml for Prometheus:
  prometheus:
    prometheusSpec:
      podMetadata:
        annotations:
          consul.hashicorp.com/connect-service: "prometheus-grafana-kube-pr-prometheus"
          consul.hashicorp.com/transparent-proxy-exclude-outbound-ports: "9093"
          consul.hashicorp.com/connect-inject: "true"
          consul.hashicorp.com/transparent-proxy-exclude-inbound-ports: "8080,9090,10901,10902"
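
  The kube-prometheus-stack values would be applied with something like the following (a sketch; the release name `prometheus-grafana` is inferred from the service name `prometheus-grafana-kube-pr-prometheus` above, and the values file name is an assumption):

  ```shell
  # Assumed commands; release and file names are illustrative.
  helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
  helm upgrade --install prometheus-grafana \
    prometheus-community/kube-prometheus-stack \
    --values prometheus-values.yml
  ```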

Logs

When viewing the targets in Prometheus, all endpoints that are accessed over the transparent proxy are down: kubernetes-pods (0/19 up)

Each endpoint has the following error: Get "http://10.X.X.X:XXX/metrics": dial tcp 10.X.X.X:XXX: connect: connection refused
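One way to narrow down whether the refusal comes from the Envoy sidecar or from the target itself is to retry a scrape target from inside the Prometheus pod. This is a diagnostic sketch only: the pod name and target address are placeholders, and it assumes a shell and `wget` are available in the container image.

```shell
# Placeholder pod name and address; substitute a real target from the
# Prometheus targets page. If this fails while the same request from a
# non-mesh pod succeeds, the transparent proxy is rejecting the traffic.
kubectl exec -it prometheus-prometheus-grafana-kube-pr-prometheus-0 \
  -c prometheus -- \
  wget -qO- --timeout=5 http://10.X.X.X:XXX/metrics
```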

Expected behavior

Prometheus connections should not degrade over time and should remain accessible.

Environment details

Azure AKS latest version

Additional Context

codex70 · Jun 17 '24 13:06