
vmagent: too many connections to APISERVER on AKS

Open prasadrajesh opened this issue 11 months ago • 17 comments

Describe the bug

I have on average fewer than 1500 endpoints to scrape in my prod environment, as shown in the attached screenshot.

When I use the netstat command inside the vmagent pod to count ESTABLISHED scrape connections, I get the expected value for everything except the kube-apiserver endpoint. In the output below the apiserver endpoint is excluded:

~ $ netstat -atn | grep ESTABLISHED | grep -v 172.18.0.1:443 | wc -l
1491

Counting only the ESTABLISHED connections to the kube-apiserver gives:

~ $ netstat -atn | grep ESTABLISHED | grep 172.18.0.1:443 | wc -l
1150

I also tried with promscrape.disableKeepAlive: false, but the result is the same. Now my question is: why does vmagent establish such a high number of connections to the kube-apiserver? It doesn't seem to make sense.

My cluster was built with a public kube-apiserver, and we are hitting a production issue with SNAT port exhaustion on AKS. As a result, our application is unable to serve traffic to clients because those ports are consumed by vmagent connecting to the public kube-apiserver.

For comparison, other services such as Prometheus do not create this many connections to the kube-apiserver.

To Reproduce

To enable monitoring via VMPodScrape and VMServiceScrape objects, I deploy vmagent with the VM operator using the config below.


apiVersion: operator.victoriametrics.com/v1beta1
kind: VMAgent
metadata:
  name: vmagent-1
  namespace: monitoring-system
spec:
  podScrapeNamespaceSelector: {}
  podScrapeSelector: {}
  serviceScrapeNamespaceSelector: {}
  serviceScrapeSelector: {}
  extraArgs:
    remoteWrite.forcePromProto: "true"
    envflag.enable: "true"
    envflag.prefix: "VM_"
    sortLabels: "true"
    promscrape.noStaleMarkers: "true"
    promscrape.maxScrapeSize: 200MB
    promscrape.kubernetesSDCheckInterval: 30s
    promscrape.cluster.replicationFactor: "2"
  containers:
    - name: config-reloader
      image: prometheus-operator/prometheus-config-reloader:v0.71.2
      resources:
        limits:
          cpu: "1"
          memory: "2Gi"
  securityContext:
    fsGroup: 65534
    runAsGroup: 65534
    runAsNonRoot: true
    runAsUser: 65534
  image:
    repository: victoriametrics/vmagent
    tag: "v1.93.12"

Version

v1.93.12

Logs

There are no errors on the vmagent pods. I can't paste the info logs publicly because they contain pod and namespace names.

Screenshots

No response

Used command-line flags

No response

Additional information

No response

prasadrajesh avatar Mar 14 '24 06:03 prasadrajesh

Hello @prasadrajesh! vmagent is supposed to expose the metric vm_promscrape_discovery_kubernetes_group_watchers, which shows the current number of watchers requesting updates from the k8s API. Each watcher can have up to 100 idle connections established. Could you please share the result of the sum(vm_promscrape_discovery_kubernetes_group_watchers) query over the same time interval as your screenshot?

hagen1778 avatar Mar 14 '24 10:03 hagen1778

Output of the query vm_promscrape_discovery_kubernetes_group_watchers: (screenshot attached)

Output of the query sum(vm_promscrape_discovery_kubernetes_group_watchers): (screenshot attached)

I think each watcher creates only one connection. But why do we need such a high number of watchers? Can't we stay with one connection instead of creating so many? The apiserver connections account for almost 40% of all scrape-endpoint connections.

prasadrajesh avatar Mar 14 '24 10:03 prasadrajesh

Can't we stay with one connection instead of creating so many? The apiserver connections account for almost 40% of all scrape-endpoint connections.

The watcher is created for each specific service-discovery config. It is basically defined by the combination of:

  • API server address
  • list of namespaces
  • list of selectors, plus proxy_url and settings like attach_metadata

You can see where a new watcher is created here: https://github.com/VictoriaMetrics/VictoriaMetrics/blob/a2ea8bc97b717fdcf48b9126015c9d6922f90a3d/lib/promscrape/discovery/kubernetes/api_watcher.go#L287-L307 (see the sketch below for how such a grouping key can be thought of).
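
For illustration only, here is a minimal Go sketch of how such a grouping key could be derived from those fields; the type and field names are hypothetical and do not mirror vmagent's internal code:

package main

import (
	"fmt"
	"strings"
)

// sdConfig lists the fields that identify a unique group watcher.
// The names are illustrative, not vmagent's actual types.
type sdConfig struct {
	APIServer      string
	Role           string
	Namespaces     []string
	Selectors      []string
	ProxyURL       string
	AttachMetadata bool
}

// groupKey builds a key so that SD configs sharing the same combination of
// apiserver address, namespaces, selectors, proxy and metadata settings
// reuse one watcher instead of creating a new one.
func groupKey(c sdConfig) string {
	return fmt.Sprintf("%s|%s|%s|%s|%s|%v",
		c.APIServer,
		c.Role,
		strings.Join(c.Namespaces, ","),
		strings.Join(c.Selectors, ","),
		c.ProxyURL,
		c.AttachMetadata,
	)
}

func main() {
	a := sdConfig{APIServer: "https://172.18.0.1:443", Role: "endpoints", Namespaces: []string{"test"}}
	b := sdConfig{APIServer: "https://172.18.0.1:443", Role: "endpoints", Namespaces: []string{"prod"}}
	// Different namespaces produce different keys, hence separate watchers.
	fmt.Println(groupKey(a) == groupKey(b)) // false
}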

So I'd suggest verifying how many SD configs you have per vmagent to understand how many unique watchers it could have created. And to check whether there is a bug in the vmagent code, could you verify how max(vm_promscrape_discovery_kubernetes_group_watchers) behaved over a longer time range of a couple of days?

hagen1778 avatar Mar 14 '24 11:03 hagen1778

Thanks @hagen1778 for the quick reply.

Isn't it possible to multiplex multiple watches over a single HTTP/2 connection? That way, even with 1000 watchers, I would have only one connection instead of one goroutine and one connection per watcher.
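
To illustrate what I mean (this is not how vmagent works today): in Go, an HTTP client with HTTP/2 enabled multiplexes concurrent requests to the same host as streams over a single TCP connection, so many watch streams could share one connection to the apiserver. A minimal sketch, with a placeholder URL:

package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
	"time"
)

func main() {
	// ForceAttemptHTTP2 lets the standard transport negotiate HTTP/2 via ALPN.
	// When the server (e.g. the kube-apiserver) supports h2, concurrent requests
	// to the same host are multiplexed as streams over one TCP connection.
	tr := &http.Transport{
		ForceAttemptHTTP2: true,
		TLSClientConfig:   &tls.Config{InsecureSkipVerify: true}, // demo only; verify the real CA in practice
		IdleConnTimeout:   90 * time.Second,
	}
	client := &http.Client{Transport: tr, Timeout: 10 * time.Second}

	// With HTTP/2 negotiated, many concurrent long-lived watch requests
	// would reuse this single connection instead of opening new ones.
	resp, err := client.Get("https://example.com/")
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("protocol:", resp.Proto) // prints "HTTP/2.0" when multiplexing is in effect
}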

prasadrajesh avatar Mar 14 '24 12:03 prasadrajesh

That wouldn't be a very good approach, taking into account that every SD config should be updated independently. We should expect that if you have 1k SD configs, all of them could be updated concurrently to deliver changes as fast as possible.

Btw, I've checked metrics from a couple of production vmagent installations - none of them has 1k watchers. It is usually below 20; only one case has 170 watchers. Are you using standard SD configs?

hagen1778 avatar Mar 14 '24 13:03 hagen1778

Answer to your question: I have only one SD config type, kubernetes_sd_configs, and the scrape targets are defined in vmagent.env.yaml.

Suggestion: my point was to enable multiplexing multiple watches over a single HTTP/2 connection, like prometheus-config-reloader does.

Workaround: to update the SD configs I am already using prometheus-config-reloader, and I was hoping the prometheus-config-reloader container (which runs inside the same pod as vmagent) would take responsibility for keeping vmagent.env.yaml updated instead of vmagent itself. But vmagent still takes responsibility for watching the APIs.

So, @hagen1778, how can I disable the API watch in vmagent and tell it to just follow vmagent.env.yaml (which is created and kept up to date by prometheus-config-reloader)?

prasadrajesh avatar Mar 18 '24 07:03 prasadrajesh

To update the SD configs I am already using prometheus-config-reloader, and I was hoping the prometheus-config-reloader container (which runs inside the same pod as vmagent) would take responsibility for keeping vmagent.env.yaml updated instead of vmagent itself. But vmagent still takes responsibility for watching the APIs.

I think prometheus-config-reloader just updates the configs for SD; it doesn't perform the service discovery itself. So an option for disabling the discovery won't work.

@Haleygo do you have any opinion on this? It looks related to https://github.com/VictoriaMetrics/operator/pull/267

hagen1778 avatar Mar 18 '24 13:03 hagen1778

Hello @prasadrajesh ,

How can I disable the API watch in vmagent and tell it to just follow vmagent.env.yaml

Like @hagen1778 mentioned, vmagent does just follow vmagent.env.yaml, which is updated by config-reloader, but it needs to watch the apiserver to dynamically discover the real targets, just like Prometheus does. For example, if you create a VMServiceScrape

apiVersion: operator.victoriametrics.com/v1beta1
kind: VMServiceScrape
metadata:
  name: vmalertmanager-t-victoria-metrics-k8s-stack
  namespace: test
spec:
  endpoints:
  - attach_metadata: {}
    path: /metrics
    port: http
  namespaceSelector: {}
  selector:
    matchExpressions:
    - key: operator.victoriametrics.com/additional-service
      operator: DoesNotExist
    matchLabels:
      app.kubernetes.io/component: monitoring
      app.kubernetes.io/instance: t-victoria-metrics-k8s-stack
      app.kubernetes.io/name: vmalertmanager
      managed-by: vm-operator

vm-operator will add a scrape job to vmagent's config file, which is mounted as a Secret by default:

- job_name: serviceScrape/test/vmalertmanager-t-victoria-metrics-k8s-stack/0
  honor_labels: false
  kubernetes_sd_configs:
  - role: endpoints
    namespaces:
      names:
      - test
  metrics_path: /metrics
  relabel_configs:
  - action: keep
    source_labels:
    - __meta_kubernetes_service_label_app_kubernetes_io_component
    regex: monitoring
  - action: keep
    source_labels:
    - __meta_kubernetes_service_label_app_kubernetes_io_instance
    regex: t-victoria-metrics-k8s-stack
  - action: keep
    source_labels:
    - __meta_kubernetes_service_label_app_kubernetes_io_name
    regex: vmalertmanager
  - target_label: endpoint
    replacement: http

Then config-reloader will detect the file change and reload the vmagent process. vmagent will create a new watcher with a URL like https://{{apiserver-address}}/api/v1/namespaces/test/endpoints?watch=1&allowWatchBookmarks=true&timeoutSeconds=xx and discover all the endpoints as scrape targets. So there will be at least one watch connection for each scrape job. If the job role is endpoints or endpointslice, there will be two more connections to watch the related pod and service resources.
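
For reference, a minimal Go sketch of what such a watch request looks like at the HTTP level. The apiserver address, namespace and token path are placeholders (the token and CA are normally mounted under /var/run/secrets/kubernetes.io/serviceaccount/); this is an illustration, not vmagent's actual implementation:

package main

import (
	"bufio"
	"crypto/tls"
	"fmt"
	"net/http"
	"os"
)

func main() {
	// Placeholder in-cluster values.
	apiserver := "https://172.18.0.1:443"
	token, _ := os.ReadFile("/var/run/secrets/kubernetes.io/serviceaccount/token")

	// One long-lived streaming request per watched resource; a group watcher
	// keeps such a connection open until the timeout and then re-establishes it.
	url := apiserver + "/api/v1/namespaces/test/endpoints?watch=1&allowWatchBookmarks=true&timeoutSeconds=300"
	req, _ := http.NewRequest("GET", url, nil)
	req.Header.Set("Authorization", "Bearer "+string(token))

	client := &http.Client{Transport: &http.Transport{
		TLSClientConfig: &tls.Config{InsecureSkipVerify: true}, // demo only; load the cluster CA instead
	}}
	resp, err := client.Do(req)
	if err != nil {
		fmt.Println("watch failed:", err)
		return
	}
	defer resp.Body.Close()

	// The apiserver streams one JSON watch event per line for the lifetime of the request.
	sc := bufio.NewScanner(resp.Body)
	for sc.Scan() {
		fmt.Println(sc.Text())
	}
}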

Could you share how many scrape jobs are in your scrape config, and what vm_promscrape_discovery_kubernetes_url_watchers shows? Does it go down when you restart vmagent?

Haleygo avatar Mar 21 '24 16:03 Haleygo

@Haleygo do you have any opinion on this? It looks related to https://github.com/VictoriaMetrics/operator/pull/267

@hagen1778 No, it's not related, since https://github.com/VictoriaMetrics/operator/pull/267 is about reducing pressure on the apiserver from the operator when generating scrape configs.

Haleygo avatar Mar 21 '24 16:03 Haleygo

Thanks for the explanation @Haleygo! Do you think there is room for improvement as @prasadrajesh mentioned?

hagen1778 avatar Mar 22 '24 07:03 hagen1778

I wrote a service that converts HTTP/1.1 requests to HTTP/2 (so, by default, 100 connections are collapsed into 1). I put that service between vmagent and the apiserver, and it fixed my issue.
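
I can't share the service itself, but a minimal Go sketch of the same idea could look like the following: a small reverse proxy that accepts vmagent's HTTP/1.1 connections and forwards them over an HTTP/2-capable transport, which multiplexes them onto a handful of TCP connections to the apiserver. The apiserver address, listen address and TLS handling below are placeholders:

package main

import (
	"crypto/tls"
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
)

func main() {
	// Placeholder apiserver address; point vmagent at this proxy instead.
	target, err := url.Parse("https://172.18.0.1:443")
	if err != nil {
		log.Fatal(err)
	}

	proxy := httputil.NewSingleHostReverseProxy(target)
	// The outgoing transport negotiates HTTP/2, so the many HTTP/1.1
	// connections accepted from vmagent are multiplexed as streams over
	// a small number of TCP connections to the apiserver.
	proxy.Transport = &http.Transport{
		ForceAttemptHTTP2: true,
		TLSClientConfig:   &tls.Config{InsecureSkipVerify: true}, // demo only; verify the apiserver CA in practice
	}

	// vmagent connects here over plain HTTP/1.1 inside the cluster network.
	log.Fatal(http.ListenAndServe("127.0.0.1:8443", proxy))
}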

prasadrajesh avatar Apr 15 '24 16:04 prasadrajesh

Is there any update on this? We are currently running into the same issue (SNAT port exhaustion) on our public-API AKS cluster. For now we added a NAT Gateway to mitigate it, but that's quite expensive.

(screenshots attached)

Mahagon avatar Aug 21 '24 12:08 Mahagon