vmagent k8s target discovery is too slow

Open aluode99 opened this issue 9 months ago • 3 comments

Is your feature request related to a problem? Please describe

When many jobs are configured (about 100), vmagent discovers targets serially and very slowly, so it collects no data for more than half an hour.

image

Because each instance in the vmagent cluster must discover all scrape targets before sharding them, horizontal scaling does not solve the slow service discovery.
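To make the sharding point concrete, here is a minimal sketch of what job-level discovery sharding could look like: each cluster member hashes the job name and only discovers the jobs that map to its own index. The helpers `ownsJob` and `jobsForMember` are hypothetical and are not part of vmagent; the job names are made up for illustration, and 19 mirrors the member count reported later in this thread.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// ownsJob reports whether the cluster member with the given 0-based index
// is responsible for discovering jobName.
// Hypothetical helper for illustration; not a vmagent API.
func ownsJob(jobName string, memberNum, membersCount uint32) bool {
	h := fnv.New32a()
	h.Write([]byte(jobName)) // Write on an FNV hash never returns an error
	return h.Sum32()%membersCount == memberNum
}

// jobsForMember filters the full job list down to the jobs owned by one member.
func jobsForMember(jobs []string, memberNum, membersCount uint32) []string {
	var owned []string
	for _, j := range jobs {
		if ownsJob(j, memberNum, membersCount) {
			owned = append(owned, j)
		}
	}
	return owned
}

func main() {
	// Made-up job names standing in for roughly 100 configured jobs.
	jobs := make([]string, 0, 100)
	for i := 0; i < 100; i++ {
		jobs = append(jobs, fmt.Sprintf("job-%d", i))
	}
	const membersCount = 19
	for m := uint32(0); m < membersCount; m++ {
		owned := jobsForMember(jobs, m, membersCount)
		fmt.Printf("member %d discovers %d jobs\n", m, len(owned))
	}
}
```

One trade-off of this approach: jobs with very different target counts would produce uneven load across members, which target-level sharding avoids.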

Describe the solution you'd like

  • Can service discovery sharding be added to resolve the performance bottleneck in service discovery?
  • Can concurrent service discovery be added?
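The second suggestion, concurrent discovery, can be sketched as a bounded worker pattern: jobs are discovered in parallel, with a cap on how many discovery calls are in flight at once. This is an illustration only, not vmagent's implementation; `discoverAll`, `discoverFunc`, and the simulated 20 ms discovery delay are assumptions.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// discoverFunc stands in for the service discovery call of a single job.
type discoverFunc func(job string) []string

// discoverAll runs discover for every job with at most `concurrency`
// calls in flight and returns the targets found per job.
func discoverAll(jobs []string, concurrency int, discover discoverFunc) map[string][]string {
	if concurrency < 1 {
		concurrency = 1 // avoid blocking forever on an unbuffered semaphore
	}
	results := make(map[string][]string, len(jobs))
	var mu sync.Mutex
	var wg sync.WaitGroup
	sem := make(chan struct{}, concurrency)

	for _, job := range jobs {
		wg.Add(1)
		sem <- struct{}{} // acquire a slot; blocks while the pool is full
		go func(job string) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot
			targets := discover(job)
			mu.Lock()
			results[job] = targets
			mu.Unlock()
		}(job)
	}
	wg.Wait()
	return results
}

func main() {
	jobs := make([]string, 0, 100)
	for i := 0; i < 100; i++ {
		jobs = append(jobs, fmt.Sprintf("job-%d", i))
	}
	// Simulated discovery: each job takes 20 ms and yields one target.
	slow := func(job string) []string {
		time.Sleep(20 * time.Millisecond)
		return []string{job + "-target"}
	}
	start := time.Now()
	res := discoverAll(jobs, 10, slow)
	fmt.Printf("discovered %d jobs in %v\n", len(res), time.Since(start))
}
```

The cap matters in practice because unbounded parallel discovery could overload the Kubernetes API server.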

Describe alternatives you've considered

No response

Additional information

No response

aluode99 avatar May 14 '24 12:05 aluode99

Hey @aluode99, this log record appears during scrape config reloading. Could you please share info about CPU and memory usage? Which vmagent version are you using? Could you please describe the setup you're running it in?

AndrewChubatiuk avatar May 14 '24 14:05 AndrewChubatiuk

Hi @AndrewChubatiuk, thank you for your reply. The detailed configuration for vmagent is as follows:

  • version: v1.96.0
  • cpu: 18c
  • memory: 16G
  • cluster membersCount: 19
  • cluster replicationFactor: 1

CPU utilization rate:

image

The use case involves loading kubernetes_sd_configs through a sidecar and then triggering a vmagent reload to load the configuration. When the pod starts, kubernetes_sd_configs is empty, so service discovery takes 0 seconds. After the sidecar loads the configuration, vmagent reloads and service discovery takes 2061 seconds.

image

Because kubernetes_sd_configs is empty at startup, the startup process remains blocked at code checkpoint 1 and does not proceed to the reload process at checkpoint 2. As a result, vmagent does not load the configuration incrementally and activate scrape jobs gradually. Instead, it spends 2061 seconds discovering all targets before it starts scraping, which leaves a 2061-second gap with no data collection.
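The incremental behavior described above can be sketched as follows: even if discovery stays serial, each job's targets are handed to the scraper as soon as that job is discovered, rather than after every job has finished. This is a hypothetical illustration, not vmagent's code; `jobTargets` and `discoverIncrementally` are invented names.

```go
package main

import "fmt"

// jobTargets carries the discovery result of a single job.
type jobTargets struct {
	Job     string
	Targets []string
}

// discoverIncrementally discovers jobs one by one and emits each result
// on the returned channel as soon as it is ready. The channel is closed
// once all jobs have been processed.
func discoverIncrementally(jobs []string, discover func(string) []string) <-chan jobTargets {
	out := make(chan jobTargets)
	go func() {
		defer close(out)
		for _, job := range jobs {
			out <- jobTargets{Job: job, Targets: discover(job)}
		}
	}()
	return out
}

func main() {
	jobs := []string{"job-a", "job-b", "job-c"}
	// Stand-in for real service discovery: two made-up targets per job.
	discover := func(job string) []string {
		return []string{job + "-0", job + "-1"}
	}
	for jt := range discoverIncrementally(jobs, discover) {
		// Scraping for this job can begin here, before later jobs are discovered.
		fmt.Printf("start scraping %d targets for %s\n", len(jt.Targets), jt.Job)
	}
}
```

With this shape, the gap without data would be bounded by the discovery time of the slowest single job rather than by the sum over all jobs.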

9b10cb899a474bc56762d2ef430e0d03

aluode99 avatar May 15 '24 02:05 aluode99

How long does the next configuration update take after the initial one? Could you please share information about etcd and kube API request durations?

AndrewChubatiuk avatar May 15 '24 06:05 AndrewChubatiuk

image

I have compiled the duration of some reloads, with a total time of about 7 minutes. The shortest duration was 0.002 seconds, and the longest was 1.139 seconds. The detailed durations are as follows:

| count | time (s) |
|------|------|
| 30 | 0.002 |
| 32 | 0.003 |
|104 | 0.004 |
|206 | 0.005 |
|149 | 0.006 |
|131 | 0.007 |
|177 | 0.008 |
| 83 | 0.009 |
|152 | 0.010 |
| 83 | 0.011 |
| 60 | 0.012 |
| 75 | 0.013 |
|114 | 0.014 |
|124 | 0.015 |
| 88 | 0.016 |
| 72 | 0.017 |
| 96 | 0.018 |
|110 | 0.019 |
| 55 | 0.020 |
| 48 | 0.021 |
| 80 | 0.022 |
| 68 | 0.023 |
|106 | 0.024 |
|115 | 0.025 |
| 86 | 0.026 |
|113 | 0.027 |
| 64 | 0.028 |
| 88 | 0.029 |
|102 | 0.030 |
|100 | 0.031 |
|105 | 0.032 |
| 87 | 0.033 |
| 99 | 0.034 |
|116 | 0.035 |
| 81 | 0.036 |
| 61 | 0.037 |
| 74 | 0.038 |
| 59 | 0.039 |
| 58 | 0.040 |
| 67 | 0.041 |
| 69 | 0.042 |
| 66 | 0.043 |
| 74 | 0.044 |
| 72 | 0.045 |
| 62 | 0.046 |
| 66 | 0.047 |
| 74 | 0.048 |
| 35 | 0.049 |
| 44 | 0.050 |
| 36 | 0.051 |
| 44 | 0.052 |
| 44 | 0.053 |
| 33 | 0.054 |
| 44 | 0.055 |
| 39 | 0.056 |
| 44 | 0.057 |
| 41 | 0.058 |
| 48 | 0.059 |
| 40 | 0.060 |
| 36 | 0.061 |
| 29 | 0.062 |
| 32 | 0.063 |
| 28 | 0.064 |
| 23 | 0.065 |
| 29 | 0.066 |
| 41 | 0.067 |
| 31 | 0.068 |
| 22 | 0.069 |
| 40 | 0.070 |
| 25 | 0.071 |
| 30 | 0.072 |
| 33 | 0.073 |
| 27 | 0.074 |
| 41 | 0.075 |
| 33 | 0.076 |
| 30 | 0.077 |
| 15 | 0.078 |
| 35 | 0.079 |
| 22 | 0.080 |
| 23 | 0.081 |
| 16 | 0.082 |
| 16 | 0.083 |
| 15 | 0.084 |
| 24 | 0.085 |
| 24 | 0.086 |
| 22 | 0.087 |
| 20 | 0.088 |
| 27 | 0.089 |
| 28 | 0.090 |
| 23 | 0.091 |
| 22 | 0.092 |
| 20 | 0.093 |
| 12 | 0.094 |
| 12 | 0.095 |
| 11 | 0.096 |
| 11 | 0.097 |
|  8 | 0.098 |
| 11 | 0.099 |
|  6 | 0.100 |
|  6 | 0.101 |
| 10 | 0.102 |
| 17 | 0.103 |
| 15 | 0.104 |
| 14 | 0.105 |
| 12 | 0.106 |
|  7 | 0.107 |
| 14 | 0.108 |
| 11 | 0.109 |
|  8 | 0.110 |
|  7 | 0.111 |
|  2 | 0.112 |
|  5 | 0.113 |
|  3 | 0.114 |
|  5 | 0.115 |
|  5 | 0.116 |
|  3 | 0.117 |
|  7 | 0.118 |
|  4 | 0.119 |
|  6 | 0.120 |
|  3 | 0.121 |
|  6 | 0.122 |
|  5 | 0.123 |
|  8 | 0.124 |
|  7 | 0.125 |
|  4 | 0.126 |
|  9 | 0.127 |
|  4 | 0.128 |
|  9 | 0.129 |
|  6 | 0.130 |
|  8 | 0.131 |
|  5 | 0.132 |
| 14 | 0.133 |
|  9 | 0.134 |
|  7 | 0.135 |
|  6 | 0.136 |
|  8 | 0.137 |
|  5 | 0.138 |
|  8 | 0.139 |
|  6 | 0.140 |
|  5 | 0.141 |
|  8 | 0.142 |
|  7 | 0.143 |
|  6 | 0.144 |
|  5 | 0.145 |
|  4 | 0.146 |
|  5 | 0.147 |
|  5 | 0.148 |
|  4 | 0.149 |
|  7 | 0.150 |
|  8 | 0.151 |
|  3 | 0.152 |
|  2 | 0.153 |
|  1 | 0.154 |
|  2 | 0.155 |
|  3 | 0.156 |
|  6 | 0.157 |
|  3 | 0.158 |
|  6 | 0.159 |
|  4 | 0.160 |
|  3 | 0.161 |
|  5 | 0.162 |
|  4 | 0.164 |
|  3 | 0.165 |
|  4 | 0.166 |
|  3 | 0.168 |
|  2 | 0.169 |
|  1 | 0.170 |
|  2 | 0.171 |
|  1 | 0.174 |
|  1 | 0.175 |
|  1 | 0.176 |
|  1 | 0.178 |
|  1 | 0.181 |
|  1 | 0.183 |
|  2 | 0.184 |
|  1 | 0.185 |
|  1 | 0.193 |
|  2 | 0.199 |
|  1 | 0.206 |
|  1 | 0.207 |
|  2 | 0.211 |
|  1 | 0.213 |
|  1 | 0.214 |
|  1 | 0.227 |
|  1 | 0.256 |
|  1 | 0.257 |
|  1 | 0.269 |
|  1 | 0.586 |
|  1 | 0.605 |
|  1 | 0.617 |
|  1 | 0.785 |
|  1 | 0.803 |
|  1 | 1.085 |
|  1 | 1.108 |
|  1 | 1.139 |
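The table above is effectively a histogram of reload durations. As a rough sketch, the count-weighted total and mean could be derived from such rows like this. The `bucket` type and `summarize` function are hypothetical, and only four rows from the table are used as sample input.

```go
package main

import "fmt"

// bucket is one table row: how many reloads took the given duration.
type bucket struct {
	Count   int
	Seconds float64
}

// summarize returns the number of reloads, the count-weighted total
// duration in seconds, and the mean duration per reload.
func summarize(buckets []bucket) (reloads int, total, mean float64) {
	for _, b := range buckets {
		reloads += b.Count
		total += float64(b.Count) * b.Seconds
	}
	if reloads > 0 {
		mean = total / float64(reloads)
	}
	return reloads, total, mean
}

func main() {
	// Sample input: the first three rows and the last row of the table.
	buckets := []bucket{{30, 0.002}, {32, 0.003}, {104, 0.004}, {1, 1.139}}
	n, total, mean := summarize(buckets)
	fmt.Printf("reloads=%d total=%.3fs mean=%.4fs\n", n, total, mean)
}
```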

aluode99 avatar May 15 '24 13:05 aluode99