VictoriaMetrics
vmagent k8s target discovery is too slow
Is your feature request related to a problem? Please describe
When many jobs are configured (about 100), vmagent discovers targets serially and very slowly, so it collects no data for more than half an hour.
Because each instance in the vmagent cluster must discover all scrape targets before sharding them, horizontal scaling does not solve the slow service discovery.
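To illustrate the point about sharding, here is a minimal sketch of hash-based target sharding. It is not vmagent's actual code: the hash function and the names `owns_target` and `discover_then_shard` are hypothetical. It only shows why every member needs the full target list before it can keep its own share.

```python
import hashlib


def owns_target(target_key: str, members_count: int, member_num: int) -> bool:
    """Return True if the member with index member_num should scrape target_key.

    Assumption: the shard is a stable hash of the target key modulo the
    number of cluster members. The real hashing scheme may differ.
    """
    digest = hashlib.sha256(target_key.encode("utf-8")).digest()
    shard = int.from_bytes(digest[:8], "big") % members_count
    return shard == member_num


def discover_then_shard(all_targets, members_count, member_num):
    # all_targets is the result of the full (slow) discovery step.
    # Filtering happens only afterwards, so adding members does not
    # reduce the discovery work done by each member.
    return [t for t in all_targets if owns_target(t, members_count, member_num)]
```

With 19 members, each member would still pay for discovering every target and then keep roughly 1/19 of them.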
Describe the solution you'd like
- Can service discovery sharding be added to resolve the performance bottleneck in service discovery?
- Can concurrent service discovery be added?
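As a rough sketch of what the second request means, the snippet below runs per-job discovery through a bounded worker pool instead of one job after another. The function `discover_job` is a hypothetical stand-in for whatever performs discovery for a single job; nothing here reflects vmagent internals.

```python
from concurrent.futures import ThreadPoolExecutor


def discover_all(jobs, discover_job, max_workers=8):
    """Run discover_job(job) for every job concurrently.

    jobs: list of hashable job names.
    discover_job: hypothetical callable returning the targets for one job.
    Returns a dict mapping each job to its discovered targets.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as jobs, so zip is safe.
        results = pool.map(discover_job, jobs)
        return dict(zip(jobs, results))
```

With about 100 jobs, total discovery time would then be bounded by the slowest jobs and the pool size rather than by the sum of all per-job durations, assuming the Kubernetes API server can absorb the parallel requests.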
Describe alternatives you've considered
No response
Additional information
No response
Hey @aluode99, this log record appears during scrape config reloading. Could you please share information about CPU and memory usage? Which vmagent version are you using? Could you also describe the setup you are running it in?
Hi @AndrewChubatiuk, thank you for your reply. The detailed configuration for vmagent is as follows:
version: v1.96.0
cpu: 18c
memory: 16G
cluster membersCount: 19
cluster replicationFactor: 1
CPU utilization rate:
The use case involves loading kubernetes_sd_configs through a sidecar and then triggering a vmagent reload to apply the configuration. When the pod starts, kubernetes_sd_configs is empty, so service discovery takes 0 seconds. After the sidecar loads the configuration, vmagent reloads and service discovery takes 2061 seconds.
Due to the kubernetes_sd_configs being empty at startup, the startup process remains blocked at the code checkpoint 1 and does not proceed to the reload process at checkpoint 2. As a result, vmagent does not incrementally load the configuration to gradually activate the collection tasks. Instead, it spends 2061 seconds to complete the discovery of all targets before beginning the collection tasks, leading to a 2061-second period without data collection.
How long does the next configuration update take after the initial one? Could you please share information about etcd and Kubernetes API request durations?
I have compiled the duration of some reloads, with a total time of about 7 minutes. The shortest duration was 0.002 seconds, and the longest was 1.139 seconds. The detailed durations are as follows:
| count | time (s) |
|-------|----------|
| 30 | 0.002 |
| 32 | 0.003 |
|104 | 0.004 |
|206 | 0.005 |
|149 | 0.006 |
|131 | 0.007 |
|177 | 0.008 |
| 83 | 0.009 |
|152 | 0.010 |
| 83 | 0.011 |
| 60 | 0.012 |
| 75 | 0.013 |
|114 | 0.014 |
|124 | 0.015 |
| 88 | 0.016 |
| 72 | 0.017 |
| 96 | 0.018 |
|110 | 0.019 |
| 55 | 0.020 |
| 48 | 0.021 |
| 80 | 0.022 |
| 68 | 0.023 |
|106 | 0.024 |
|115 | 0.025 |
| 86 | 0.026 |
|113 | 0.027 |
| 64 | 0.028 |
| 88 | 0.029 |
|102 | 0.030 |
|100 | 0.031 |
|105 | 0.032 |
| 87 | 0.033 |
| 99 | 0.034 |
|116 | 0.035 |
| 81 | 0.036 |
| 61 | 0.037 |
| 74 | 0.038 |
| 59 | 0.039 |
| 58 | 0.040 |
| 67 | 0.041 |
| 69 | 0.042 |
| 66 | 0.043 |
| 74 | 0.044 |
| 72 | 0.045 |
| 62 | 0.046 |
| 66 | 0.047 |
| 74 | 0.048 |
| 35 | 0.049 |
| 44 | 0.050 |
| 36 | 0.051 |
| 44 | 0.052 |
| 44 | 0.053 |
| 33 | 0.054 |
| 44 | 0.055 |
| 39 | 0.056 |
| 44 | 0.057 |
| 41 | 0.058 |
| 48 | 0.059 |
| 40 | 0.060 |
| 36 | 0.061 |
| 29 | 0.062 |
| 32 | 0.063 |
| 28 | 0.064 |
| 23 | 0.065 |
| 29 | 0.066 |
| 41 | 0.067 |
| 31 | 0.068 |
| 22 | 0.069 |
| 40 | 0.070 |
| 25 | 0.071 |
| 30 | 0.072 |
| 33 | 0.073 |
| 27 | 0.074 |
| 41 | 0.075 |
| 33 | 0.076 |
| 30 | 0.077 |
| 15 | 0.078 |
| 35 | 0.079 |
| 22 | 0.080 |
| 23 | 0.081 |
| 16 | 0.082 |
| 16 | 0.083 |
| 15 | 0.084 |
| 24 | 0.085 |
| 24 | 0.086 |
| 22 | 0.087 |
| 20 | 0.088 |
| 27 | 0.089 |
| 28 | 0.090 |
| 23 | 0.091 |
| 22 | 0.092 |
| 20 | 0.093 |
| 12 | 0.094 |
| 12 | 0.095 |
| 11 | 0.096 |
| 11 | 0.097 |
| 8 | 0.098 |
| 11 | 0.099 |
| 6 | 0.100 |
| 6 | 0.101 |
| 10 | 0.102 |
| 17 | 0.103 |
| 15 | 0.104 |
| 14 | 0.105 |
| 12 | 0.106 |
| 7 | 0.107 |
| 14 | 0.108 |
| 11 | 0.109 |
| 8 | 0.110 |
| 7 | 0.111 |
| 2 | 0.112 |
| 5 | 0.113 |
| 3 | 0.114 |
| 5 | 0.115 |
| 5 | 0.116 |
| 3 | 0.117 |
| 7 | 0.118 |
| 4 | 0.119 |
| 6 | 0.120 |
| 3 | 0.121 |
| 6 | 0.122 |
| 5 | 0.123 |
| 8 | 0.124 |
| 7 | 0.125 |
| 4 | 0.126 |
| 9 | 0.127 |
| 4 | 0.128 |
| 9 | 0.129 |
| 6 | 0.130 |
| 8 | 0.131 |
| 5 | 0.132 |
| 14 | 0.133 |
| 9 | 0.134 |
| 7 | 0.135 |
| 6 | 0.136 |
| 8 | 0.137 |
| 5 | 0.138 |
| 8 | 0.139 |
| 6 | 0.140 |
| 5 | 0.141 |
| 8 | 0.142 |
| 7 | 0.143 |
| 6 | 0.144 |
| 5 | 0.145 |
| 4 | 0.146 |
| 5 | 0.147 |
| 5 | 0.148 |
| 4 | 0.149 |
| 7 | 0.150 |
| 8 | 0.151 |
| 3 | 0.152 |
| 2 | 0.153 |
| 1 | 0.154 |
| 2 | 0.155 |
| 3 | 0.156 |
| 6 | 0.157 |
| 3 | 0.158 |
| 6 | 0.159 |
| 4 | 0.160 |
| 3 | 0.161 |
| 5 | 0.162 |
| 4 | 0.164 |
| 3 | 0.165 |
| 4 | 0.166 |
| 3 | 0.168 |
| 2 | 0.169 |
| 1 | 0.170 |
| 2 | 0.171 |
| 1 | 0.174 |
| 1 | 0.175 |
| 1 | 0.176 |
| 1 | 0.178 |
| 1 | 0.181 |
| 1 | 0.183 |
| 2 | 0.184 |
| 1 | 0.185 |
| 1 | 0.193 |
| 2 | 0.199 |
| 1 | 0.206 |
| 1 | 0.207 |
| 2 | 0.211 |
| 1 | 0.213 |
| 1 | 0.214 |
| 1 | 0.227 |
| 1 | 0.256 |
| 1 | 0.257 |
| 1 | 0.269 |
| 1 | 0.586 |
| 1 | 0.605 |
| 1 | 0.617 |
| 1 | 0.785 |
| 1 | 0.803 |
| 1 | 1.085 |
| 1 | 1.108 |
| 1 | 1.139 |
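For anyone who wants to summarize a histogram like this one, a small sketch follows. The helper names `summarize` and `percentile` are hypothetical, and `tail_rows` holds only the last five rows of the table as sample input; the full table can be substituted in the same (count, seconds) form.

```python
def summarize(rows):
    """Return (total_count, total_seconds) for a list of (count, seconds) pairs."""
    total_count = sum(count for count, _ in rows)
    total_seconds = sum(count * seconds for count, seconds in rows)
    return total_count, total_seconds


def percentile(rows, q):
    """Smallest duration such that at least a fraction q (0..1) of samples are <= it."""
    ordered = sorted(rows, key=lambda row: row[1])
    total = sum(count for count, _ in ordered)
    seen = 0
    for count, seconds in ordered:
        seen += count
        if seen >= q * total:
            return seconds
    raise ValueError("rows is empty")


# Sample input: the five slowest reloads from the table above.
tail_rows = [(1, 0.785), (1, 0.803), (1, 1.085), (1, 1.108), (1, 1.139)]
count, seconds = summarize(tail_rows)
print(f"{count} reloads, {seconds:.3f} s in total")
```

Running `percentile` over the full table with `q=0.99` would show how far the tail of the distribution sits from the typical reload duration.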