higress
When the number of Eureka services reaches the 10,000 range, controller and gateway CPU usage spikes to 100%+, and Prometheus cannot scrape metrics from the gateway
If you are reporting any crash or any potential security issue, do not open an issue in this repo. Please report the issue via ASRC(Alibaba Security Response Center) where the issue will be triaged appropriately.
- [x] I have searched the issues of this repository and believe that this is not a duplicate.
Ⅰ. Issue Description
When more than 10,000 services are registered in Eureka, gateway CPU usage reaches 100%+.
Ⅱ. Describe what happened
If there is an exception, please attach the exception trace:
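Monitoring shows that CPU usage of both the controller and the gateway reached 100%. The controller restarted (error: https://github.com/alibaba/higress/issues/1536), so the gateway logged Prometheus scrape failures as well as xDS disconnections. We tried raising the CPU allocation for the controller and the gateway (pilot from 2 cores to 8 cores, gateway from 4 cores to 12 cores), but the problem persisted. We then stopped Prometheus scraping entirely, and CPU usage was still very high.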
Ⅲ. Describe what you expected to happen
CPU usage stays at a normal level (no more than 50%?)
Ⅳ. How to reproduce it (as minimally and precisely as possible)
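- Deploy Higress v2.0.2 with Helm
- Enable Prometheus monitoring on the gateway
- Simulate registering 10,000 services in the Eureka registry
- Add the Eureka registry in the Higress console
- Prometheus scraping fails, the gateway reports errors, and the controller restarts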
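
For the "simulate registering 10,000 services" step, something like the following could be used. This is only a minimal sketch, not the script from the original report: it assumes a Spring Cloud style Eureka server reachable at `http://localhost:8761/eureka` and posts one instance per application to the standard `POST /eureka/apps/{appID}` REST endpoint. The `MOCK-SERVICE-*` names, host names, IPs, and port are made up for illustration.

```python
import json
import urllib.request

# Assumption: adjust to wherever your Eureka server is listening.
EUREKA_URL = "http://localhost:8761/eureka"


def build_instance(index: int) -> dict:
    """Build a registration payload for one mock service (hypothetical names)."""
    app = f"MOCK-SERVICE-{index:05d}"
    host = f"mock-{index:05d}.example.internal"
    ip = f"10.{(index >> 16) & 0xFF}.{(index >> 8) & 0xFF}.{index & 0xFF}"
    return {
        "instance": {
            "instanceId": f"{host}:{app}:8080",
            "hostName": host,
            "app": app,
            "ipAddr": ip,
            "vipAddress": app.lower(),
            "status": "UP",
            "port": {"$": 8080, "@enabled": "true"},
            "dataCenterInfo": {
                "@class": "com.netflix.appinfo.InstanceInfo$DefaultDataCenterInfo",
                "name": "MyOwn",
            },
        }
    }


def register(base_url: str, payload: dict, timeout: float = 5.0) -> int:
    """POST one instance to Eureka and return the HTTP status code."""
    app = payload["instance"]["app"]
    request = urllib.request.Request(
        f"{base_url}/apps/{app}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


def register_many(count: int, base_url: str = EUREKA_URL, send=register) -> int:
    """Register `count` mock services; return how many Eureka accepted (HTTP 204)."""
    accepted = 0
    for index in range(count):
        if send(base_url, build_instance(index)) == 204:
            accepted += 1
    return accepted


if __name__ == "__main__":
    print("registered:", register_many(10_000))
```

Note that Eureka evicts instances that stop sending heartbeats, so unless self-preservation mode keeps them, the mock instances would also need periodic `PUT /eureka/apps/{appID}/{instanceID}` renewals to stay registered for the duration of the test.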
Ⅴ. Anything else we need to know?
Ⅵ. Environment:
- Higress version: 2.0.2
- OS: Kylin V10 SP3
- Others: k8s v1.28, deployed with Helm