higress: When the number of Eureka services reaches the 10,000 level, controller and gateway CPU spikes above 100% and Prometheus cannot scrape gateway metrics

Open lcfang opened this issue 2 months ago • 4 comments

If you are reporting any crash or any potential security issue, do not open an issue in this repo. Please report the issue via ASRC(Alibaba Security Response Center) where the issue will be triaged appropriately.

[x] I have searched the issues of this repository and believe that this is not a duplicate.

Ⅰ. Issue Description

When more than 10,000 services are registered in Eureka, gateway CPU usage exceeds 100%.

Ⅱ. Describe what happened

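If there is an exception, please attach the exception trace: Monitoring shows that CPU usage of both the controller and the gateway reached 100%. The controller restarted (error: https://github.com/alibaba/higress/issues/1536), so the gateway reported failed Prometheus scrapes as well as dropped xDS connections. We tried raising the CPU allocation for the controller and the gateway (pilot from 2 cores to 8 cores, gateway from 4 cores to 12 cores), but the problem persisted. We then stopped Prometheus scraping entirely, and CPU usage remained high. (Screenshot attached: fdfab5f053810dd5555a58de7f5009f)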

Ⅲ. Describe what you expected to happen

CPU usage stays in a normal range (no more than 50%?).

Ⅳ. How to reproduce it (as minimally and precisely as possible)

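1. Deploy Higress v2.0.2 with helm.
2. Enable Prometheus monitoring on the gateway.
3. Simulate registering 10,000 services in the Eureka registry.
4. Add the Eureka registry as a service source in the Higress console.
5. Observe the result: Prometheus scrapes fail, the gateway logs errors, and the controller restarts.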
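
For anyone trying to reproduce the "10,000 simulated services" step, here is a minimal sketch of one way to do it. It is not the reporter's script. It assumes a Spring Cloud style Eureka server at `http://localhost:8761/eureka` and uses Eureka's REST registration call (`POST /eureka/apps/{appID}`). The service names, host names, IP address, and port are all made up for illustration.

```python
import json
import urllib.request

# Assumption: Eureka server address; adjust to your environment.
EUREKA_URL = "http://localhost:8761/eureka"


def build_instance(index: int) -> dict:
    """Build the JSON registration body for one fake service (hypothetical names)."""
    app = f"MOCK-SERVICE-{index:05d}"
    host = f"mock-host-{index:05d}"
    return {
        "instance": {
            "instanceId": f"{host}:{app.lower()}:8080",
            "hostName": host,
            "app": app,
            "ipAddr": "10.0.0.1",  # placeholder address
            "vipAddress": app.lower(),
            "status": "UP",
            "port": {"$": 8080, "@enabled": "true"},
            "dataCenterInfo": {
                "@class": "com.netflix.appinfo.InstanceInfo$DefaultDataCenterInfo",
                "name": "MyOwn",
            },
        }
    }


def register(index: int, base_url: str = EUREKA_URL) -> int:
    """Register one fake service and return the HTTP status code."""
    payload = build_instance(index)
    app = payload["instance"]["app"]
    req = urllib.request.Request(
        f"{base_url}/apps/{app}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return resp.status  # Eureka answers 204 on successful registration


def register_many(count: int = 10_000, base_url: str = EUREKA_URL) -> int:
    """Register `count` distinct services; return how many were accepted."""
    accepted = 0
    for i in range(count):
        if register(i, base_url) == 204:
            accepted += 1
    return accepted
```

Calling `register_many()` against a test Eureka server creates 10,000 distinct applications with one instance each. The script sends no heartbeats, so Eureka may evict these instances once their leases expire. A longer-running reproduction would also need to renew each lease with `PUT /eureka/apps/{appID}/{instanceID}`.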

Ⅴ. Anything else we need to know?

Ⅵ. Environment:

Higress version: 2.0.2
OS : kylin V10 SP3
Others: k8s v1.28, deployed with helm

Dec 20 '24 13:12 lcfang
