envoy When the number of clusters is large, there is a large periodic delay in xds update

Title: When the number of clusters is large, there is a large periodic delay in xds update

Description:

With about 4000 clusters configured through xDS, an update normally takes about 1 second to take effect, but roughly every half a minute there is a delay of 6 to 7 seconds. This periodic delay grows with the number of clusters: each additional 1000 clusters adds about 2 seconds. Observing with netstat shows that during a large delay the xDS update data arrives in the TCP receive buffer normally but is not consumed by Envoy until the delay has passed.
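
For anyone trying to reproduce the netstat observation, here is a minimal Python sketch that pulls the Recv-Q value for connections to the xDS server out of `netstat -tn` style output. The sample text, the addresses, and the port 8001 are assumptions for illustration only; substitute whatever port your control plane actually serves xDS on.

```python
def recv_queue_by_peer(netstat_output, peer_port):
    """Return {"local->remote": recv_q_bytes} for TCP connections to peer_port.

    Expects `netstat -tn` style rows:
    Proto Recv-Q Send-Q Local-Address Foreign-Address State
    """
    queues = {}
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip banner and header lines; data rows start with tcp/tcp6.
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        recv_q, local, remote = fields[1], fields[3], fields[4]
        if not recv_q.isdigit():
            continue
        # The port is whatever follows the last ':' in the remote address.
        if remote.rsplit(":", 1)[-1] == str(peer_port):
            queues[f"{local}->{remote}"] = int(recv_q)
    return queues


# Hypothetical sample output, not captured from the reported environment.
SAMPLE = """\
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp   183520      0 10.0.0.5:41234          10.0.0.9:8001           ESTABLISHED
tcp        0      0 10.0.0.5:50022          10.0.0.7:443            ESTABLISHED
"""

print(recv_queue_by_peer(SAMPLE, 8001))
```

Sampling this once a second and logging the value with a timestamp would show whether a non-zero Recv-Q lines up with the slow updates.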

Repro steps:

Using Contour as the control plane, deploy Envoy as the traffic gateway of a k8s cluster in edge proxy mode. In the test environment, each ingress corresponds to one service, and every 50 services correspond to one deployment. Create 4000 ingresses and the corresponding resources. Then add or delete an ingress and observe how long the change takes to take effect. Large delays occur periodically. As noted in the description, netstat shows that during a large delay the xDS update data arrives in the TCP receive buffer normally but is not consumed by Envoy until the delay has passed.
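
To check whether the slow updates really are periodic, the measured delays can be reduced to the times of the spikes and the gaps between them. The sketch below assumes you have already recorded (time, delay) pairs by timing each ingress change; the sample numbers and the 3 second threshold are made up for illustration and are not measurements from this environment.

```python
def find_spikes(samples, threshold_s):
    """Return the timestamps whose measured delay exceeds threshold_s.

    samples: list of (t_seconds, delay_seconds) pairs, ordered by time.
    """
    return [t for t, delay in samples if delay > threshold_s]


def spike_intervals(spike_times):
    """Return the gaps in seconds between consecutive spikes."""
    return [later - earlier for earlier, later in zip(spike_times, spike_times[1:])]


# Hypothetical measurements: one ingress change every 10 seconds.
SAMPLES = [
    (0, 1.0), (10, 1.1), (20, 0.9), (30, 6.5), (40, 1.0),
    (50, 1.2), (60, 6.8), (70, 1.0), (80, 1.1), (90, 6.4),
]

spikes = find_spikes(SAMPLES, threshold_s=3.0)
intervals = spike_intervals(spikes)
print("spike times:", spikes)
print("intervals:", intervals)
```

Roughly constant intervals would point at something timer-driven rather than at the size of any individual update.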

Sorry, no additional log information is available due to network isolation.

Aug 08 '22 07:08 hxysayhi

What version of Envoy are you using? There have been several fixes to reduce the CPU time spent processing CDS. This sounds like the main thread's CPU is the bottleneck.

Aug 08 '22 17:08 kyessenov

cc @adisuissa

Aug 08 '22 17:08 kyessenov

Envoy version is 1.21.1.

What puzzles me is why Envoy stops consuming xDS data from the TCP buffer for a fixed period every half a minute, even though the load is idle and no other configuration changes have been made. At other times, continuous changes still take effect almost immediately (about 1s). The configuration changes that take effect slowly are no different from those that take effect normally.

Aug 09 '22 00:08 hxysayhi

Can you collect a CPU profile while running this? Ingestion is done by the main thread (not the worker threads), so a single thread is in charge of it. What may be happening is that Envoy ingests a large config (which takes a long time) and keeps the data in the gRPC buffer while processing it, but this is challenging to debug without traces or profiling.
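
One possible way to capture such a profile is Envoy's admin `/cpuprofiler` endpoint, which as far as I know only works on builds compiled with gperftools. The Python sketch below toggles it around a time window; the admin address `127.0.0.1:9001` and the 90 second window are assumptions, so adjust them to your bootstrap config.

```python
import time
import urllib.request


def cpuprofiler_url(admin_address, enable):
    """Build the admin URL that turns the CPU profiler on or off."""
    flag = "y" if enable else "n"
    return f"http://{admin_address}/cpuprofiler?enable={flag}"


def set_cpu_profiler(admin_address, enable, timeout_s=5.0):
    """POST to the admin endpoint and return the HTTP status code."""
    request = urllib.request.Request(
        cpuprofiler_url(admin_address, enable), data=b"", method="POST"
    )
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return response.status


def profile_window(admin_address, seconds):
    """Enable profiling, wait long enough to span a slow update, then disable."""
    set_cpu_profiler(admin_address, True)
    try:
        time.sleep(seconds)
    finally:
        set_cpu_profiler(admin_address, False)


# Example call (not run here; the address is an assumption):
# profile_window("127.0.0.1:9001", 90)
```

A window of 90 seconds should cover at least two of the half-minute spikes. The profile is written to the `profile_path` set in the admin section of the bootstrap config and can then be inspected with pprof.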

Aug 09 '22 08:08 adisuissa

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

Sep 08 '22 12:09 github-actions[bot]

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.

Sep 15 '22 16:09 github-actions[bot]