java-control-plane
java-control-plane copied to clipboard
Delta XDS - improved
This is a PR based on work from @sschepens https://github.com/envoyproxy/java-control-plane/pull/152 and @mikegajda https://github.com/HubSpot/java-control-plane/pull/4 .
At Allegro we've implemented it here: https://github.com/allegro/envoy-control/pull/255 .
We've seen 50-70% improvements in CPUs usage in Envoys that do not have significant traffic (mostly do XDS) and ~35% in more RPS intensive apps.
@mikegajda - some of your commits are missing sign-off, do you mind if I squash them and add you as a co-author?
Sure that'd be great, thanks @slonka!
Codecov Report
Merging #166 (fdc7425) into main (c7e27d4) will decrease coverage by
3.77%. The diff coverage is80.33%.
@@ Coverage Diff @@
## main #166 +/- ##
============================================
- Coverage 90.01% 86.24% -3.78%
- Complexity 225 351 +126
============================================
Files 28 41 +13
Lines 721 1141 +420
Branches 57 93 +36
============================================
+ Hits 649 984 +335
- Misses 49 112 +63
- Partials 23 45 +22
| Impacted Files | Coverage Δ | |
|---|---|---|
| ...lane/server/AdsDiscoveryRequestStreamObserver.java | 100.00% <ø> (ø) |
|
| ...lane/server/XdsDiscoveryRequestStreamObserver.java | 100.00% <ø> (ø) |
|
| ...ne/server/callback/SnapshotCollectingCallback.java | 89.58% <25.00%> (-5.98%) |
:arrow_down: |
| ...proxy/controlplane/cache/GroupCacheStatusInfo.java | 66.66% <50.00%> (ø) |
|
| ...oyproxy/controlplane/cache/ResourceMapBuilder.java | 58.82% <58.82%> (ø) |
|
| .../io/envoyproxy/controlplane/cache/SimpleCache.java | 78.05% <68.96%> (-4.68%) |
:arrow_down: |
| ...ne/server/DeltaDiscoveryRequestStreamObserver.java | 72.81% <72.81%> (ø) |
|
| ...server/AdsDeltaDiscoveryRequestStreamObserver.java | 84.21% <84.21%> (ø) |
|
| ...envoyproxy/controlplane/cache/DeltaXdsRequest.java | 88.88% <88.88%> (ø) |
|
| ...oyproxy/controlplane/server/V3DiscoveryServer.java | 93.54% <88.88%> (-3.60%) |
:arrow_down: |
| ... and 22 more |
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.
FYI: I'm no longer an employee of Allegro and my involvement with Envoy related projects will decrease.
:wave: @lukidzi & @Ferdudas97 wanted to introduce myself here since @`slonka and I worked on this a bit before. I've picked up work on Envoy again and would love to see this get merged so that it can be used. What's the current status on this from your end, and is there anything I can do to help this progress?
👋 @lukidzi & @Ferdudas97 wanted to introduce myself here since @`slonka and I worked on this a bit before. I've picked up work on Envoy again and would love to see this get merged so that it can be used. What's the current status on this from your end, and is there anything I can do to help this progress?
Hi, work in this pr is probably done and we want to test it on our environment, but publishing snapshots is broken (I wrote about it on slack). gpg: signing failed: unusable secret key. Now, we are waiting for the secret key update.
Tests on our environment are finished. We confirmed that, using delta improves cpu utilization ( dropped by ~50%).
@Ferdudas97 Hello!
I have tried this patch out on Spotify's perimeter, which is Envoy based and uses a custom control plane that sends SoTW updates to our Envoy fleet. The only updates sent concern (atm) changes in clusters endpoints, and we have something like 350 upstream clusters configured in our Envoys.
I tried out this patch on one of our Envoys that serves production traffic (around 8.5K RPS), and didn't observe any noticeable changes in CPU usage, though I could verify that the Envoy was using incremental xDS . So I am curious about the setup you tried this on, since you noticed such a drastic improvement: could you share some details about the setup (e.g. number of clusters, type and frequency of changes etc.)?
Hi @rulex123, I tested it on our environment where we have ~1750 clusters. We are using envoy-control, which is our abstraction layer over java-control-plane. According to our metrics, it returns 200-600 snapshots per minute.

On our service with enabled DELTA_GRPC and wildcard dependencies (service is receiving SoTW), we can see CPU utilization drop by 50%.
Part of envoy's config
dynamic_resources:
lds_config:
ads: {}
resource_api_version: V3
initial_fetch_timeout: 20s
cds_config:
ads: {}
resource_api_version: V3
initial_fetch_timeout: 20s
ads_config:
api_type: DELTA_GRPC
transport_api_version: V3
grpc_services:
envoy_grpc:
cluster_name: envoy-control-xds
I would like to start removing support for xDS v2 altogether, do you think it would be possible to coordinate that effort with adding support for xDS incremental? i.e. If we removed support for v2 and then updated this patch accordingly, the amount of code to test/review to ship xDS incremental would reduce significantly.
I would like to start removing support for xDS v2 altogether, do you think it would be possible to coordinate that effort with adding support for xDS incremental? i.e. If we removed support for v2 and then updated this patch accordingly, the amount of code to test/review to ship xDS incremental would reduce significantly.
@rulex123 We don't want to merge these two topics, since these are two different topics. I have a local branch with almost removed v2 API and would like to push it as soon as we merge this pull request with delta XDS.
I would like to start removing support for xDS v2 altogether, do you think it would be possible to coordinate that effort with adding support for xDS incremental? i.e. If we removed support for v2 and then updated this patch accordingly, the amount of code to test/review to ship xDS incremental would reduce significantly.
@rulex123 We don't want to merge these two topics, since these are two different topics. I have a local branch with almost removed v2 API and would like to push it as soon as we merge this pull request with delta XDS.
I agree we shouldn't merge the two in one patch: what I am suggesting is that we first ship removal of support for xDS v2, and after that change has been merged upstream, we rebase the delta-xds-non-breaking-slonka-hash-bytes branch on main so that the scope of the code to review/test in this patch is reduced.
I think this is worth it, especially if you have a branch where you started the v2 removal work. I can also help with that work, in case you don't have cycles to continue it. WDYT?
I would like to start removing support for xDS v2 altogether, do you think it would be possible to coordinate that effort with adding support for xDS incremental? i.e. If we removed support for v2 and then updated this patch accordingly, the amount of code to test/review to ship xDS incremental would reduce significantly.
@rulex123 We don't want to merge these two topics, since these are two different topics. I have a local branch with almost removed v2 API and would like to push it as soon as we merge this pull request with delta XDS.
I agree we shouldn't merge the two in one patch: what I am suggesting is that we first ship removal of support for xDS v2, and after that change has been merged upstream, we rebase the
delta-xds-non-breaking-slonka-hash-bytesbranch onmainso that the scope of the code to review/test in this patch is reduced.I think this is worth it, especially if you have a branch where you started the v2 removal work. I can also help with that work, in case you don't have cycles to continue it. WDYT?
I will work on removing API v2 during this week, hopefully I will prepare first PR this week
:wave: any updates here? I'd be happy to help get this merged as I'm hoping to deploy this change soon
👋 any updates here? I'd be happy to help get this merged as I'm hoping to deploy this change soon
Hey, the work is done on this branch. We are waiting for the review and merging of this PR -> there is a problem with publishing artifact for JCP.
@rulex123 @mikegajda it is ready for your reviews
Taking a look at this in a moment @Ferdudas97! Thanks for getting it to this point