java-control-plane
java-control-plane copied to clipboard
Delta xds
Working implementation of Incremental XDS. Has been running in production for several months primarily using DeltaXDS over ADS. This has no tests, but current tests should be running
Couple of implementation comments:
- This implementation does not support resource aliases, only names.
- Snapshot API has been broken to force clients to wrap their resources in a SnapshotResource which contains the given resources and its version
- Snapshot global version is still maintained and no changes will be processed if it doesn't change, even though internal resources may have changed.
- DeltaDiscoveryRequestStreamObservers keeps track of the clients known resource versions as well as pending resources. This resource knowledge is also handed to DeltaWatches as copies.
- Client known Resource versions are updated whenever a client acknowledges a response sent by the control plane considering only the resources versions sent in that response.
- Delta responses do not provide the client global version, so global snapshot versions are inferred when clients acknowledge a response for a version.
- Delta requires some type urls to be handled in a "Wildcard Mode", currently this is hardcoded to CDS and LDS. If a Stream is in wildcard mode, it will receive all new resources even without needing to be subscribed.
- When setting a new snapshot, to prevent full snapshot diff with every watch which can be very expensive, a diff is calculated with the previous snapshot. Only resources that have changed their version or been deleted are then used to notify watches.
- A watch is responded only if a new resource is added, a resource it knows has been updated been removed.
- Special conditions have been added to prevent having more than one in flight response for a given TypeURL, even though this is supported by envoy and the protocol, it introduces more complexity and can lead to subtle issues.
- Even though Snapshot version may have changed, it may be that internal resources have not changed and so, no notifications will occur.
I don't have time to add enough tests to get this merged, but i can tell you this is battle tested. Most of the design decisions specified above were taken to solve actual issues while handling delta in production, we have huge configs with 9k+ clusters and constant changes and we did find edge cases to fix. Delta is significantly different than SOTW and thoroughly testing it can be tricky.
I'm leaving this open in a Draft state, maybe someone can pick this up and get it merged.
Codecov Report
Merging #152 into master will decrease coverage by
27.00%. The diff coverage is10.98%.
@@ Coverage Diff @@
## master #152 +/- ##
=============================================
- Coverage 88.02% 61.02% -27.01%
- Complexity 296 303 +7
=============================================
Files 31 39 +8
Lines 977 1465 +488
Branches 78 119 +41
=============================================
+ Hits 860 894 +34
- Misses 85 538 +453
- Partials 32 33 +1
| Impacted Files | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| ...o/envoyproxy/controlplane/cache/DeltaResponse.java | 0.00% <0.00%> (ø) |
0.00 <0.00> (?) |
|
| ...a/io/envoyproxy/controlplane/cache/DeltaWatch.java | 0.00% <0.00%> (ø) |
0.00 <0.00> (?) |
|
| ...envoyproxy/controlplane/cache/DeltaXdsRequest.java | 0.00% <0.00%> (ø) |
0.00 <0.00> (?) |
|
| ...proxy/controlplane/cache/GroupCacheStatusInfo.java | 50.00% <0.00%> (-16.67%) |
2.00 <0.00> (ø) |
|
| ...ava/io/envoyproxy/controlplane/cache/Snapshot.java | 100.00% <ø> (ø) |
5.00 <0.00> (ø) |
|
| ...server/AdsDeltaDiscoveryRequestStreamObserver.java | 0.00% <0.00%> (ø) |
0.00 <0.00> (?) |
|
| ...lane/server/AdsDiscoveryRequestStreamObserver.java | 96.77% <ø> (-0.11%) |
16.00 <0.00> (ø) |
|
| ...ne/server/DeltaDiscoveryRequestStreamObserver.java | 0.00% <0.00%> (ø) |
0.00 <0.00> (?) |
|
| .../controlplane/server/DiscoveryServerCallbacks.java | 14.28% <0.00%> (-5.72%) |
1.00 <0.00> (ø) |
|
| ...trolplane/server/LatestDeltaDiscoveryResponse.java | 0.00% <0.00%> (ø) |
0.00 <0.00> (?) |
|
| ... and 22 more |
Continue to review full report at Codecov.
Legend - Click here to learn more
Δ = absolute <relative> (impact),ø = not affected,? = missing dataPowered by Codecov. Last update d71b9a2...f8bf24d. Read the comment docs.
Nice! Do you have any insights on the performance improvements on SOTW vs Delta?
@jakubdyszkiewicz It really depends on the use case. For example:
- If using ADS, it would see a huge jump in performance since only changed resources would get propagated.
- If not using ADS, CDS, LDS and RDS updates would get huge performance benefits, since only real changes are propagated. EDS is a different story, when not using ADS EDS generates a new stream for every Cluster and this is still true in Delta, so you wouldn't get much difference there, but still, only updated cluters would get updated so there would still be a lot of benefit unless you were using per cluster versioning of EDS in which case you could probably see little to no difference.
The performance benefit is really hard to measure, in our use case we were using non ADS SOTW with per cluster versioning which is probably the best performing case you can make of SOTW and we managed to reduce about 50% the load on our Control Plane, the amount of updates pushed from our Control Plane went from 1.5M changes per minute to 25k per minute.
But not only Control Plane performance was improved, also envoy was using less cpu when CDS was updated for example, which in our use case is huge since we have 9k+ clusters.
The key take away is:
- If you're pushing lots of updates constantly, then Delta will be a huge performance gain
- If you're not using ADS, with Delta you probably should, at least for EDS.
What's the state of this and are there plans to get this merged in? @sschepens how has this been working out in production for you?
I'm working on a project that could stand to benefit from this, and could potentially offer to write the tests here, depending on what the plan is for this.
What's the state of this and are there plans to get this merged in? @sschepens how has this been working out in production for you?
@mikegajda this has been running in production for 6+ months and is working like a charm.
I'm working on a project that could stand to benefit from this, and could potentially offer to write the tests here, depending on what the plan is for this.
That would be great! I think what we would need is Design input from other people, since this breaks existing APIs, it would be great to see if this change is the best or if some other ideas come about.
We would love to see this implemented. @mikegajda Allegro's OSS Control Plane can act as a second level of tests (we depend on JCP and have integration and resiliency tests). If you need help getting started reach out to me directly on Envoy's slack.
Thanks for the comment @slonka and pointing me at the allegro control plane stuff, I'll reach out on slack if I run into anything
:wave: I have been working here on a version of this that has some testing (working on more) and changes how versions are generated, would love to hear some thoughts/comments about this approach: https://github.com/HubSpot/java-control-plane/pull/3
Unless I am missing something, the work from this PR is already merged and we should close this.
Yes, this can be closely now, thanks @inssein for noticing that