java-control-plane
java-control-plane copied to clipboard
Send EDS response in ADS mode when Envoy reconnect and asks with Empty version
We found that Envoy has a cluster in warming state after reconnect to new control-plane.
So let's assume Envoy is connected to Control-plane1. Control-plane1 receive information about the new cluster[Cluster1] and send CDS response to Envoy using ADS. Control-plane1 die and at the same time cluster[Cluster1] has been removed. Envoy connects to Control-plane2 and send[EDS,LDS,RDS] requests. Because Envoy didn't received response EDS from Control-plane1 this cluster is in warming state. EDS request from Envoy to Control-plane2contains resource name of removed cluster[Cluster1] and Control-plane2 doesn't know about the cluster which was sent by Control-plane1 to Envoy. Because ADS implementation requires to send only response when all requested resources in snapshot Control-plane2 won’t respond. In this situation Envoy stay in warming phase for ever.
Codecov Report
Merging #127 into master will increase coverage by
0.28%. The diff coverage is100%.
@@ Coverage Diff @@
## master #127 +/- ##
============================================
+ Coverage 89.5% 89.78% +0.28%
- Complexity 193 198 +5
============================================
Files 22 22
Lines 600 617 +17
Branches 48 53 +5
============================================
+ Hits 537 554 +17
Misses 48 48
Partials 15 15
| Impacted Files | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| .../io/envoyproxy/controlplane/cache/SimpleCache.java | 84.84% <100%> (+2.23%) |
35 <0> (+5) |
:arrow_up: |
Continue to review full report at Codecov.
Legend - Click here to learn more
Δ = absolute <relative> (impact),ø = not affected,? = missing dataPowered by Codecov. Last update 046ab2b...6b1a5f1. Read the comment docs.
@lukidzi is this still relevant?
This does seem like a race that could get Envoy stuck, so it probably is worth merging in one form or another.
I think probably the logic could be:
-
If there is a "missing" ClusterLoadAssignment that Envoy asked for, check if there is a corresponding cluster in the Snapshot. If there is not, respond with an empty EDS message (because probably it's this race, where the cluster was deleted but Envoy didn't see a CDS update removing it).
-
If the cluster DOES exist in the snapshot, then either 1) the caller generated an inconsistent Snapshot as a bug or 2) the caller will provide a subsequent Snapshot that has the endpoints. I think in either of these cases, we could respond with an empty EDS if the acked version is
"", otherwise we should keep the current behavior of not responding. I could also be convinced to just keep the current behavior of not responding, although it's probably better to return something to envoy so other clusters/endpoints don't get stuck on a single bad/missing resource.