java-control-plane icon indicating copy to clipboard operation
java-control-plane copied to clipboard

Send EDS response in ADS mode when Envoy reconnect and asks with Empty version

Open lukidzi opened this issue 5 years ago • 3 comments
trafficstars

We found that Envoy has a cluster in warming state after reconnect to new control-plane. So let's assume Envoy is connected to Control-plane1. Control-plane1 receive information about the new cluster[Cluster1] and send CDS response to Envoy using ADS. Control-plane1 die and at the same time cluster[Cluster1] has been removed. Envoy connects to Control-plane2 and send[EDS,LDS,RDS] requests. Because Envoy didn't received response EDS from Control-plane1 this cluster is in warming state. EDS request from Envoy to Control-plane2contains resource name of removed cluster[Cluster1] and Control-plane2 doesn't know about the cluster which was sent by Control-plane1 to Envoy. Because ADS implementation requires to send only response when all requested resources in snapshot Control-plane2 won’t respond. In this situation Envoy stay in warming phase for ever.

lukidzi avatar Jan 30 '20 14:01 lukidzi

Codecov Report

Merging #127 into master will increase coverage by 0.28%. The diff coverage is 100%.

Impacted file tree graph

@@             Coverage Diff              @@
##             master     #127      +/-   ##
============================================
+ Coverage      89.5%   89.78%   +0.28%     
- Complexity      193      198       +5     
============================================
  Files            22       22              
  Lines           600      617      +17     
  Branches         48       53       +5     
============================================
+ Hits            537      554      +17     
  Misses           48       48              
  Partials         15       15
Impacted Files Coverage Δ Complexity Δ
.../io/envoyproxy/controlplane/cache/SimpleCache.java 84.84% <100%> (+2.23%) 35 <0> (+5) :arrow_up:

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 046ab2b...6b1a5f1. Read the comment docs.

codecov-io avatar Jan 30 '20 14:01 codecov-io

@lukidzi is this still relevant?

slonka avatar Sep 22 '20 18:09 slonka

This does seem like a race that could get Envoy stuck, so it probably is worth merging in one form or another.

I think probably the logic could be:

  1. If there is a "missing" ClusterLoadAssignment that Envoy asked for, check if there is a corresponding cluster in the Snapshot. If there is not, respond with an empty EDS message (because probably it's this race, where the cluster was deleted but Envoy didn't see a CDS update removing it).

  2. If the cluster DOES exist in the snapshot, then either 1) the caller generated an inconsistent Snapshot as a bug or 2) the caller will provide a subsequent Snapshot that has the endpoints. I think in either of these cases, we could respond with an empty EDS if the acked version is "", otherwise we should keep the current behavior of not responding. I could also be convinced to just keep the current behavior of not responding, although it's probably better to return something to envoy so other clusters/endpoints don't get stuck on a single bad/missing resource.

mpuncel avatar Sep 28 '20 16:09 mpuncel