envoy
envoy copied to clipboard
Fix healthchecks that do not persist when priority is changed
Commit Message: Use previously calculated healthcheck when endpoints move priority levels repeatedly. This is done by changing how mutable_cross_priority_host_map_ tracks hosts. mutable_cross_priority_host_map_ is used to track if a host already exists in the previous configuration, and its healthcheck should not be recalculated. This worked only some of the time because it would remove and then add all hosts that have changed in a priority, to a map of string (ip addr:port) to Host instance. However this did not account for when an endpoint had two Host representatives in different priorities, as is in the case when an endpoint changes priorities and there is an edge case where a host can be removed from mutable_cross_priority_host_map_ before the "should we skip activate healthchecks" logic triggers. This PR fixes that by only removing an endpoint from mutable_cross_priority_host_map_ if that removal is executed from the lowest priority. This fix does assume that memberships in priorities are always calculated starting from the lowest number to the highest.
Additional Description: Risk Level: Med? Testing: see note at the bottom Docs Changes: n/a Release Notes: added a note to the changelog Platform Specific Features: [Optional Runtime guard:] [Optional Fixes #Issue] https://github.com/envoyproxy/envoy/issues/35243 [Optional Fixes commit #PR or SHA] [Optional Deprecated:] [Optional API Considerations:]
Explanation
This was a difficult bug to spot because it only happens when priority changes multiple times
For example let's consider this situation:
- we have an endpoint A that swaps between priority 0, 1, and 2
- we assume that priority is always processed starting from 0 and going up (eg, 0, 1, 2, 3, 4... etc)
- mutable_cross_priority_host_map_ is the "list" in the situation that includes all endpoints from all priorities
When priority number goes up things are ok
0 -> 1 processing priority 0: remove A processing priority 1: add A
1 -> 2 priority 1: remove A priority 2: add A
but things get weird when numbers go down
2 -> 1, things are still peaceful here priority 1: add A (but this gets ignored since A is already in the list) priority 2: remove A (!!!)
1 -> 0, at this point the list does not include A, so any logic that checks if A exists in the cross priorty host map will fail and A will be considered as a new endpoint.
without the fix: https://gist.github.com/shulin-sq/adfb4268f5f199f054e908e3fd7afae8 with the fix: https://gist.github.com/shulin-sq/7779a341e598d81cfaeca447b0f582d1
Testing
- functional test described in the issue: https://github.com/envoyproxy/envoy/issues/35243
- unit test, confirmed that it fails before the fix is applied
Hi @shulin-sq, welcome and thank you for your contribution.
We will try to review your Pull Request as quickly as possible.
In the meantime, please take a look at the contribution guidelines if you have not done so already.
Thanks for tackling this! Can you please fix DCO and format?
@adisuissa thanks for pointing it out!
I made some changes
- ran the format script
tools/local_fix_format.sh -all - added DCO
- squashed my commits
- fixed some of my comments on the new part of the unit test
Assigning @wbpcode for senior-maintainer review. /assign @wbpcode
/wait
/retest
@adisuissa @wbpcode I have updated the PR to address the comments from the last round.
@wbpcode gentle nudge to review
I will get some free time to review this PR this weekend. Thanks. :)
Could you merge the main to kick the CI?
@wbpcode done. Can you help with merging?
@wbpcode just checking in, I did a rebase from master. did that not kick off the CI?