envoy icon indicating copy to clipboard operation
envoy copied to clipboard

Fix healthchecks that do not persist when priority is changed

Open shulin-sq opened this issue 1 year ago • 7 comments

Commit Message: Use previously calculated healthcheck when endpoints move priority levels repeatedly. This is done by changing how mutable_cross_priority_host_map_ tracks hosts. mutable_cross_priority_host_map_ is used to track if a host already exists in the previous configuration, and its healthcheck should not be recalculated. This worked only some of the time because it would remove and then add all hosts that have changed in a priority, to a map of string (ip addr:port) to Host instance. However this did not account for when an endpoint had two Host representatives in different priorities, as is in the case when an endpoint changes priorities and there is an edge case where a host can be removed from mutable_cross_priority_host_map_ before the "should we skip activate healthchecks" logic triggers. This PR fixes that by only removing an endpoint from mutable_cross_priority_host_map_ if that removal is executed from the lowest priority. This fix does assume that memberships in priorities are always calculated starting from the lowest number to the highest.

Additional Description: Risk Level: Med? Testing: see note at the bottom Docs Changes: n/a Release Notes: added a note to the changelog Platform Specific Features: [Optional Runtime guard:] [Optional Fixes #Issue] https://github.com/envoyproxy/envoy/issues/35243 [Optional Fixes commit #PR or SHA] [Optional Deprecated:] [Optional API Considerations:]

Explanation

This was a difficult bug to spot because it only happens when priority changes multiple times

For example let's consider this situation:

  • we have an endpoint A that swaps between priority 0, 1, and 2
  • we assume that priority is always processed starting from 0 and going up (eg, 0, 1, 2, 3, 4... etc)
  • mutable_cross_priority_host_map_ is the "list" in the situation that includes all endpoints from all priorities

When priority number goes up things are ok

0 -> 1 processing priority 0: remove A processing priority 1: add A

1 -> 2 priority 1: remove A priority 2: add A

but things get weird when numbers go down

2 -> 1, things are still peaceful here priority 1: add A (but this gets ignored since A is already in the list) priority 2: remove A (!!!)

1 -> 0, at this point the list does not include A, so any logic that checks if A exists in the cross priorty host map will fail and A will be considered as a new endpoint.

without the fix: https://gist.github.com/shulin-sq/adfb4268f5f199f054e908e3fd7afae8 with the fix: https://gist.github.com/shulin-sq/7779a341e598d81cfaeca447b0f582d1

Testing

  • functional test described in the issue: https://github.com/envoyproxy/envoy/issues/35243
  • unit test, confirmed that it fails before the fix is applied

shulin-sq avatar Aug 20 '24 01:08 shulin-sq

Hi @shulin-sq, welcome and thank you for your contribution.

We will try to review your Pull Request as quickly as possible.

In the meantime, please take a look at the contribution guidelines if you have not done so already.

:cat:

Caused by: https://github.com/envoyproxy/envoy/pull/35748 was opened by shulin-sq.

see: more, trace.

Thanks for tackling this! Can you please fix DCO and format?

adisuissa avatar Aug 20 '24 13:08 adisuissa

@adisuissa thanks for pointing it out!

I made some changes

  • ran the format script tools/local_fix_format.sh -all
  • added DCO
  • squashed my commits
  • fixed some of my comments on the new part of the unit test

shulin-sq avatar Aug 20 '24 17:08 shulin-sq

Assigning @wbpcode for senior-maintainer review. /assign @wbpcode

adisuissa avatar Aug 22 '24 14:08 adisuissa

/wait

adisuissa avatar Aug 23 '24 15:08 adisuissa

/retest

shulin-sq avatar Aug 26 '24 17:08 shulin-sq

@adisuissa @wbpcode I have updated the PR to address the comments from the last round.

shulin-sq avatar Aug 26 '24 21:08 shulin-sq

@wbpcode gentle nudge to review

ggreenway avatar Aug 29 '24 20:08 ggreenway

I will get some free time to review this PR this weekend. Thanks. :)

wbpcode avatar Aug 30 '24 01:08 wbpcode

Could you merge the main to kick the CI?

wbpcode avatar Sep 04 '24 14:09 wbpcode

@wbpcode done. Can you help with merging?

shulin-sq avatar Sep 04 '24 20:09 shulin-sq

@wbpcode just checking in, I did a rebase from master. did that not kick off the CI?

shulin-sq avatar Sep 06 '24 18:09 shulin-sq