envoy icon indicating copy to clipboard operation
envoy copied to clipboard

upstream: implementation of outlier detection extensions

Open cpakulski opened this issue 9 months ago • 42 comments

Commit Message: upstream: implementation of outlier detection extensions

Additional Description: The idea and need for the extensions are described in RFC document: https://docs.google.com/document/d/1ZCZSoirVB39eOLdD0VPlsEUING8c23Sq5bzozrv6f4k/edit?usp=drive_link

In a nutshell, the design decouples types of result reported to outlier detector from an algorithm which marks a host as unhealthy. For example, an outlier detector may be configured to count 3 consecutive errors and type of those errors can be defined by a user. For example:

  • count http codes in range 500-503
  • count http codes in range 401-403
  • count locally originated errors (resets, timeouts)

So, the algorithm does not really care about the exact type of reported result. It is only interested whether reported result should be considered an error or not. The idea of using user-defined errors can be expanded to database errors. See issue #24215 (I have working prototype for errors reported by Redis).

This design puts extensions on top of already existing outlier detector. It means that the solution is 100% backwards compatible. Previous configs are accepted. But a user may configure outlier detection extension to

  • use "old" outlier detection to react to 5xx errors and add another range of HTTP errors (say 4xx)
  • disable "old" outlier detection and configure everything using extensions

The implementation of extensions is built on top of already existing outlier detection structures. This minimizes code changes and re-uses event logger, timers, etc.

The implementation is built on top of already approved API for extensions: https://github.com/envoyproxy/envoy/pull/31205

Risk Level: Low (previous configuration still works. Extensions do not have to be configured) Testing: Added unit tests for new code and tests checking co-existence of "old" outlier and "new" extensions Docs Changes: Yes. Added. Release Notes: Yes. Platform Specific Features: No Fixes #18789

cpakulski avatar May 14 '24 20:05 cpakulski

As a reminder, PRs marked as draft will not be automatically assigned reviewers, or be handled by maintainer-oncall triage.

Please mark your PR as ready when you want it to be reviewed!

:cat:

Caused by: https://github.com/envoyproxy/envoy/pull/34154 was opened by cpakulski.

see: more, trace.

CC @envoyproxy/api-shepherds: Your approval is needed for changes made to (api/envoy/|docs/root/api-docs/). envoyproxy/api-shepherds assignee is @wbpcode CC @envoyproxy/api-watchers: FYI only for changes made to (api/envoy/|docs/root/api-docs/).

:cat:

Caused by: https://github.com/envoyproxy/envoy/pull/34154 was opened by cpakulski.

see: more, trace.

/retest

cpakulski avatar May 17 '24 18:05 cpakulski

@wbpcode I see that you have been assigned for API approval, but changes in proto files are not really API changes, but rather changes to event log which is defined in terms of protobufs. Hope it helps!

cpakulski avatar May 17 '24 21:05 cpakulski

will the old outlier detection extension be eventually deprecated? (given that the new one lands)

nezdolik avatar May 20 '24 21:05 nezdolik

will the old outlier detection extension be eventually deprecated? (given that the new one lands)

It depends on the community and adoption of extensions. In general, everything what the current outlier offers will be implemented in form of extensions, so deprecation should be easy. I believe that removing "old" outlier should improve performance a bit, because in the current implementation all monitors like frequency, success rate always run regardless whether users use them to eject nodes or not.

cpakulski avatar May 21 '24 15:05 cpakulski

Looks like the extension docs (*.rst) also need to be added.

Do you have a specific doc in mind? Maybe I missed something. I checked the docs output and "outlier detection" extensions are automatically added to docs based on proto's annotations.

cpakulski avatar May 21 '24 15:05 cpakulski

/retest

cpakulski avatar May 21 '24 23:05 cpakulski

/retest

cpakulski avatar May 22 '24 15:05 cpakulski

will the old outlier detection extension be eventually deprecated? (given that the new one lands)

It depends on the community and adoption of extensions. In general, everything what the current outlier offers will be implemented in form of extensions, so deprecation should be easy. I believe that removing "old" outlier should improve performance a bit, because in the current implementation all monitors like frequency, success rate always run regardless whether users use them to eject nodes or not.

Deprecating old mechanism would as well reduce maintenance load on cognitive load for outlier detection api (where same thing can be achieved via different config mechanisms). Think we should do a proper deprecation cycle going forward.

nezdolik avatar May 23 '24 10:05 nezdolik

will the old outlier detection extension be eventually deprecated? (given that the new one lands)

It depends on the community and adoption of extensions. In general, everything what the current outlier offers will be implemented in form of extensions, so deprecation should be easy. I believe that removing "old" outlier should improve performance a bit, because in the current implementation all monitors like frequency, success rate always run regardless whether users use them to eject nodes or not.

Deprecating old mechanism would as well reduce maintenance load on cognitive load for outlier detection api (where same thing can be achieved via different config mechanisms). Think we should do a proper deprecation cycle going forward.

OK. Let me try to add warnings that "old" consecutive errors will be deprecated. Thanks!

==================================================================================== Update: Actually after some thinking I believe that it would be best to start deprecating and displaying warnings when extensions's functionality is equivalent to the "old" outlier. I mean after success rate and failure frequency monitors are implemented. Once we have this PR merged, the framework is ready and implementing those missing monitors should not take too much time. The reason is that "old" outlier is "flat". Once one option is enabled, it actually enables all methods and we would like to display warning only when a user uses "old" consecutive_5xx or consecutive_gateway or consecutive_local_origin.

cpakulski avatar May 23 '24 19:05 cpakulski

/retest

cpakulski avatar May 24 '24 18:05 cpakulski

/retest

cpakulski avatar May 24 '24 23:05 cpakulski

@nezdolik @wbpcode Thanks for reviewing this PR. I think I addressed all the comments. In some cases, I provided explanation why I am not planning to change anything and leave it as is. I would appreciate if you could do another pass. Thanks!

cpakulski avatar May 27 '24 01:05 cpakulski

@cpakulski thank you for all the work done! i added one more comment and believe this is now ready for further review. @wbpcode do you have capacity to review this one? (otherwise will tag senior maintainers)

nezdolik avatar May 27 '24 21:05 nezdolik

/assign-from @envoyproxy/senior-maintainers

nezdolik avatar May 29 '24 10:05 nezdolik

@envoyproxy/senior-maintainers assignee is @wbpcode

:cat:

Caused by: a https://github.com/envoyproxy/envoy/pull/34154#issuecomment-2137133218 was created by @nezdolik.

see: more, trace.

@wbpcode Thanks for your initial comments. I agree with you that this is large PR. The additional problem is that extensions should work along with "previous" implementation. I believe that code can be cleaner, but keeping backwards compatibility sometimes requires building not-very clean approach. Once we start deprecating "old" way we can start cleaning the code.

I addressed most of your comments. One or two comments require more thoughts and I will work on them tomorrow.

cpakulski avatar Jun 05 '24 23:06 cpakulski

/retest

cpakulski avatar Jun 11 '24 13:06 cpakulski

/retest

cpakulski avatar Jun 13 '24 00:06 cpakulski

Is this ready for a new review? cc @cpakulski

wbpcode avatar Jun 13 '24 02:06 wbpcode

@wbpcode Thanks for keeping an eye on this PR! I still have to refactor FACTORY to check config when it arrives not when monitor is created. I hope to finish it today and will ask you for another review.

cpakulski avatar Jun 13 '24 13:06 cpakulski

@wbpcode I addressed your major comments. There are few small issues left, like CODEOWNERS. I think it is ready for another review. Thanks a lot for great suggestions!

cpakulski avatar Jun 13 '24 23:06 cpakulski

/retest

cpakulski avatar Jun 18 '24 14:06 cpakulski

@wbpcode This is ready for review. CI fail is not related to my changes. Other PRs fail as well. Whenever you have time to review it. Thanks a lot!

cpakulski avatar Jun 19 '24 18:06 cpakulski

@wbpcode Friendly ping :-)

cpakulski avatar Jul 01 '24 01:07 cpakulski

/retest

cpakulski avatar Jul 11 '24 00:07 cpakulski

/wait

wbpcode avatar Jul 15 '24 06:07 wbpcode

@wbpcode Thanks for great feedback. I addressed all your comments and refactored the code.

cpakulski avatar Jul 17 '24 19:07 cpakulski

/retest

cpakulski avatar Jul 23 '24 01:07 cpakulski