envoy
envoy copied to clipboard
upstream: implementation of outlier detection extensions
Commit Message: upstream: implementation of outlier detection extensions
Additional Description: The idea and need for the extensions are described in RFC document: https://docs.google.com/document/d/1ZCZSoirVB39eOLdD0VPlsEUING8c23Sq5bzozrv6f4k/edit?usp=drive_link
In a nutshell, the design decouples types of result reported to outlier detector from an algorithm which marks a host as unhealthy. For example, an outlier detector may be configured to count 3 consecutive errors and type of those errors can be defined by a user. For example:
- count http codes in range 500-503
- count http codes in range 401-403
- count locally originated errors (resets, timeouts)
So, the algorithm does not really care about the exact type of reported result. It is only interested whether reported result should be considered an error or not. The idea of using user-defined errors can be expanded to database errors. See issue #24215 (I have working prototype for errors reported by Redis).
This design puts extensions on top of already existing outlier detector. It means that the solution is 100% backwards compatible. Previous configs are accepted. But a user may configure outlier detection extension to
- use "old" outlier detection to react to 5xx errors and add another range of HTTP errors (say 4xx)
- disable "old" outlier detection and configure everything using extensions
The implementation of extensions is built on top of already existing outlier detection structures. This minimizes code changes and re-uses event logger, timers, etc.
The implementation is built on top of already approved API for extensions: https://github.com/envoyproxy/envoy/pull/31205
Risk Level: Low (previous configuration still works. Extensions do not have to be configured) Testing: Added unit tests for new code and tests checking co-existence of "old" outlier and "new" extensions Docs Changes: Yes. Added. Release Notes: Yes. Platform Specific Features: No Fixes #18789
As a reminder, PRs marked as draft will not be automatically assigned reviewers, or be handled by maintainer-oncall triage.
Please mark your PR as ready when you want it to be reviewed!
CC @envoyproxy/api-shepherds: Your approval is needed for changes made to (api/envoy/|docs/root/api-docs/)
.
envoyproxy/api-shepherds assignee is @wbpcode
CC @envoyproxy/api-watchers: FYI only for changes made to (api/envoy/|docs/root/api-docs/)
.
/retest
@wbpcode I see that you have been assigned for API approval, but changes in proto files are not really API changes, but rather changes to event log which is defined in terms of protobufs. Hope it helps!
will the old outlier detection extension be eventually deprecated? (given that the new one lands)
will the old outlier detection extension be eventually deprecated? (given that the new one lands)
It depends on the community and adoption of extensions. In general, everything what the current outlier offers will be implemented in form of extensions, so deprecation should be easy. I believe that removing "old" outlier should improve performance a bit, because in the current implementation all monitors like frequency
, success rate
always run regardless whether users use them to eject nodes or not.
Looks like the extension docs (*.rst) also need to be added.
Do you have a specific doc in mind? Maybe I missed something. I checked the docs output and "outlier detection" extensions are automatically added to docs based on proto's annotations.
/retest
/retest
will the old outlier detection extension be eventually deprecated? (given that the new one lands)
It depends on the community and adoption of extensions. In general, everything what the current outlier offers will be implemented in form of extensions, so deprecation should be easy. I believe that removing "old" outlier should improve performance a bit, because in the current implementation all monitors like
frequency
,success rate
always run regardless whether users use them to eject nodes or not.
Deprecating old mechanism would as well reduce maintenance load on cognitive load for outlier detection api (where same thing can be achieved via different config mechanisms). Think we should do a proper deprecation cycle going forward.
will the old outlier detection extension be eventually deprecated? (given that the new one lands)
It depends on the community and adoption of extensions. In general, everything what the current outlier offers will be implemented in form of extensions, so deprecation should be easy. I believe that removing "old" outlier should improve performance a bit, because in the current implementation all monitors like
frequency
,success rate
always run regardless whether users use them to eject nodes or not.Deprecating old mechanism would as well reduce maintenance load on cognitive load for outlier detection api (where same thing can be achieved via different config mechanisms). Think we should do a proper deprecation cycle going forward.
OK. Let me try to add warnings that "old" consecutive errors will be deprecated. Thanks!
====================================================================================
Update:
Actually after some thinking I believe that it would be best to start deprecating and displaying warnings when extensions's functionality is equivalent to the "old" outlier. I mean after success rate
and failure frequency
monitors are implemented. Once we have this PR merged, the framework is ready and implementing those missing monitors should not take too much time. The reason is that "old" outlier is "flat". Once one option is enabled, it actually enables all methods and we would like to display warning only when a user uses "old" consecutive_5xx
or consecutive_gateway
or consecutive_local_origin
.
/retest
/retest
@nezdolik @wbpcode Thanks for reviewing this PR. I think I addressed all the comments. In some cases, I provided explanation why I am not planning to change anything and leave it as is. I would appreciate if you could do another pass. Thanks!
@cpakulski thank you for all the work done! i added one more comment and believe this is now ready for further review. @wbpcode do you have capacity to review this one? (otherwise will tag senior maintainers)
/assign-from @envoyproxy/senior-maintainers
@envoyproxy/senior-maintainers assignee is @wbpcode
@wbpcode Thanks for your initial comments. I agree with you that this is large PR. The additional problem is that extensions should work along with "previous" implementation. I believe that code can be cleaner, but keeping backwards compatibility sometimes requires building not-very clean approach. Once we start deprecating "old" way we can start cleaning the code.
I addressed most of your comments. One or two comments require more thoughts and I will work on them tomorrow.
/retest
/retest
Is this ready for a new review? cc @cpakulski
@wbpcode Thanks for keeping an eye on this PR! I still have to refactor FACTORY to check config when it arrives not when monitor is created. I hope to finish it today and will ask you for another review.
@wbpcode I addressed your major comments. There are few small issues left, like CODEOWNERS. I think it is ready for another review. Thanks a lot for great suggestions!
/retest
@wbpcode This is ready for review. CI fail is not related to my changes. Other PRs fail as well. Whenever you have time to review it. Thanks a lot!
@wbpcode Friendly ping :-)
/retest
/wait
@wbpcode Thanks for great feedback. I addressed all your comments and refactored the code.
/retest