emissary
Add support for active health checking
Description
Adds support for active health checking. If an upstream cluster (an Envoy cluster) fails its configured active health check threshold, Envoy will no longer route requests to it.
Related Issues
List related issues.
Testing
A few sentences describing what testing you've done, e.g., manual tests, automated tests, deployed in production, etc.
Checklist
- [x] I made sure to update CHANGELOG.md.
Remember, the CHANGELOG needs to mention:
- Any new features
- Any changes to our included version of Envoy
- Any non-backward-compatible changes
- Any deprecations
- [x] This is unlikely to impact how Ambassador performs at scale.
Remember, things that might have an impact at scale include:
- Any significant changes in memory use that might require adjusting the memory limits
- Any significant changes in CPU use that might require adjusting the CPU limits
- Anything that might change how many replicas users should use
- Changes that impact data-plane latency/scalability
- [ ] My change is adequately tested.
Remember when considering testing:
- Your change needs to be specifically covered by tests.
- Tests need to cover all the states where your change is relevant: for example, if you add a behavior that can be enabled or disabled, you'll need tests that cover the enabled case and tests that cover the disabled case. It's not sufficient just to test with the behavior enabled.
- You also need to make sure that the entire area being changed has adequate test coverage.
- If existing tests don't actually cover the entire area being changed, add tests.
- This applies even for aspects of the area that you're not changing – check the test coverage, and improve it if needed!
- We should lean on the bulk of code being covered by unit tests, but...
- ... an end-to-end test should cover the integration points
- [ ] I updated DEVELOPING.md with any special dev tricks I had to use to work on this code efficiently.
- [x] The changes in this PR have been reviewed for security concerns and adherence to security best practices.
Perhaps a future consideration would be to create a HealthChecking CRD that defines a single set of config for HealthChecking that can be used as a default config for either all or a subset of upstreams. I don't think that is necessary at this time though.
Does this fix https://github.com/datawire/apro/issues/2431 and help with https://github.com/datawire/apro/issues/2911?
@KatieLo For https://github.com/datawire/apro/issues/2431: Envoy's active healthchecking is not the same thing as Kubernetes healthchecking, but it sounds like both would resolve the same problem.
With Envoy's active healthchecking you can specify (on a Mapping) what the healthchecks should look like: which path and port to check, the interval between checks, how many failed checks it takes to mark an upstream as unhealthy, and how many successful checks it takes for an unhealthy upstream to be marked as healthy again.
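For reference, most of these knobs map onto fields of Envoy's own cluster-level health check config. Here is a rough sketch in Python that builds the dict form of one entry in an Envoy cluster's `health_checks` list. The field names follow Envoy's v3 `HealthCheck` message as I understand it. The helper name `envoy_http_health_check` and the sample values are made up for illustration. It does not show the Mapping-level field names from this PR, and it leaves out the port override.

```python
import json


def envoy_http_health_check(path, interval_s, timeout_s,
                            unhealthy_threshold, healthy_threshold):
    """Build the dict form of one entry in an Envoy cluster's `health_checks` list.

    Hypothetical helper for illustration only; it is not part of this PR.
    """
    return {
        # Durations use the protobuf JSON form, e.g. "5s".
        "timeout": f"{timeout_s}s",
        "interval": f"{interval_s}s",
        # Consecutive failures before a host is marked unhealthy.
        "unhealthy_threshold": unhealthy_threshold,
        # Consecutive successes before an unhealthy host is marked healthy again.
        "healthy_threshold": healthy_threshold,
        "http_health_check": {"path": path},
    }


# Sample values only; pick whatever suits the upstream.
check = envoy_http_health_check(
    path="/health",
    interval_s=5,
    timeout_s=1,
    unhealthy_threshold=3,
    healthy_threshold=2,
)
print(json.dumps(check, indent=2))
```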
If you have 10 pods and 2 of them become unhealthy for whatever reason (maybe they are overloaded or having transient problems), then Envoy will stop routing requests to those unhealthy pods but continue healthchecking them. Once they become healthy again, it will start routing requests to them again. This gives those pods a chance to recover while requests keep going only to the pods that are known to be healthy.
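To make the threshold behavior concrete, here is a small Python simulation of the 10-pod scenario. It is a simplified model, not Envoy's code: the `HostHealth` class, the `routable` helper, and the threshold values are all made up for illustration, and Envoy's real implementation has extra cases that are not modeled here.

```python
from dataclasses import dataclass


@dataclass
class HostHealth:
    """Tracks one upstream host against the two thresholds (illustrative only)."""
    unhealthy_threshold: int
    healthy_threshold: int
    healthy: bool = True
    streak: int = 0  # consecutive results that disagree with the current state

    def record(self, check_passed: bool) -> bool:
        """Feed in one health check result and return the resulting state."""
        if check_passed == self.healthy:
            # The result agrees with the current state, so the streak is broken.
            self.streak = 0
            return self.healthy
        self.streak += 1
        needed = self.unhealthy_threshold if self.healthy else self.healthy_threshold
        if self.streak >= needed:
            # Enough consecutive contrary results: flip the state.
            self.healthy = not self.healthy
            self.streak = 0
        return self.healthy


def routable(hosts):
    """Names of the hosts that would still receive traffic."""
    return [name for name, host in hosts.items() if host.healthy]


hosts = {
    f"pod-{i}": HostHealth(unhealthy_threshold=3, healthy_threshold=2)
    for i in range(10)
}

# Two pods fail three checks in a row and are taken out of rotation.
for _ in range(3):
    for name, host in hosts.items():
        host.record(check_passed=name not in ("pod-0", "pod-1"))
print(len(routable(hosts)))  # 8

# The checks keep running; after two passes in a row the pods are back.
for _ in range(2):
    for host in hosts.values():
        host.record(check_passed=True)
print(len(routable(hosts)))  # 10
```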
For https://github.com/datawire/apro/issues/2911, I'm afraid I don't know too much about the graphql endpoint, but if triggering a restart of the Ambassador pods solves the problem, then it sounds more like a problem with Ambassador than with the upstream pods.
Of course I'll write some docs for this to explain the usage and whatnot when we decide it's good to merge :wink: