Outlier detection healthcheck UNEJECT is confusing
Title: Outlier detection healthcheck UNEJECT is confusing
Description: I had the following cluster configuration, where I wanted to disable consecutive_5xx ejection completely and enable only consecutive_gateway_failure:
"outlier_detection": {
  "consecutive_5xx": 100000, // this is deliberate. I wanted to turn it off completely as I don't want it to trigger on 500s
  "interval": "2s",
  "base_ejection_time": "10s",
  "max_ejection_percent": 33,
  "enforcing_consecutive_5xx": 0, // This is also supposed to turn off consecutive_5xx
  "success_rate_minimum_hosts": 2,
  "success_rate_request_volume": 20,
  "consecutive_gateway_failure": 3,
  "enforcing_consecutive_gateway_failure": 100,
  "max_ejection_time": "30s",
  "max_ejection_time_jitter": "1s"
},
After trying to kill an instance in the cluster, I saw entries like these in the outlier detection logs:
{
  action: EJECT
  cluster_name: bounty
  eject_consecutive_event: {
  }
  enforced: true
  num_ejections: 1
  timestamp: 2024-07-08T18:00:50.740Z
  type: CONSECUTIVE_GATEWAY_FAILURE
  upstream_url: 10.0.209.198:24467
}
{
  action: UNEJECT
  cluster_name: bounty
  enforced: false
  num_ejections: 1
  secs_since_last_action: 7
  timestamp: 2024-07-08T18:00:58.521Z
  type: CONSECUTIVE_5XX
  upstream_url: 10.0.209.198:24467
}
{
  action: EJECT
  cluster_name: bounty
  eject_consecutive_event: {
  }
  enforced: true
  num_ejections: 2
  secs_since_last_action: 0
  timestamp: 2024-07-08T18:00:58.536Z
  type: CONSECUTIVE_GATEWAY_FAILURE
  upstream_url: 10.0.209.198:24467
}
The UNEJECT coming from type CONSECUTIVE_5XX was very confusing as:
- EJECT was triggered by Gateway failure and not 500
- It unejected from CONSECUTIVE_5XX even though the config is set to a very high threshold and enforcing_consecutive_5xx is set to 0.
- The log line also had enforced: false, which should have meant that no action was taken, yet the host was in fact unejected. The documentation is also confusing here: it describes the field as relevant only for action eject, but the field is logged for uneject as well. The documentation says: "If action is eject, specifies if the ejection was enforced. true means the host was ejected. false means the event was logged but the host was not actually ejected."
It took me a while to figure out why this was happening. The cause was that my health check configuration had unhealthy_threshold: 2, and it was the active health check (even after the first check failed, the cluster was still considered healthy) that triggered the UNEJECT, not CONSECUTIVE_5XX. I got my desired behaviour by setting successful_active_health_check_uneject_host to false. It would probably have helped if the section about health checking in the docs were under the ejection algorithm section and not under gRPC (I am proposing that change in #35185).
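For anyone hitting the same behaviour, the workaround might look roughly like the fragment below. It repeats the outlier_detection settings from the description and adds only successful_active_health_check_uneject_host. This is a sketch of the change I made rather than a recommended configuration, and I have not checked which Envoy versions accept the field.

```json
"outlier_detection": {
  "consecutive_5xx": 100000,
  "interval": "2s",
  "base_ejection_time": "10s",
  "max_ejection_percent": 33,
  "enforcing_consecutive_5xx": 0,
  "success_rate_minimum_hosts": 2,
  "success_rate_request_volume": 20,
  "consecutive_gateway_failure": 3,
  "enforcing_consecutive_gateway_failure": 100,
  "max_ejection_time": "30s",
  "max_ejection_time_jitter": "1s",
  "successful_active_health_check_uneject_host": false
}
```

With this set, a passing active health check should no longer uneject a host that outlier detection has ejected, so the host stays ejected until its ejection time expires.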
Expected Behaviour: Can the logging be changed so that UNEJECT events triggered by an active health check are logged with type HEALTH_CHECK instead of defaulting to CONSECUTIVE_5XX?
Relevant Links:
https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/upstream/outlier#ejection-algorithm
https://www.envoyproxy.io/docs/envoy/latest/api-v3/data/cluster/v3/outlier_detection_event.proto#data-cluster-v3-outlierdetectionevent