envoy icon indicating copy to clipboard operation
envoy copied to clipboard

Outlier detection healthcheck UNEJECT is confusing

Open NersesAM opened this issue 7 months ago • 7 comments

Title: Outlier detection healthcheck UNEJECT is confusing

Description: I had the following cluster configuration, where I wanted to disable consecutive_5xx completely and only enable it for consecutive_gateway_failure :

     "outlier_detection": {
       "consecutive_5xx": 100000, // this is deliberate. I wanted to turn it off completely as I don't want it to trigger on 500s
       "interval": "2s",
       "base_ejection_time": "10s",
       "max_ejection_percent": 33,
       "enforcing_consecutive_5xx": 0, // This is also supposed to turn off consecutive_5xx
       "success_rate_minimum_hosts": 2,
       "success_rate_request_volume": 20,
       "consecutive_gateway_failure": 3,
       "enforcing_consecutive_gateway_failure": 100,
       "max_ejection_time": "30s",
       "max_ejection_time_jitter": "1s"
      },

After trying to kill a instance in the cluster I would see these kind of logs in the outlier detection logs

{
   action: EJECT
   cluster_name: bounty
   eject_consecutive_event: {
   }
   enforced: true
   num_ejections: 1
   timestamp: 2024-07-08T18:00:50.740Z
   type: CONSECUTIVE_GATEWAY_FAILURE
   upstream_url: 10.0.209.198:24467
}

{
   action: UNEJECT
   cluster_name: bounty
   enforced: false
   num_ejections: 1
   secs_since_last_action: 7
   timestamp: 2024-07-08T18:00:58.521Z
   type: CONSECUTIVE_5XX
   upstream_url: 10.0.209.198:24467
}

{
   action: EJECT
   cluster_name: bounty
   eject_consecutive_event: {
   }
   enforced: true
   num_ejections: 2
   secs_since_last_action: 0
   timestamp: 2024-07-08T18:00:58.536Z
   type: CONSECUTIVE_GATEWAY_FAILURE
   upstream_url: 10.0.209.198:24467
}

The UNEJECT coming from type CONSECUTIVE_5XX was very confusing as:

  1. EJECT was triggered by Gateway failure and not 500
  2. It unejected from CONSECUTIVE_5XX even though the config is set to a very high threshold and enforcing_consecutive_5xx is set to 0.
  3. The log line also had enforced: false which should have meant it didn't action but in fact it actually did UNEJECT. The documentation is also confusing as it states to only be relevant for action eject, but is logged for uneject also If action is eject, specifies if the ejection was enforced. true means the host was ejected. false means the event was logged but the host was not actually ejected.

It took me a while to figure out why was this happening and that in fact this was because I had my healthcheck configuration set with unhealthy_threshold: 2 and it was the healthcheck(even if first failed, cluster was still considered healthy) that was triggering the UNEJECT and not the CONSECUTIVE_5XX. I managed to get my desired configuration by setting successful_active_health_check_uneject_host to false. It would probably have helped if the section about health checking in the docs were under ejection algorithm and not under gRPC (proposing that change under #35185)

Expected Behaviour Can the logging be changed so that UNEJECT events triggered by an active health check are of type HEALTH_CHECK instead of defaulting to CONSECUTIVE_5XX?

Relevant Links: https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/upstream/outlier#ejection-algorithm https://www.envoyproxy.io/docs/envoy/latest/api-v3/data/cluster/v3/outlier_detection_event.proto#data-cluster-v3-outlierdetectionevent

NersesAM avatar Jul 15 '24 10:07 NersesAM