reverse-proxy
                                
Handle destination gray failures better (Degraded health state)
What should we add or change to make your life better?
The existing health policies (e.g. ConsecutiveFailuresHealthPolicy) can be problematic when destinations are neither entirely up nor entirely down. Such gray failures do happen in practice, and the current policies produce sub-optimal routing decisions for them.
Why is this important to you?
This would help maintain high availability for external scenarios despite transient instability of internal services and/or machines.
Proposal
- Introduce a new Degraded destination health state (see also: #1011)
- Modify destination selection policy to prefer destinations in Healthy state rather than Degraded, but still pick Degraded destinations if there are "few" (according to some criteria) Healthy ones
- Modify active health checks to allow a destination to declare itself to be degraded if it would like to not be a preferred choice for traffic, while allowing YARP to make the final decision whether to use this destination or not based on all signals it has access to
- Destination health transitions should avoid spurious Healthy and Unhealthy determinations based on single health observations. The diagram below proposes a symmetric decision process where a transition out of the Degraded state requires three consecutive consistent health observations
--- 
title: Destination health state diagram (PROPOSAL)
--- 
stateDiagram-v2 
    [*] --> Unknown 
    Unknown --> Healthy: active probe == "success" 
    Unknown --> Unhealthy: active probe != "success" 
    Degraded --> Healthy: active probe == "success"\n(3rd consecutive) 
    Healthy --> Degraded: proxy result == "degraded" OR\nactive probe == "degraded" OR\nactive probe != "success" in 2 of past 5 evals
    Degraded --> Unhealthy: active probe != "success"\n(3rd consecutive) 
    Unhealthy --> Degraded: active probe == "success"
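
To make the proposed transitions concrete, here is a minimal, language-neutral sketch of the diagram as a small state machine. It is not YARP code: `DestinationHealth`, `on_active_probe` and `on_proxy_result` are hypothetical names. It also makes two assumptions the diagram does not spell out: any probe result other than "success" (including "degraded") counts as a non-success, and the observation window and consecutive counter are reset whenever the state changes.

```python
from collections import deque
from enum import Enum


class Health(Enum):
    UNKNOWN = "Unknown"
    HEALTHY = "Healthy"
    DEGRADED = "Degraded"
    UNHEALTHY = "Unhealthy"


class DestinationHealth:
    """Hypothetical per-destination tracker; not a YARP type."""

    def __init__(self):
        self.state = Health.UNKNOWN
        self.recent = deque(maxlen=5)  # last 5 active probe outcomes (True = success)
        self.streak_ok = None          # outcome of the current consecutive run
        self.streak = 0                # length of the current consecutive run

    def _move_to(self, new_state):
        # Assumption: history is reset on every transition, so the
        # "3rd consecutive" and "2 of past 5" rules only count
        # observations made in the current state.
        self.state = new_state
        self.recent.clear()
        self.streak_ok = None
        self.streak = 0

    def on_active_probe(self, result):
        ok = result == "success"
        self.recent.append(ok)
        if ok == self.streak_ok:
            self.streak += 1
        else:
            self.streak_ok, self.streak = ok, 1

        if self.state is Health.UNKNOWN:
            self._move_to(Health.HEALTHY if ok else Health.UNHEALTHY)
        elif self.state is Health.HEALTHY:
            failures = sum(1 for r in self.recent if not r)
            if result == "degraded" or failures >= 2:
                self._move_to(Health.DEGRADED)
        elif self.state is Health.DEGRADED:
            # Leave Degraded only after three consecutive consistent results.
            if self.streak >= 3:
                self._move_to(Health.HEALTHY if ok else Health.UNHEALTHY)
        elif self.state is Health.UNHEALTHY:
            if ok:
                self._move_to(Health.DEGRADED)
        return self.state

    def on_proxy_result(self, result):
        # Passive signal: the diagram only uses it for Healthy --> Degraded.
        if self.state is Health.HEALTHY and result == "degraded":
            self._move_to(Health.DEGRADED)
        return self.state
```

With this sketch, a Degraded destination that alternates between success and failure stays Degraded, because neither run ever reaches three consecutive observations.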
NOTE: This is related to #1011 but goes beyond it in scope. Filing as a separate issue seemed appropriate.
We should consider letting Affinity still target Degraded nodes (see https://github.com/microsoft/reverse-proxy/issues/2335#issuecomment-1821274266)
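
The selection preference from the proposal (use Healthy destinations, fall back to Degraded ones when Healthy ones are "few") could look roughly like the sketch below. Everything here is an assumption for illustration: the function names, the `min_healthy` threshold standing in for the unspecified "few" criterion, and the random pick standing in for whatever load-balancing policy is configured. Unknown and Unhealthy destinations are simply left out to keep the example small.

```python
import random


def eligible_destinations(destinations, min_healthy=2):
    """Return the names that may receive traffic.

    destinations: list of (name, state) pairs, where state is one of
    "Healthy", "Degraded", "Unhealthy", "Unknown".
    min_healthy: hypothetical threshold for "few" Healthy destinations.
    """
    healthy = [name for name, state in destinations if state == "Healthy"]
    degraded = [name for name, state in destinations if state == "Degraded"]
    if len(healthy) >= min_healthy:
        # Enough Healthy destinations: Degraded ones get no traffic.
        return healthy
    # Too few Healthy destinations: widen the pool with Degraded ones.
    return healthy + degraded


def pick_destination(destinations, min_healthy=2, rng=random):
    """Pick one eligible destination at random, or None if there is none."""
    candidates = eligible_destinations(destinations, min_healthy)
    return rng.choice(candidates) if candidates else None
```

A session affinity exception, as suggested in the linked comment, would sit in front of this step: if the request is already bound to a Degraded destination, that destination could be used directly instead of going through the pool selection.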