Handle destination gray failures better (Degraded health state)

Open davidni opened this issue 2 years ago • 1 comment

What should we add or change to make your life better?

The existing health policies (e.g. ConsecutiveFailuresHealthPolicy) can be problematic when destinations are neither entirely up nor entirely down. This happens in practice, and the current policies then produce sub-optimal routing decisions.

Why is this important to you?

This would help maintain high availability for external scenarios despite transient instability of internal services and/or machines.

Proposal

  • Introduce a new Degraded destination health state (see also: #1011)
  • Modify destination selection policy to prefer destinations in Healthy state rather than Degraded, but still pick Degraded destinations if there are "few" (according to some criteria) Healthy ones
  • Modify active health checks to allow a destination to declare itself to be degraded if it would like to not be a preferred choice for traffic, while allowing YARP to make the final decision whether to use this destination or not based on all signals it has access to
  • Destination health transitions should avoid spurious Healthy and Unhealthy determinations based on single health observations. The diagram below proposes a symmetric decision process where a transition out of the Degraded state requires three consecutive consistent health observations
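To make the selection preference from the second bullet concrete, here is a minimal Python sketch. It is illustrative only and not YARP code: the `Health` enum, the `candidate_destinations` function, and the `min_healthy` threshold are all hypothetical names, and the threshold merely stands in for the "few" criterion, which this proposal leaves open. How Unknown destinations should be treated is also left out.

```python
from enum import Enum


class Health(Enum):
    UNKNOWN = "unknown"
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    UNHEALTHY = "unhealthy"


def candidate_destinations(destinations, min_healthy=2):
    """Return the destination names eligible for load balancing.

    destinations: dict mapping destination name -> Health.
    min_healthy:  hypothetical stand-in for the "few" criterion; Degraded
                  destinations are admitted only when fewer than this many
                  Healthy destinations are available.
    """
    healthy = [name for name, h in destinations.items() if h is Health.HEALTHY]
    if len(healthy) >= min_healthy:
        # Enough Healthy destinations: Degraded ones are not considered.
        return healthy
    degraded = [name for name, h in destinations.items() if h is Health.DEGRADED]
    # Too few Healthy destinations: fall back to including Degraded ones.
    return healthy + degraded
```

With two Healthy destinations the Degraded one is skipped; once only one Healthy destination remains, the Degraded one becomes eligible again, while Unhealthy destinations are never returned.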
--- 
title: Destination health state diagram (PROPOSAL)
--- 

stateDiagram-v2 
    [*] --> Unknown 
    Unknown --> Healthy: active probe == "success" 
    Unknown --> Unhealthy: active probe != "success" 
    Degraded --> Healthy: active probe == "success"\n(3rd consecutive) 
    Healthy --> Degraded: proxy result == "degraded" OR\nactive probe == "degraded" OR\nactive probe != "success" in 2 of past 5 evals
    Degraded --> Unhealthy: active probe != "success"\n(3rd consecutive)
    Unhealthy --> Degraded: active probe == "success"
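
The transitions in the diagram can be sketched as a small state machine. This is a rough Python illustration, not a proposed implementation. The `DestinationHealth` class and its method names are hypothetical, and it makes several assumptions the diagram does not settle: probe results are the strings "success", "degraded", or anything else for failure; the diagram is read literally, so a "degraded" probe counts as != "success"; "2 of past 5" means at least two; and the observation history restarts on every state change.

```python
from collections import deque


class DestinationHealth:
    """Hypothetical tracker for one destination's proposed health state."""

    def __init__(self):
        self.state = "Unknown"
        self._recent = deque(maxlen=5)  # last five active probe results

    def _enter(self, state):
        self.state = state
        self._recent.clear()  # assumption: history restarts on each transition

    def on_proxy_result(self, result):
        # Passive signal: only a "degraded" proxy result demotes a Healthy node.
        if self.state == "Healthy" and result == "degraded":
            self._enter("Degraded")
        return self.state

    def on_active_probe(self, result):
        self._recent.append(result)
        ok = result == "success"
        if self.state == "Unknown":
            self._enter("Healthy" if ok else "Unhealthy")
        elif self.state == "Healthy":
            failures = sum(r != "success" for r in self._recent)
            if result == "degraded" or failures >= 2:
                self._enter("Degraded")
        elif self.state == "Degraded":
            last3 = list(self._recent)[-3:]
            if len(last3) == 3:
                # Leaving Degraded needs three consecutive consistent probes.
                if all(r == "success" for r in last3):
                    self._enter("Healthy")
                elif all(r != "success" for r in last3):
                    self._enter("Unhealthy")
        elif self.state == "Unhealthy" and ok:
            self._enter("Degraded")
        return self.state
```

Under these assumptions, a single failed probe leaves a Healthy destination Healthy, a second failure within five evaluations moves it to Degraded, and it takes three consecutive successes (or three consecutive non-successes) to leave Degraded in either direction.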

NOTE: This is related to #1011 but goes beyond it in scope. Filing as a separate issue seemed appropriate.

davidni avatar Mar 15 '23 00:03 davidni

We should consider letting Affinity still target Degraded nodes (see https://github.com/microsoft/reverse-proxy/issues/2335#issuecomment-1821274266)

karelz avatar Nov 28 '23 18:11 karelz