Handling passive health checks across multiple instances of YARP
I'm following this code, which implements a passive health check policy with YARP:
```csharp
using System;
using Microsoft.AspNetCore.Http;
using Yarp.ReverseProxy.Health;
using Yarp.ReverseProxy.Model;

public class ThrottlingHealthPolicy : IPassiveHealthCheckPolicy
{
    public static string ThrottlingPolicyName = "ThrottlingPolicy";

    private readonly IDestinationHealthUpdater _healthUpdater;

    public ThrottlingHealthPolicy(IDestinationHealthUpdater healthUpdater)
    {
        _healthUpdater = healthUpdater;
    }

    public string Name => ThrottlingPolicyName;

    public void RequestProxied(HttpContext context, ClusterState cluster, DestinationState destination)
    {
        var headers = context.Response.Headers;
        if (context.Response.StatusCode is 429 or >= 500)
        {
            // Default back-off when the backend doesn't say how long to wait.
            var retryAfterSeconds = 10;
            if (headers.TryGetValue("Retry-After", out var retryAfterHeader) && retryAfterHeader.Count > 0 && int.TryParse(retryAfterHeader[0], out var retryAfter))
            {
                retryAfterSeconds = retryAfter;
            }
            // OpenAI-style rate-limit reset headers.
            else if (headers.TryGetValue("x-ratelimit-reset-requests", out var rateLimitResetRequests) && rateLimitResetRequests.Count > 0 && int.TryParse(rateLimitResetRequests[0], out var rateLimitResetRequest))
            {
                retryAfterSeconds = rateLimitResetRequest;
            }
            else if (headers.TryGetValue("x-ratelimit-reset-tokens", out var rateLimitResetTokens) && rateLimitResetTokens.Count > 0 && int.TryParse(rateLimitResetTokens[0], out var rateLimitResetToken))
            {
                retryAfterSeconds = rateLimitResetToken;
            }

            // Mark this destination unhealthy for the back-off window; YARP
            // restores it to healthy once the interval elapses.
            _healthUpdater.SetPassive(cluster, destination, DestinationHealth.Unhealthy, TimeSpan.FromSeconds(retryAfterSeconds));
        }
    }
}
```
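For reference, a custom passive policy like this is registered in DI and then selected per cluster by its `Name`. A minimal sketch of the wiring (route, cluster, and backend names are placeholders):

```csharp
using System.Collections.Generic;
using Microsoft.AspNetCore.Builder;
using Microsoft.Extensions.DependencyInjection;
using Yarp.ReverseProxy.Configuration;
using Yarp.ReverseProxy.Health;

var builder = WebApplication.CreateBuilder(args);

// Register the custom policy; YARP resolves it by its Name property.
builder.Services.AddSingleton<IPassiveHealthCheckPolicy, ThrottlingHealthPolicy>();

builder.Services.AddReverseProxy().LoadFromMemory(
    routes: new[]
    {
        new RouteConfig
        {
            RouteId = "openai-route",
            ClusterId = "openai",
            Match = new RouteMatch { Path = "{**catch-all}" }
        }
    },
    clusters: new[]
    {
        new ClusterConfig
        {
            ClusterId = "openai",
            HealthCheck = new HealthCheckConfig
            {
                // Enable passive checks and select the custom policy by name.
                Passive = new PassiveHealthCheckConfig
                {
                    Enabled = true,
                    Policy = ThrottlingHealthPolicy.ThrottlingPolicyName
                }
            },
            Destinations = new Dictionary<string, DestinationConfig>
            {
                ["backend1"] = new() { Address = "https://backend1.example/" },
                ["backend2"] = new() { Address = "https://backend2.example/" }
            }
        }
    });

var app = builder.Build();
app.MapReverseProxy();
app.Run();
```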
One of the limitations is:
This solution uses local memory to store the endpoints' health state, which means each instance has its own view of the throttling state of each OpenAI endpoint. What might happen at runtime is this:
1. Instance 1 receives a customer request and gets a 429 error from backend 1. It marks that backend as unavailable for X seconds and reroutes the request to the next backend.
2. Instance 2 receives a customer request and sends it to backend 1 again, because its locally cached view of the backends doesn't include what instance 1 learned when it marked the backend as throttled. Backend 1 responds with 429 again, so instance 2 also marks it as unavailable and reroutes the request to the next backend.
Question:
Is there any other option for storing this endpoint health state in a centralized place, instead of in local memory, which may not work for multiple instances of YARP?
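One direction I've been considering is an external shared store such as Redis, where the throttle window is written with a TTL so every instance sees the same state. A rough sketch (the store, key naming, and wiring are my own assumptions, using StackExchange.Redis):

```csharp
using System;
using System.Threading.Tasks;
using StackExchange.Redis;

// Hypothetical store that keeps throttle windows in Redis so that all
// YARP instances share one view. The key naming is my own convention.
public class RedisThrottleStore
{
    private readonly IDatabase _redis;

    public RedisThrottleStore(IConnectionMultiplexer connection)
    {
        _redis = connection.GetDatabase();
    }

    // Record that a destination is throttled. The key expires on its own
    // after retryAfter, so no explicit cleanup is needed.
    public Task MarkThrottledAsync(string destinationAddress, TimeSpan retryAfter)
        => _redis.StringSetAsync($"throttled:{destinationAddress}", 1, expiry: retryAfter);

    // True while the destination's back-off window is still open.
    public Task<bool> IsThrottledAsync(string destinationAddress)
        => _redis.KeyExistsAsync($"throttled:{destinationAddress}");
}
```

`RequestProxied` could call `MarkThrottledAsync` alongside `SetPassive`, but each instance would still need something (e.g., a hosted service polling these keys) to apply `SetPassive` to its own in-process state, since YARP only consults local destination health.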
We use Orleans for holding session affinity state and rate-limiting state; it could similarly be used for health state. It would require some DB or object storage for Orleans clustering, but once you have that, it opens up a bunch of other potential use cases.
Thanks @rkargMsft for the suggestion. Do you have any samples you could share here for reference?
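For reference, a minimal sketch of what the grain-based approach could look like (hypothetical names; assumes an Orleans cluster is already configured):

```csharp
using System;
using System.Threading.Tasks;
using Orleans;

// Hypothetical grain holding the throttle state for one backend, keyed by
// its address. Every YARP instance talks to the same grain, so they all
// share one view of that backend's health.
public interface IEndpointHealthGrain : IGrainWithStringKey
{
    Task MarkThrottledAsync(TimeSpan retryAfter);
    Task<bool> IsThrottledAsync();
}

public class EndpointHealthGrain : Grain, IEndpointHealthGrain
{
    private DateTimeOffset _throttledUntil = DateTimeOffset.MinValue;

    public Task MarkThrottledAsync(TimeSpan retryAfter)
    {
        _throttledUntil = DateTimeOffset.UtcNow.Add(retryAfter);
        return Task.CompletedTask;
    }

    public Task<bool> IsThrottledAsync()
        => Task.FromResult(DateTimeOffset.UtcNow < _throttledUntil);
}
```

The health policy could then resolve the grain with `clusterClient.GetGrain<IEndpointHealthGrain>(destination.Model.Config.Address)` and call `MarkThrottledAsync` whenever it sees a 429.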