.NET Redis connector and resilience
I'm trying out Microsoft.SemanticKernel.Connectors.Redis and following the docs that show how to configure it. Here's my actual service registration:
```csharp
builder.Services.AddScoped<ISemanticTextMemory>(services =>
{
    var embedder = services.GetRequiredService<ITextEmbeddingGenerationService>();
    var redisDb = services.GetRequiredService<IConnectionMultiplexer>().GetDatabase();
    var memory = new RedisMemoryStore(redisDb, vectorSize: 384);
    return new SemanticTextMemory(memory, embedder);
});
```
And then elsewhere in my application, I do:
```csharp
var semanticTextMemory = scope.ServiceProvider.GetRequiredService<ISemanticTextMemory>();
await semanticTextMemory.SaveReferenceAsync(...);
```
The problem is that, when coded in this naive way, it's not reliable. I'm encountering two classes of errors:
- During startup, the `SaveReferenceAsync` call may fail with `StackExchange.Redis.RedisServerException: LOADING Redis is loading the dataset in memory`. Digging into the logs from the Redis instance, I see there's a `<search> creating vector index` step that can take a while if it's reloading pre-seeded data from a bind mount.
- Even if startup succeeds, any subsequent `SaveReferenceAsync` call may fail with `StackExchange.Redis.RedisTimeoutException: Timeout awaiting response`. I ran into this after inserting about 30k reference entries (perhaps the RediSearch instance was rebuilding some part of the HNSW index). I know I could extend the timeout from its default of 5s, but the SK docs don't tell me to do that.
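For what it's worth, the kind of app-level workaround I'd otherwise have to hand-roll for the startup case looks something like this (a sketch of my own, not from any docs; the probe key, polling interval, and helper name are arbitrary):

```csharp
using StackExchange.Redis;

// Poll until Redis has finished loading its dataset. A plain GET fails with
// "LOADING ..." while the reload (and vector index creation) is in progress.
static async Task WaitForRedisReadyAsync(IDatabase db, TimeSpan timeout)
{
    var deadline = DateTime.UtcNow + timeout;
    while (true)
    {
        try
        {
            await db.StringGetAsync("readiness-probe"); // arbitrary key; the value is irrelevant
            return;
        }
        catch (RedisServerException ex) when (ex.Message.StartsWith("LOADING"))
        {
            if (DateTime.UtcNow > deadline) throw;
            await Task.Delay(TimeSpan.FromSeconds(1));
        }
    }
}
```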
So my overall question is: What role do you see Semantic Kernel playing within the whole business of resilience?
I could see the possible answers being:
- SK is meant to help with resilience. Each connector, like `RedisMemoryStore`, is supposed to implement its own retries, plus the docs should describe what configuration or app-level logic is needed to help with this.
  - For example, the SK docs should tell you to set a Redis query timeout much longer than 5s based on the assumed use case of building an HNSW index.
- Or, SK is not meant to help with resilience. Application logic needs to assume that all calls into all SK abstractions may fail, and needs to implement its own app-specific retry rules.
  - This would be logically consistent but arguably undermines the concept of these abstractions if you still have to understand all the possible failure modes of all possible backends.
But really I'd like to get the SK team's take on what role you want to play, or not, in app resilience in general.
The intent has been that components should get what they need from DI, and that resilience should be provided via DI as well. For example, components get an `HttpClient` from DI, and you use Microsoft.Extensions.Http.Resilience to add a resilience handler into your services:
```csharp
services.ConfigureHttpClientDefaults(b => b.AddStandardResilienceHandler());
```
and resiliency is handled there. We don't want every individual component to need to build its own support, nor do we want SK itself to duplicate other efforts in the ecosystem; rather, it should build on them. I'd hope that if `RedisMemoryStore` is using `HttpClient`, it's getting it from DI, in which case this approach should work. If it's using HTTP but not getting it from DI, we should fix that. I could imagine that's the case, given that the memory stuff is all in flux and hasn't been overhauled for DI yet (but that needs to happen asap).
cc: @dmytrostruk
That makes sense conceptually, though I'd like to see this covered in the getting started docs for the Redis connector if every realistic app is going to need it.
From the stack trace, I see RedisMemoryStore uses NRedisStack which in turn uses StackExchange.Redis. Based on searching in the StackExchange.Redis sources, it does not appear to be built on HttpClient (and this comment from @mgravell backs that up). So my guess is that some other resilience mechanism is required.
From a web search, the first thing that comes up is https://github.com/maca88/StackExchange.Redis.Resilience. I don't know if there's something more mainstream, but that project has very low usage and hasn't been updated in a year. Also, having just tried it, it did not fix my `LOADING Redis is loading the dataset in memory` error. As such, I'm unsure what we'd really recommend.
Speaking about resilience in general: it will probably be configured differently for each connector. In some connectors we use `HttpClient`, and as @stephentoub mentioned, we should get it from DI with a configured resilience strategy. In other cases we use the SDK provided by the vector DB (as with Redis here, and this is our preferred approach), and then it really depends on what resilience options the specific connector SDK provides (it may be possible to inject an `HttpClient` from DI, configure resilience with some public API from the SDK, or implement a custom wrapper around it if nothing is provided out-of-the-box).
Taking into account the different implementations for each vector DB provider, I'm not really sure that the SK abstraction is a good place to unify resilience configuration for every connector. That said, I wouldn't rule this approach out completely; it would be interesting to experiment with the abstraction and see what's possible, and we will probably do that during the refactoring process for the memory connectors.
I also agree that each connector should have documentation describing the relevant aspects of working with it. I don't think we need to provide an entire manual, since one should already exist on each connector's own resources; instead, I would focus on information that is useful in the context of Semantic Kernel.
I think there are a few things here.
First, @dmytrostruk, we should ensure that each of these connectors has `IServiceCollection` extensions that enable adding the component into DI, with overloads that enable it to fetch the required clients from DI. Then, when you use Aspire and its Redis component, which puts a Redis client into DI, you can just write:
```csharp
services.AddRedisMemoryStore(vectorSize: 384);
```
or whatever, and it internally will query for the IConnectionMultiplexer or whatever it needs. (I realize all of this stuff is being overhauled, so substitute in whatever the new names/concepts are for the existing.)
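A minimal sketch of what such an extension could look like, assuming the existing `RedisMemoryStore` and `SemanticTextMemory` types (the `AddRedisMemoryStore` name and overall shape are hypothetical, not a real SK API):

```csharp
// Hypothetical DI extension; everything except the SK/Redis types is illustrative.
public static class RedisMemoryServiceCollectionExtensions
{
    public static IServiceCollection AddRedisMemoryStore(
        this IServiceCollection services, int vectorSize)
    {
        services.AddScoped<ISemanticTextMemory>(sp =>
        {
            // Pick up the multiplexer that e.g. the Aspire Redis component registered.
            var db = sp.GetRequiredService<IConnectionMultiplexer>().GetDatabase();
            var embedder = sp.GetRequiredService<ITextEmbeddingGenerationService>();
            return new SemanticTextMemory(new RedisMemoryStore(db, vectorSize), embedder);
        });
        return services;
    }
}
```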
Second, I'd hope that the Aspire component would be able to configure everything according to best practices, including for reliability; then when these SK components pick up what they need from DI, they're picking up the appropriately configured support. But from chatting today with @eerhardt, it sounds like there's nothing in the Redis library that enables this post connection establishment. @nickcraver, what is the recommendation for resiliency with these components, do you know?
I'm trying to follow the "post connection establishment" part here. I'm not familiar with the library on top, but for the basics of StackExchange.Redis (I'm assuming the latest version here; if you're using an older version this may be bad advice, in which case I'd recommend upgrading for sure):
- The `ConnectionMultiplexer` connects - this seems to be successful, as commands issued after connecting are erroring with Redis reloading from disk (whether it saves to disk at all or is totally ephemeral depends on config).
- After the multiplexer connects, it will continue to handle failures and reconnects internally, but we expose events for all of this, and `ConfigurationOptions.LoggerFactory` will emit these as well.
- If we are disconnected, by default we will buffer commands for up to 5 seconds to retry when a connection comes up. If we can't connect, a timeout exception is thrown on those commands. The 5 seconds here is whatever `AsyncTimeout` is set to in the configuration options - ultimately, on every heartbeat, we check whether anything at the head of the queue has proceeded past that threshold. The heartbeat is adjustable via `HeartbeatInterval`.
Docs for the configs I describe above are here: https://stackexchange.github.io/StackExchange.Redis/Configuration
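To make that concrete, adjusting those options for a workload like the bulk HNSW indexing described above might look like this (the endpoint and the specific values are illustrative, not recommendations):

```csharp
using StackExchange.Redis;

var options = ConfigurationOptions.Parse("localhost:6379"); // endpoint is illustrative

options.AsyncTimeout = 30_000;      // allow slow commands 30s instead of the 5s default
options.SyncTimeout = 30_000;       // same for synchronous calls
options.AbortOnConnectFail = false; // keep retrying the initial connection instead of throwing
options.HeartbeatInterval = TimeSpan.FromSeconds(1); // how often the timeout check runs

var muxer = await ConnectionMultiplexer.ConnectAsync(options);
```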
Importantly, we offer no inherent retry mechanism per command; we only buffer and first-try commands during disconnect events. This is because the desired behavior of a retry varies wildly - we recommend using Polly or similar for that level of retry. The StackExchange.Redis aim is to have a connection open and available for you to use, as often as possible.
If context helps: given the multiplexed nature of our pipeline connection, retries aren't something we can reasonably do in a universal way, so we don't pretend to do so. Unlike HTTP, where you get feedback about each request independently, in a multiplexed ordered pipeline scenario we don't know what did or didn't reach the remote endpoint - only that we sent X commands and none of them made it back. They could have dropped on the way there or on the way back.
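Following the Polly recommendation, an app-level retry around the failing call from the original question could be sketched like this (the retry count, backoff, and the exact `SaveReferenceAsync` arguments are assumptions on my part):

```csharp
using Polly;
using StackExchange.Redis;

// Retry on "LOADING ..." (server still loading the dataset) and on timeouts,
// with exponential backoff; the counts and delays here are illustrative.
var retryPolicy = Policy
    .Handle<RedisServerException>(ex => ex.Message.StartsWith("LOADING"))
    .Or<RedisTimeoutException>()
    .WaitAndRetryAsync(
        retryCount: 5,
        sleepDurationProvider: attempt => TimeSpan.FromSeconds(Math.Pow(2, attempt)));

await retryPolicy.ExecuteAsync(() =>
    semanticTextMemory.SaveReferenceAsync(
        collection: "docs",                 // argument names/values assumed for illustration
        text: "some text to embed",
        externalId: "doc-1",
        externalSourceName: "my-source"));
```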
This issue is stale because it has been open for 90 days with no activity.
This issue was closed because it has been inactive for 14 days since being marked as stale.