StringGet randomly hangs under heavy load

Open Hermholtz opened this issue 1 year ago • 2 comments

Hi! I wrote some quick-and-dirty benchmark code and observed that, as the number of concurrent in-flight queries issued through the callback-based StringGet method increases, the chance of it hanging increases as well. During the hang I see high CPU usage on the machine running the test code, and execution is stuck somewhere inside StringGet. That's all I know for now :-)

There is also a chance that StringSet hangs or completes very slowly.

I think some internal structure in the Garnet client is getting overloaded, or perhaps some exponential complexity in the client is kicking in; I'm not sure. Either way, increasing the value of NUM_TASKS in the code below makes the hang more likely, while decreasing it clearly improves the odds of timely, performant execution.

Please check it and fix if necessary.

The code below was copied from Visual Studio and uses top-level statements.

/*
Launch Garnet server with:
$ cd main\GarnetServer
$ dotnet run -c Release -f net8.0
*/

using Garnet.client;
using System.Diagnostics;

Console.WriteLine("Connecting to Garnet");

using GarnetClient client = new GarnetClient("127.0.0.1", 3278);

client.Connect();

await client.ExecuteForStringResultAsync("FLUSHDB");

Console.WriteLine("Inserting data");

//int NUM_TASKS = Environment.ProcessorCount / 2;
int NUM_TASKS = 24;
var tasks = new List<Task>();

const int NUM_KEYS = 1_000_000;
const string VALUE_TO_INSERT = "value1";
long globalCounter = 0;

Stopwatch sw = Stopwatch.StartNew();

for (int task_id = 0; task_id < NUM_TASKS; task_id++)
{
    var task = Task.Factory.StartNew(task_id_obj =>
    {
        CountdownEvent cdeForSetOperation = new CountdownEvent(NUM_KEYS);
        int local_task_id = (int)task_id_obj!;

        for (int i = 0; i < NUM_KEYS; i++)
        {
            string key = $"key_{local_task_id}_{i}";
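            // Fire-and-forget set: the callback is null and the countdown is
            // signalled immediately below, so the Wait after the loop does not
            // actually track server replies.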
            client.StringSet(key, VALUE_TO_INSERT, null);
            cdeForSetOperation.Signal();
        }

        cdeForSetOperation.Wait();
        Interlocked.Add(ref globalCounter, NUM_KEYS);
    }, task_id);
    tasks.Add(task);
}

Task.WaitAll(tasks.ToArray());

sw.Stop();
Console.WriteLine($"Insertion completed in {sw.Elapsed}, iterations per seconds: {decimal.Round(1000m * NUM_KEYS * NUM_TASKS / sw.ElapsedMilliseconds, 1)}, inserted {globalCounter} values");

Console.WriteLine("Querying data");

tasks.Clear();
await Task.Delay(1000);

sw = Stopwatch.StartNew();
const int NUM_ITERATIONS = 1_000_000;

// A dummy counter to do something with the value from the db
long totalLength = 0;

for (int task_id = 0; task_id < NUM_TASKS; task_id++)
{
    var task = Task.Factory.StartNew(() =>
    {
        long length = 0;
        CountdownEvent cdeForQuery = new CountdownEvent(NUM_ITERATIONS);

        for (int i = 0; i < NUM_ITERATIONS; i++)
        {
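            // Callback-based get: the callback signals the countdown once the reply arrives.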
            client.StringGet("key_0_0", (c, v) =>
            {
                Interlocked.Add(ref length, v?.Length ?? 0);
                cdeForQuery.Signal();
            });
        }

        cdeForQuery.Wait();
        Interlocked.Add(ref totalLength, length);
    });

    tasks.Add(task);
}

Task.WaitAll(tasks.ToArray());

sw.Stop();
Console.WriteLine($"Querying completed in {sw.Elapsed}, iterations per seconds: {decimal.Round(1000m * NUM_ITERATIONS * NUM_TASKS / sw.ElapsedMilliseconds, 1)}, total length = {totalLength}, {(totalLength == NUM_TASKS * NUM_ITERATIONS * VALUE_TO_INSERT.Length ? "ok" : "NOT OK")}");

await client.ExecuteForStringResultAsync("FLUSHDB");

Hermholtz, Mar 28 '24 13:03

GarnetClient does have buffer limits that might cause issues under heavy load. Try checking and increasing the parameters in the constructor. As a workaround, you can continue to use clients such as StackExchange.Redis.
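For reference, a minimal sketch of what that might look like in the repro above, assuming sendPageSize and maxOutstandingTasks are the relevant constructor parameters (they are the ones named later in this thread); the values are illustrative, not tuned recommendations:

using Garnet.client;

// Illustrative only: pass larger limits via named arguments so the exact
// parameter order does not matter. Check the GarnetClient constructor in
// your Garnet version for the actual signature and defaults.
using GarnetClient client = new GarnetClient(
    "127.0.0.1", 3278,
    sendPageSize: 1 << 21,          // larger send buffer pages
    maxOutstandingTasks: 1 << 20);  // 1_048_576, the maximum mentioned later in this thread

client.Connect();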

badrishc, Apr 01 '24 01:04

For me it's not a production issue; I'm just letting you know so that you can fix it if needed and make Garnet more stable.

Increasing maxOutstandingTasks to the maximum (1_048_576) seems to help a little, but does not solve the issue.

If this were throttling, I would expect the CPU usage to stay low while the calls simply block, but it is always high when the hang happens. It looks like a lot of spin locks are in use. In a throttling scenario the client should spin briefly and then, if the resource is still unavailable, stop spinning and sleep, as SpinWait does.
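For illustration, a minimal sketch of the spin-then-yield pattern described above, using the standard .NET SpinWait struct (this shows the pattern being suggested, not Garnet's actual internals; resourceAvailable is a hypothetical predicate):

using System;
using System.Threading;

// Spin briefly while the resource is unavailable, then start yielding
// or sleeping instead of burning CPU.
static void WaitUntil(Func<bool> resourceAvailable)
{
    var spinner = new SpinWait();
    while (!resourceAvailable())
    {
        // After a few spins, NextSpinWillYield becomes true and SpinOnce
        // yields the thread (Thread.Yield / Thread.Sleep) instead of spinning.
        spinner.SpinOnce();
    }
}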

Hermholtz, Apr 01 '24 07:04

We have verified that GarnetServer runs fine and remains stable in this scenario.

GarnetClient - our sample C# client included in the repo - is not designed to await millions of tasks in parallel without increasing its default configuration. Increasing maxOutstandingTasks and sendPageSize can alleviate the problem. GarnetClient's default configuration parameters are tuned for low task parallelism per client session. When a million tasks execute in parallel, the output buffer fills up and all the individual tasks start waiting for space in the buffer. When space finally opens up, all the waiting tasks are released at once and retry, which explains the CPU usage you see. For such a scenario, using StackExchange.Redis as your client would be preferable.
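Independently of the client defaults, one caller-side mitigation is to bound the number of in-flight requests per client. A minimal sketch below reuses the querying loop from the repro above; MAX_IN_FLIGHT is an illustrative value, not a recommendation from the Garnet team:

// Sketch: a SemaphoreSlim caps how many callback-based gets are outstanding
// at once, so the client's output buffer is never asked to absorb millions
// of pending requests.
const int MAX_IN_FLIGHT = 4096;
var inFlight = new SemaphoreSlim(MAX_IN_FLIGHT);

for (int i = 0; i < NUM_ITERATIONS; i++)
{
    inFlight.Wait();                      // block while too many gets are pending
    client.StringGet("key_0_0", (c, v) =>
    {
        Interlocked.Add(ref length, v?.Length ?? 0);
        inFlight.Release();               // free a slot when the reply arrives
        cdeForQuery.Signal();
    });
}

cdeForQuery.Wait();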

badrishc, May 07 '24 22:05