newrelic-dotnet-agent

Instrumentation - Stackexchange.Redis causes major performance problems

Open jifeingo opened this issue 4 years ago • 8 comments

Description StackExchange.Redis uses multiplexing to bundle multiple user calls together. The goal is to reduce network traffic by using as few packets as possible. Here is some information about StackExchange multiplexers.

It seems that the agent's instrumentation results in far fewer requests being multiplexed (if any). As a result, there is a dramatic increase in network traffic, causing a performance problem.
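To illustrate the multiplexing behavior being lost (a simplified sketch of normal usage, not the library's internals): when many async callers issue commands concurrently, the single shared connection can pipeline them into a handful of packets.

```csharp
using System.Linq;
using System.Threading.Tasks;
using StackExchange.Redis;

public static class MultiplexingDemo
{
    // Illustrative only: fire many GETs concurrently so the shared
    // connection can pipeline them rather than sending one packet per call.
    public static async Task<string[]> GetManyAsync(IDatabase db, string[] keys)
    {
        // Each call returns a Task immediately; the multiplexer
        // interleaves the pending commands on one socket.
        var pending = keys.Select(k => db.StringGetAsync(k)).ToArray();
        var values = await Task.WhenAll(pending);
        return values.Select(v => (string)v).ToArray();
    }
}
```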

Example Code: This method may be called 3000 times to fulfill a single customer request, which demonstrates the performance degradation when the agent is attached.

public async Task<string> GetValueAsync(string key)
{
    // Bail out early if the multiplexer has lost its connection.
    if (!_connection.IsConnected)
    {
        return null;
    }

    var db = _connection.GetDatabase();

    // PreferSlave (renamed PreferReplica in v2.x) routes the read to a replica when available.
    var val = await db.StringGetAsync(key, flags: CommandFlags.PreferSlave);

    if (val.HasValue)
    {
        // RedisValue converts implicitly to string.
        return val;
    }

    return null;
}

Expected Behavior No or minimal performance impact when instrumenting Redis

Steps to Reproduce https://source.datanerd.us/dotNetAgent/dotnet_agent/compare/redis-repro TODO: This should be put in github (not ghe).

Adds a /Redis/DoALotOfRedisStuff endpoint to an ASP.NET Core application. This endpoint makes 3000 calls to a Redis instance. Using Wireshark we monitored network traffic to the Redis server with and without the agent enabled.

  • Without the agent enabled we observed ~200 network packets from our host → the Redis server
  • With the agent enabled we observed over ~4000 network packets from our host → the Redis server
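For reference, the repro endpoint can be sketched roughly like this (names and the 3000-call count come from the description above; the internal repro app may differ):

```csharp
using System.Threading.Tasks;
using Microsoft.AspNetCore.Mvc;
using StackExchange.Redis;

[ApiController]
[Route("Redis")]
public class RedisController : ControllerBase
{
    private readonly IConnectionMultiplexer _connection;

    public RedisController(IConnectionMultiplexer connection) => _connection = connection;

    // Issues 3000 GETs in one request, mirroring the traffic pattern
    // that was measured with Wireshark.
    [HttpGet("DoALotOfRedisStuff")]
    public async Task<IActionResult> DoALotOfRedisStuff()
    {
        var db = _connection.GetDatabase();
        for (var i = 0; i < 3000; i++)
        {
            await db.StringGetAsync($"key-{i}");
        }
        return Ok("done");
    }
}
```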

Additional context

Repro Test App Internal

jifeingo avatar Oct 07 '20 18:10 jifeingo

OTel .NET SDK and Auto Instrumentation repo are handling the instrumentation of Redis differently. This is a note to review the OTel implementation as a potential replacement for our current implementation.

angelatan2 avatar Jan 22 '21 22:01 angelatan2

Was any progress made on this issue? It looks like it's been bouncing around quite a bit through different milestones, and it would be nice to know if a fix is coming 😄

storey247 avatar Mar 22 '22 17:03 storey247

Up

darkleaf avatar Sep 01 '22 16:09 darkleaf

https://issues.newrelic.com/browse/NEWRELIC-3504

@angelatan2 any update on this, please? Any ETA, etc.?

storey247 avatar Sep 15 '22 10:09 storey247

@storey247 Thanks for reaching out. Do you have further or more recent information on how the New Relic agent impacts the performance of your application, and which aspects of performance were affected? A brief description of what your application is doing when you see the performance impact rising is also useful.

We intend to look into this bug in the upcoming milestone barring any unexpected issues.
@chynesNR

angelatan2 avatar Sep 15 '22 15:09 angelatan2

I have been working on a fix for this and have some information as well as an experimental build for those interested.

Some very quick background: our CLR profiler (C++ based) reads the extension XML files to determine which methods it should inject with bytecode. This bytecode calls into our managed agent (C# based), which can perform additional checks on whether and how we should handle the method - these are the various "wrappers".

Currently, our agent instruments StackExchange.Redis.ConnectionMultiplexer.ExecuteSyncImpl and StackExchange.Redis.ConnectionMultiplexer.ExecuteAsyncImpl to capture details about StackExchange.Redis operations. These methods exist in both v1.x and v2.x of StackExchange.Redis. After a bit of testing, it appears that just adding our basic bytecode to those methods causes a change in how the ConnectionMultiplexer behaves. Instead of combining the operations into larger packets, it sends them in much smaller chunks, often one operation per packet. This change occurs whether or not a wrapper is selected to gather data about the method.

Starting in StackExchange.Redis version 2, a built-in profiling system was added that uses callbacks with automated sessions (1.x requires custom classes and manually managed sessions). Using this system, we can instrument a small part of the connection setup to capture the ConnectionMultiplexer and register our profiling callback. Since this occurs infrequently, there is little performance impact and the ConnectionMultiplexer mostly retains its normal behavior. Instead of creating each StackExchange.Redis operation segment in real time, we can check for any active profiling sessions and create the segments in a batch - still at the correct location in the trace.
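For anyone curious, the v2 profiling API referred to here looks roughly like this (a minimal standalone sketch, not the agent's actual wrapper code):

```csharp
using System;
using StackExchange.Redis;
using StackExchange.Redis.Profiling;

class ProfilingSketch
{
    static void Main()
    {
        var muxer = ConnectionMultiplexer.Connect("localhost:6379");

        // One session per logical unit of work (e.g. per transaction/trace).
        var session = new ProfilingSession();
        muxer.RegisterProfiler(() => session);

        var db = muxer.GetDatabase();
        db.StringSet("k", "v");
        db.StringGet("k");

        // After the work completes, drain the session in a batch and
        // turn each profiled command into a segment at the right spot.
        foreach (IProfiledCommand cmd in session.FinishProfiling())
        {
            Console.WriteLine($"{cmd.Command} took {cmd.ElapsedTime}");
        }
    }
}
```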

This new system has a few benefits that mesh well with our agent. Because the profiling system is automated and callback-based, the agent can intelligently determine when a new session is needed and when an existing one will do. We only need to check whether our segment has a session, using a fast conditional at the end of the segment. To make it even more efficient, we only perform this check if the storage mechanism we use for sessions has been created - no StackExchange.Redis, no expensive work in segments. The v1.x profiling system has few of these benefits: since we would have to start profiling sessions manually, we would need to do so for every segment once we know StackExchange.Redis is present, because we have no way to know when an operation will occur.
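As an illustration of that "fast conditional" idea (purely hypothetical code, not the agent's implementation): keep the session store lazily initialized, so segments pay only a null check unless StackExchange.Redis has actually been seen.

```csharp
using System.Threading;
using StackExchange.Redis.Profiling;

// Hypothetical sketch of a lazily created, per-async-context session store.
static class RedisSessionStore
{
    // Stays null until Redis instrumentation first runs, so the
    // per-segment check below is a single null test in the common case.
    private static AsyncLocal<ProfilingSession> _current;

    // Called once when the ConnectionMultiplexer is first captured.
    public static void Enable() => _current ??= new AsyncLocal<ProfilingSession>();

    public static ProfilingSession CurrentOrNull => _current?.Value;
}
```

A segment's end would then do something like `if (RedisSessionStore.CurrentOrNull is { } session) { /* harvest profiled commands */ }`, which costs essentially nothing when Redis is absent.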

Moving back to the new system: this way of capturing data does not incur the same penalties as the previous method, since we instrument only the initial connection rather than each operation individually. The number of packets still increases, but it is much more reasonable. Here is a quick chart showing how 3000 operations in a single transaction from a WebAPI app behave as seen in Wireshark:

| Measurement | No agent | Old agent | New/fixed agent |
| --- | --- | --- | --- |
| Packets | 148 | 3,646 | 355 |
| Average packet size (B) | 550 | 84 | 267 |

Unfortunately, the old instrumentation and the new method cannot currently coexist, since merely adding the bytecode causes the breakdown. Making this more problematic, the automated, callback-based profiling system exists only in v2.x of StackExchange.Redis. This means we either support both versions and accept the overhead, or support only 2.x with much better overhead.

The attached installers are built from my branch in our CI and are being offered as an experimental build so that we can gather feedback on the new approach. As they are still considered experimental, please do not use them in production.

stackexchange.redis-windows.zip stackexchange.redis-linux.zip

jaffinito avatar Oct 03 '22 17:10 jaffinito

Hello @darkleaf, and @storey247, thank you for your patience on this ticket. Per the comments above, we would appreciate it if you could try the experimental build that was linked above and share your feedback. Information on how the experimental release performs in your environment will help us understand if this approach improves the situation for your product.

angelatan2 avatar Oct 03 '22 19:10 angelatan2

Hello @jaffinito. We have load-tested our app with your experimental build. The experimental agent reduced CPU and network utilization. It should also be noted that with the experimental agent, the network load decreased unexpectedly.

  1. no agent
  2. regular agent
  3. experimental agent

[Attached charts: CPU usage (cpu_1–cpu_3) and bytes sent (bytes_sent_1–bytes_sent_3) for each of the three scenarios]

darkleaf avatar Dec 15 '22 13:12 darkleaf

Hi @darkleaf - thanks for running that test for us! I'm glad to hear the performance has improved. We hope to have the change in an official release early next year. Regarding the lower bandwidth -- am I understanding you correctly that the test build uses less bandwidth than the application with no Agent installed? That certainly would be unexpected. Are you still seeing all the data you expect, and is the application functioning correctly?

chynesNR avatar Dec 16 '22 23:12 chynesNR

@chynesNR Correct, the test build used less bandwidth than the application with no agent installed. I didn't participate in the testing process myself, but I haven't heard about any errors from our team.

PS We used the same payload for all tests.

darkleaf avatar Jan 13 '23 13:01 darkleaf

Work has been completed on this issue.