mbuf_split() splits by copying data in quadratic instead of linear time
Hi All
We are looking to use TwemProxy with Redis for sharding. We have use cases where we may need to fetch about 10k keys in one go from across multiple shards. However, when I try this with TwemProxy on a test setup (described below), it takes about 1.7 seconds to return. If I fire the same request on a single Redis instance directly, it returns in about 16ms.
Would someone know if I have missed some obvious performance tuning option? Any ideas on what we could do to improve this number? We need mget response time to be in the sub-300ms range.
Note that the hosts are shared with other processes (and not dedicated to my performance tests). I am giving the details below to give an indication of the leftover capacity.
Setup Details-
Host One- 8 core, 3.3 GHz (5% utilization during test run), 16GB RAM (6GB available); hosts two redis instances and a TwemProxy instance
Host Two- 8 core, 3.3 GHz (5% utilization during test run), 16GB RAM (9.5GB available); hosts two redis instances
TwemProxy config-
hash: murmur
distribution: ketama
servers:
- host1:6379:1
- host1:6380:1
- host2:6379:1
- host2:6380:1
Thank you
10k is crazy, but try this:
- use -m 1024 or -m 2048 and,
- delete the following lines https://github.com/twitter/twemproxy/blob/master/src/nc_message.c#L27-L31 and redefine #define NC_IOV_MAX 1024 or #define NC_IOV_MAX 2048
Rerun your test case and see what latency numbers you get
Thanks for your response, Manju. I will try out these settings and let you know how it performs.
I have a couple more questions I need your help with.
1. I have a test TwemProxy setup with two redis instances behind it. I have configured auto_eject_hosts: true and distribution: ketama. If my understanding is correct, with this configuration, if either of the hosts goes down, the client doesn't see any connection error. All that will happen is that the keys which were earlier hashed to the redis instance that was shut down will now map to the other host, resulting in some key lookups returning null (even though I had set values against them before bringing down the Redis instance). Is this understanding correct? Contrary to this, what I see is an error "(error) ERR Connection refused". Could you clarify whether this is the expected behavior, and if so, what "auto_eject_hosts: true" really means?
2. I am using the Jedis client for connecting to Redis. With TwemProxy in between, my client is actually connecting to the TwemProxy instance. I noticed that with this, I can't use Redis pipelines; I get an "ERR Connection refused" error. If I change my Jedis client config to connect to the Redis instance directly, my pipeline code works fine. Is it that Redis pipelines do not work with TwemProxy? Or is it an issue with my client library (Jedis)? In case TwemProxy doesn't support it, what would be an efficient way to implement a multi-key get for Redis hashes (as it is not supported out of the box by Redis)?
Thank you for all the help.
Neelesh
For (1), see this: https://github.com/twitter/twemproxy/blob/master/notes/recommendation.md#liveness and this: https://github.com/twitter/twemproxy/blob/master/notes/recommendation.md#timeout. Let me know if you still have questions.
For (2), I think jedis sends a PING command on every new connection before sending data. Twemproxy does not support PING yet. The fix would be either to change jedis so it does not send a PING on every new connection, or to support a "pseudo" PING in twemproxy. Both fixes should be fairly trivial.
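To make the "pseudo" PING idea concrete, here is a minimal, hypothetical sketch (is_bare_ping and everything around it are made up for illustration and are not twemproxy's actual code): the proxy would recognize a bare RESP PING on a client connection and answer +PONG locally instead of forwarding it to a backend.

/* Hypothetical sketch: short-circuit a bare RESP PING with a local +PONG. */
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <strings.h>

/* Returns true if the client buffer holds a bare PING, either inline
 * ("PING\r\n") or multi-bulk ("*1\r\n$4\r\nPING\r\n"). */
static bool
is_bare_ping(const char *buf, size_t len)
{
    static const char inline_ping[] = "PING\r\n";
    static const char bulk_ping[] = "*1\r\n$4\r\nPING\r\n";

    return (len == sizeof(inline_ping) - 1 &&
            strncasecmp(buf, inline_ping, len) == 0) ||
           (len == sizeof(bulk_ping) - 1 &&
            strncasecmp(buf, bulk_ping, len) == 0);
}

int
main(void)
{
    const char req[] = "*1\r\n$4\r\nPING\r\n";   /* multi-bulk PING, as a client library would send it */

    if (is_bare_ping(req, sizeof(req) - 1)) {
        /* a real proxy would write this back on the client connection */
        fputs("+PONG\r\n", stdout);
    }
    return 0;
}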
try this patch:
diff --git a/src/nc_message.c b/src/nc_message.c
index 654cdf9..84ef8e5 100644
--- a/src/nc_message.c
+++ b/src/nc_message.c
@@ -24,11 +24,7 @@
 #include <nc_server.h>
 #include <proto/nc_proto.h>
-#if (IOV_MAX > 128)
-#define NC_IOV_MAX 128
-#else
 #define NC_IOV_MAX IOV_MAX
-#endif
and try -m with values from this set: 512, 1024, 2048, 4096, 8192
I have a hunch that -m 512 with an IOV max (NC_IOV_MAX) somewhere between 128 and 1024 should give you the best results
@neeleshkorade any updates on this?
Hi Manju,
Of the three questions I raised under this issue, the one about pipelines not working with the TwemProxy+Jedis combination has been resolved. It turned out that I was also using transactions, which are not supported by TwemProxy.
I also figured out how to handle Redis instances going down, by having a retry count on the client side that is higher than server_failure_limit. I am implementing this in the code at the moment.
I am still working on the last one (performance issue with multiget using large number of keys). Will update you on how it performs with the patch you have suggested by tomorrow.
Hi Manjuraj,
I tried the config changes you suggested along with the code patch to see how multi-key get performs. Here's what I observed-
With -m 512, I got the best results. With this, multi-key get on 10k keys returned in about 750ms (against 1650ms with default options, i.e. membuf size of 16k).
By progressively increasing the membuf size to 1024, 2048, 4096 and 8192, the response time went up in that order-
- for -m 1024, it returned in 1020 ms
- for -m 2048, it returned in 1200 ms
- for -m 4096, it returned in 1350 ms
- for -m 8192, it returned in 1420 ms
Note-
- These are averages of three readings
- For multi-get with a smaller number of keys (512), the improvement was only marginal (about 8%). With the default config, it returned in about 24ms. With -m 512, it took 22ms.
- With the code patch you suggested, the performance degraded in all cases. For example, with default config (-m 16k), multi-get for 10k keys returned in 3700 ms.
Question for you- I see that with smaller membuf sizes, multi-key get performed better. However, the nutcracker documentation at https://github.com/twitter/twemproxy#zero-copy says that a larger membuf size reduces syscalls, which made me believe it should lead to better performance. This is contrary to my observation in these tests. Could you clarify?
@neeleshkorade sorry for the delay.
A large buf size means fewer syscalls from writev() and read(), because you are doing more in one syscall. However, a large buf size also means more copying overhead because of the way mbuf_split() is implemented.
Here are the details: let's say your read syscall reads, say, 10 commands; we need to split them across 10 mbufs (actually we split the 10 commands across 10 struct msg, and each struct msg has at least one struct mbuf). This is done by mbuf_split() - https://github.com/twitter/twemproxy/blob/master/src/nc_mbuf.c#L228. Unfortunately, mbuf_split() is quadratic in complexity.
For example, if my input buffer from the read syscall contains 10 messages = [1, 2, 3, 4, 5, 6, ... 10], we leave the existing message "1" in its current mbuf and copy messages [2, 3, 4, 5, ... 10] to a new mbuf. Once message "1" is processed, we then copy messages [3, 4, 5, 6, ... 10] to a new mbuf, and so on. So, to split messages [1, 2, 3, ... 10] across 10 mbufs, we are doing quadratic instead of linear copies. This is really unfortunate, and I think if we fix this, twemproxy will consume less CPU. The workaround for now is that for scenarios where you need high throughput and low CPU utilization, you use --mbuf-size=512.
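For intuition, here is a standalone toy illustration of that cost (this is not twemproxy code, just arithmetic on the pattern described above): with n messages in one read buffer, the repeated tail copies move (n-1) + (n-2) + ... + 1 messages in total, i.e. O(n^2), versus n copies if each message were moved exactly once.

/* Toy illustration of the quadratic tail-copy pattern described above. */
#include <stdio.h>

int
main(void)
{
    const long n = 10;            /* messages arriving in one read() buffer */
    long tail_copies = 0;

    /* mimic the split pattern: after processing message i, the remaining
     * (n - i - 1) messages are copied into a fresh mbuf */
    for (long i = 0; i < n; i++) {
        tail_copies += n - i - 1;
    }

    printf("tail-copy splits:  %ld message copies\n", tail_copies);  /* 45 */
    printf("single-pass split: %ld message copies\n", n);            /* 10 */
    return 0;
}

Smaller mbufs do not change the pattern, but they bound how many bytes each of those copies can move, which is presumably why -m 512 came out ahead in the numbers above.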
If I'm reading this right, this is only a performance issue with pipelined commands (and common redis clients make pipelining convenient)
I'd wonder if it'd make sense to conditionally swap the pointers with the "new" mbuf when the portion destined for the new message (nmsg) is the larger one in mbuf_split - that would hopefully make it take less time.
(i.e. replace the old mbuf's pointers with the smaller copy at the beginning, and use the old pointers with the larger size for the new mbuf)
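For what it's worth, here is a hypothetical sketch of that swap idea on a simplified mbuf-like struct (the struct, its field names, and the sbuf_split signature are all made up for illustration; this is not twemproxy's struct mbuf or mbuf_split): it copies only the smaller of the two halves and swaps the buffers' roles whenever the head is the smaller side, so the larger tail never moves.

/* Hypothetical sketch of the "swap the pointers" idea. Data lives in
 * [pos, last); splitting at addr copies min(head, tail) bytes.
 * Error handling omitted for brevity. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct sbuf {
    char *start;   /* start of underlying storage */
    char *end;     /* end of underlying storage */
    char *pos;     /* first byte of unconsumed data */
    char *last;    /* one past the last byte of data */
};

struct sbuf *
sbuf_get(size_t size)
{
    struct sbuf *b = malloc(sizeof(*b) + size);
    b->start = (char *)(b + 1);
    b->end = b->start + size;
    b->pos = b->last = b->start;
    return b;
}

/* Split *pb at addr (pos <= addr <= last): afterwards *pb holds the head
 * [pos, addr) and the returned buffer holds the tail [addr, last). */
struct sbuf *
sbuf_split(struct sbuf **pb, char *addr)
{
    struct sbuf *b = *pb;
    struct sbuf *nb = sbuf_get((size_t)(b->end - b->start));
    size_t head = (size_t)(addr - b->pos);
    size_t tail = (size_t)(b->last - addr);

    if (tail <= head) {
        /* tail is the smaller side: copy it out, as the current code always does */
        memcpy(nb->last, addr, tail);
        nb->last += tail;
        b->last = addr;
        return nb;
    }

    /* head is the smaller side: copy only the head, leave the tail in
     * place, and swap which buffer plays which role */
    memcpy(nb->last, b->pos, head);
    nb->last += head;
    b->pos = addr;    /* old buffer now holds just the tail, untouched */
    *pb = nb;         /* caller's buffer becomes the (copied) head */
    return b;         /* old buffer is handed back as the tail */
}

int
main(void)
{
    const char data[] = "GET k1\r\nMGET key1 key2 key3\r\n";
    struct sbuf *b = sbuf_get(64);

    memcpy(b->last, data, sizeof(data) - 1);
    b->last += sizeof(data) - 1;

    /* split after the first command: the 8-byte head is smaller than the
     * 21-byte tail, so only 8 bytes get copied */
    struct sbuf *tail = sbuf_split(&b, b->pos + 8);

    printf("head: %.*s", (int)(b->last - b->pos), b->pos);
    printf("tail: %.*s", (int)(tail->last - tail->pos), tail->pos);

    free(b);
    free(tail);
    return 0;
}

In the real code the subtlety would be that anything already holding a pointer to the original mbuf (or relying on which msg owns it) has to be updated after the swap, which is presumably why it isn't a one-line change.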