
Debugging a performance bottleneck

Open pySilver opened this issue 6 years ago • 9 comments

Hi!

Sorry to post a question-like issue here. I'm still waiting for approval in the FB group, which doesn't seem very active. I'm observing some performance issues and I'm not sure whether this is expected performance or something is wrong.

My hardware setup is the following (all servers are identical: AMD Ryzen 7 1700X, 64 GB DDR4 RAM):

  • 4 app servers (mcrouter and memcached installed on each)
  • 4 memcached servers (no mcrouter here)
  • 4 other unrelated servers (lb, db, etc.)

All the servers in this cluster are located in the same rack, connected via a 1 Gbit switch to a 1 Gbit uplink.

mcrouter config:

{
  "pools": {
    "A": {
      "servers": ["***8 ip:port here***"]
    }
  },
  "route": "PoolRoute|A"
}

Each mcrouter starts with 16 proxies (`--num-proxies` set to %CPU_CORES%).

Here are the worst stats (one server somehow constantly shows much higher latency):

STAT 1.1.1.122:11211:ascii:plain:notcompressed-1000 avg_latency_us:2075.732 pending_reqs:0 inflight_reqs:0 avg_retrans_ratio:0 max_retrans_ratio:0 min_retrans_ratio:0 up:16; deleted:238 found:15081 notfound:8639 stored:1540
STAT 1.1.1.121:11211:ascii:plain:notcompressed-1000 avg_latency_us:1834.307 pending_reqs:0 inflight_reqs:0 avg_retrans_ratio:0 max_retrans_ratio:0 min_retrans_ratio:0 up:16; deleted:391 found:17475 notfound:10973 stored:1702
STAT 1.1.1.126:11211:ascii:plain:notcompressed-1000 avg_latency_us:1859.976 pending_reqs:0 inflight_reqs:0 avg_retrans_ratio:0 max_retrans_ratio:0 min_retrans_ratio:0 up:16; deleted:156 found:18107 notfound:5828 stored:1123
STAT 1.1.1.98:11211:ascii:plain:notcompressed-1000 avg_latency_us:2087.405 pending_reqs:0 inflight_reqs:0 avg_retrans_ratio:0 max_retrans_ratio:0 min_retrans_ratio:0 up:16; deleted:467 found:16376 notfound:4514 stored:1196
STAT 1.1.1.120:11211:ascii:plain:notcompressed-1000 avg_latency_us:1488.671 pending_reqs:0 inflight_reqs:0 avg_retrans_ratio:0 max_retrans_ratio:0 min_retrans_ratio:0 up:16; deleted:199 found:16742 notfound:7349 notstored:933 stored:572
STAT 1.1.1.125:11211:ascii:plain:notcompressed-1000 avg_latency_us:1707.719 pending_reqs:0 inflight_reqs:0 avg_retrans_ratio:0 max_retrans_ratio:0 min_retrans_ratio:0 up:16; deleted:410 found:13755 notfound:6009 stored:1421
STAT 1.1.1.123:11211:ascii:plain:notcompressed-1000 avg_latency_us:325.893 pending_reqs:0 inflight_reqs:0 avg_retrans_ratio:0 max_retrans_ratio:0 min_retrans_ratio:0 up:16; deleted:1374 found:13631 notfound:7603 stored:3282
STAT 1.1.1.124:11211:ascii:plain:notcompressed-1000 avg_latency_us:1809.246 pending_reqs:0 inflight_reqs:0 avg_retrans_ratio:0 max_retrans_ratio:0 min_retrans_ratio:0 up:16; deleted:116 found:15838 notfound:10728 stored:502
END

This one is localhost; typical avg latency is ~300 us:

STAT 1.1.1.123:11211:ascii:plain:notcompressed-1000 avg_latency_us:325.893 pending_reqs:0 inflight_reqs:0 avg_retrans_ratio:0 max_retrans_ratio:0 min_retrans_ratio:0 up:16; deleted:1374 found:13631 notfound:7603 stored:3282

The other app servers show much better results:

STAT 1.1.1.122:11211:ascii:plain:notcompressed-1000 avg_latency_us:901.975 pending_reqs:0 inflight_reqs:0 avg_retrans_ratio:0 max_retrans_ratio:0 min_retrans_ratio:0 up:16; deleted:202500 found:20982648 notfound:11855563 notstored:3569283 stored:490530 connect_error:1
STAT 1.1.1.98:11211:ascii:plain:notcompressed-1000 avg_latency_us:877.655 pending_reqs:0 inflight_reqs:0 avg_retrans_ratio:0 max_retrans_ratio:0 min_retrans_ratio:0 up:16; deleted:111172 found:21463877 notfound:7890507 notstored:4573362 stored:406883 connect_error:1
STAT 1.1.1.126:11211:ascii:plain:notcompressed-1000 avg_latency_us:753.287 pending_reqs:0 inflight_reqs:1 avg_retrans_ratio:0 max_retrans_ratio:0 min_retrans_ratio:0 up:16; deleted:347 found:22592492 notfound:1795408 notstored:4613226 stored:297399 connect_error:1
STAT 1.1.1.121:11211:ascii:plain:notcompressed-1000 avg_latency_us:847.644 pending_reqs:0 inflight_reqs:0 avg_retrans_ratio:0 max_retrans_ratio:0 min_retrans_ratio:0 up:16; deleted:107 found:24774641 notfound:2863987 notstored:2814153 stored:294982 connect_error:1
STAT 1.1.1.123:11211:ascii:plain:notcompressed-1000 avg_latency_us:924.621 pending_reqs:0 inflight_reqs:0 avg_retrans_ratio:0 max_retrans_ratio:0 min_retrans_ratio:0 up:16; deleted:20729 found:22674035 notfound:2171071 notstored:2852931 stored:313937 connect_error:1
STAT 1.1.1.120:11211:ascii:plain:notcompressed-1000 avg_latency_us:806.208 pending_reqs:0 inflight_reqs:0 avg_retrans_ratio:0 max_retrans_ratio:0 min_retrans_ratio:0 up:16; deleted:130 found:22320523 notfound:673697 notstored:5975115 stored:293596 connect_error:1
STAT 1.1.1.125:11211:ascii:plain:notcompressed-1000 avg_latency_us:728.010 pending_reqs:0 inflight_reqs:0 avg_retrans_ratio:0 max_retrans_ratio:0 min_retrans_ratio:0 up:16; deleted:968459 found:22461020 notfound:40680146 notstored:2740194 stored:1183403 connect_error:1
STAT 1.1.1.124:11211:ascii:plain:notcompressed-1000 avg_latency_us:81.307 pending_reqs:0 inflight_reqs:0 avg_retrans_ratio:0 max_retrans_ratio:0 min_retrans_ratio:0 up:16; deleted:2417 found:23566477 notfound:827452 notstored:4053273 stored:295365 connect_error:1
END

So the average is around 800 us, and localhost latency is ~80 us.

I haven't seen any performance difference between when there is almost no traffic (loadavg 1-2) and now, when mcrouter eats a lot of CPU: (screenshot) http://take.ms/imZ88

I've made a logger for all my mcrouter interactions from the PHP app, and typically it looks like this:

Command: `get` called 301 times in 0.30392s with total payload of 6.58MB
Command: `delete` called 73 times in 0.03787s with total payload of 0B
Command: `set` called 21 times in 0.01144s with total payload of 4.49kB
Command: `add` called 59 times in 0.11544s with total payload of 126.74kB

Isn't that quite slow?
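A quick back-of-envelope check on those log lines (a rough sketch, using only the figures from the log above):

```python
# Figures taken from the app log above.
get_calls = 301
get_total_s = 0.30392

# Average wall time per sequential get.
per_get_ms = get_total_s / get_calls * 1000
print(f"avg per get: {per_get_ms:.2f} ms")  # ~1 ms per get
```

At roughly 1 ms per round trip, 536 memcache calls would account for ~0.5 s on their own, so per-call latency, not payload size, is what dominates.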

I'm connecting to mcrouter this way:

$this->m = new \Memcached();
$this->m->addServer('127.0.0.1', 5000, 0);
$this->m->setOption(\Memcached::OPT_LIBKETAMA_COMPATIBLE, true);
$this->m->setOption(\Memcached::OPT_COMPRESSION, true);

I've tried disabling compression, but without noticeable changes.

My APM monitoring shows even worse numbers for some requests:

Database calls (15): 51.0 ms | HTTP calls (1): 332 ms | **Memcache calls (536): 627 ms** | File I/O (1.8%): 28.0 ms

What might be wrong?

pySilver avatar Jan 02 '19 20:01 pySilver

@pySilver - I would suggest the following to narrow down the problem: run only 4 memcached servers, with no overlap with the app servers (no mcrouter and memcached running together); include the slow one in those four, of course.

orishu avatar Jan 02 '19 22:01 orishu

@orishu I will try that and post back

pySilver avatar Jan 02 '19 22:01 pySilver

@orishu ok, I've done what you recommended: the app servers kept mcrouter, with no memcached on localhost. Here are some results from the slowest server (latency for the others did not change and stayed at ~800 us):

STAT x.x.x.122:11211:ascii:plain:notcompressed-1000 avg_latency_us:2993.472 pending_reqs:0 inflight_reqs:0 avg_retrans_ratio:0 max_retrans_ratio:0 min_retrans_ratio:0 up:16; deleted:689 found:40028 notfound:23521 notstored:2553 stored:4551
STAT x.x.x.121:11211:ascii:plain:notcompressed-1000 avg_latency_us:3300.188 pending_reqs:0 inflight_reqs:0 avg_retrans_ratio:0 max_retrans_ratio:0 min_retrans_ratio:0 up:16; deleted:1070 found:45706 notfound:29332 notstored:2247 stored:4911
STAT x.x.x.98:11211:ascii:plain:notcompressed-1000 avg_latency_us:3415.964 pending_reqs:0 inflight_reqs:0 avg_retrans_ratio:0 max_retrans_ratio:0 min_retrans_ratio:0 up:16; deleted:1291 found:43277 notfound:12779 notstored:2427 stored:3692
STAT x.x.x.120:11211:ascii:plain:notcompressed-1000 avg_latency_us:3654.085 pending_reqs:0 inflight_reqs:0 avg_retrans_ratio:0 max_retrans_ratio:0 min_retrans_ratio:0 up:16; deleted:539 found:44172 notfound:20275 notstored:4962 stored:1920
END

I've also run a test with iperf3 from all app servers to one of the memcached servers. Surprisingly, the app server with the highest latency showed the best throughput at ~830 Mbps, while the others ended up with 600-700 Mbps...

$ iperf3 -c x.x.x.122 -p 5201
Connecting to host x.x.x.122, port 5201
[  4] local x.x.x.123 port 51496 connected to x.x.x.122 port 5201
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[  4]   0.00-1.00   sec  97.8 MBytes   820 Mbits/sec    0    161 KBytes
[  4]   1.00-2.00   sec  99.0 MBytes   831 Mbits/sec    0    174 KBytes
[  4]   2.00-3.00   sec  99.2 MBytes   832 Mbits/sec    0    208 KBytes
[  4]   3.00-4.00   sec  99.1 MBytes   831 Mbits/sec    0    208 KBytes
[  4]   4.00-5.00   sec  99.0 MBytes   831 Mbits/sec    0    208 KBytes
[  4]   5.00-6.00   sec  98.9 MBytes   830 Mbits/sec    0    208 KBytes
[  4]   6.00-7.00   sec  99.1 MBytes   831 Mbits/sec    0    208 KBytes
[  4]   7.00-8.00   sec  99.1 MBytes   831 Mbits/sec    0    208 KBytes
[  4]   8.00-9.00   sec  99.1 MBytes   831 Mbits/sec    0    208 KBytes
[  4]   9.00-10.00  sec  99.0 MBytes   830 Mbits/sec    0    208 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00  sec   989 MBytes   830 Mbits/sec    0             sender
[  4]   0.00-10.00  sec   989 MBytes   830 Mbits/sec                  receiver

Ping from the slowest app server to the memcached box:

PING x.x.x.122 (x.x.x.122) 56(84) bytes of data.
64 bytes from x.x.x.122: icmp_seq=1 ttl=63 time=0.129 ms
64 bytes from x.x.x.122: icmp_seq=2 ttl=63 time=0.116 ms
64 bytes from x.x.x.122: icmp_seq=3 ttl=63 time=0.141 ms
64 bytes from x.x.x.122: icmp_seq=4 ttl=63 time=0.098 ms
64 bytes from x.x.x.122: icmp_seq=5 ttl=63 time=0.140 ms
64 bytes from x.x.x.122: icmp_seq=6 ttl=63 time=0.136 ms
64 bytes from x.x.x.122: icmp_seq=7 ttl=63 time=0.131 ms
64 bytes from x.x.x.122: icmp_seq=8 ttl=63 time=0.113 ms
64 bytes from x.x.x.122: icmp_seq=9 ttl=63 time=0.105 ms
64 bytes from x.x.x.122: icmp_seq=10 ttl=63 time=0.104 ms
64 bytes from x.x.x.122: icmp_seq=11 ttl=63 time=0.110 ms
64 bytes from x.x.x.122: icmp_seq=12 ttl=63 time=0.104 ms
64 bytes from x.x.x.122: icmp_seq=13 ttl=63 time=0.100 ms
64 bytes from x.x.x.122: icmp_seq=14 ttl=63 time=0.120 ms
64 bytes from x.x.x.122: icmp_seq=15 ttl=63 time=0.097 ms

pySilver avatar Jan 03 '19 04:01 pySilver

I also ran a UDP connection test between these servers (note: iperf3's UDP mode defaults to a 1 Mbit/s target rate, which is why throughput reads ~1.05 Mbit/s; the interesting columns here are jitter and loss):

Server

$ iperf3 -s
-----------------------------------------------------------
Server listening on 5201
-----------------------------------------------------------
Accepted connection from x.x.x.123, port 59908
[  5] local x.x.x.122 port 5201 connected to x.x.x.123 port 35733
[ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
[  5]   0.00-1.00   sec   120 KBytes   983 Kbits/sec  0.201 ms  0/15 (0%)
[  5]   1.00-2.00   sec   128 KBytes  1.05 Mbits/sec  0.184 ms  0/16 (0%)
[  5]   2.00-3.00   sec   128 KBytes  1.05 Mbits/sec  0.158 ms  0/16 (0%)
[  5]   3.00-4.00   sec   128 KBytes  1.05 Mbits/sec  0.146 ms  0/16 (0%)
[  5]   4.00-5.00   sec   128 KBytes  1.05 Mbits/sec  0.165 ms  0/16 (0%)
[  5]   5.00-6.00   sec   128 KBytes  1.05 Mbits/sec  0.133 ms  0/16 (0%)
[  5]   6.00-7.00   sec   128 KBytes  1.05 Mbits/sec  0.093 ms  0/16 (0%)
[  5]   7.00-8.00   sec   128 KBytes  1.05 Mbits/sec  0.081 ms  0/16 (0%)
[  5]   8.00-9.00   sec   128 KBytes  1.05 Mbits/sec  0.121 ms  0/16 (0%)
[  5]   9.00-10.00  sec   128 KBytes  1.05 Mbits/sec  0.169 ms  0/16 (0%)
[  5]  10.00-11.00  sec   128 KBytes  1.05 Mbits/sec  0.134 ms  0/16 (0%)
[  5]  11.00-12.00  sec   128 KBytes  1.05 Mbits/sec  0.202 ms  0/16 (0%)
[  5]  12.00-13.00  sec   128 KBytes  1.05 Mbits/sec  0.214 ms  0/16 (0%)
[  5]  13.00-14.00  sec   128 KBytes  1.05 Mbits/sec  0.218 ms  0/16 (0%)
[  5]  14.00-15.00  sec   128 KBytes  1.05 Mbits/sec  0.207 ms  0/16 (0%)
[  5]  15.00-16.00  sec   128 KBytes  1.05 Mbits/sec  0.163 ms  0/16 (0%)
[  5]  16.00-17.00  sec   128 KBytes  1.05 Mbits/sec  0.186 ms  0/16 (0%)
[  5]  17.00-18.00  sec   128 KBytes  1.05 Mbits/sec  0.142 ms  0/16 (0%)
[  5]  18.00-19.00  sec   128 KBytes  1.05 Mbits/sec  0.170 ms  0/16 (0%)
[  5]  19.00-20.00  sec   128 KBytes  1.05 Mbits/sec  0.119 ms  0/16 (0%)
[  5]  20.00-21.00  sec   128 KBytes  1.05 Mbits/sec  0.129 ms  0/16 (0%)
[  5]  21.00-22.00  sec   128 KBytes  1.05 Mbits/sec  0.139 ms  0/16 (0%)
[  5]  22.00-23.00  sec   128 KBytes  1.05 Mbits/sec  0.131 ms  0/16 (0%)
[  5]  23.00-24.00  sec   128 KBytes  1.05 Mbits/sec  0.152 ms  0/16 (0%)
[  5]  24.00-25.00  sec   128 KBytes  1.05 Mbits/sec  0.162 ms  0/16 (0%)
[  5]  25.00-26.00  sec   128 KBytes  1.05 Mbits/sec  0.126 ms  0/16 (0%)
[  5]  26.00-27.00  sec   128 KBytes  1.05 Mbits/sec  0.148 ms  0/16 (0%)
[  5]  27.00-28.00  sec   128 KBytes  1.05 Mbits/sec  0.114 ms  0/16 (0%)
[  5]  28.00-29.00  sec   128 KBytes  1.05 Mbits/sec  0.088 ms  0/16 (0%)
[  5]  29.00-30.00  sec   128 KBytes  1.05 Mbits/sec  0.098 ms  0/16 (0%)
[  5]  30.00-31.00  sec   128 KBytes  1.05 Mbits/sec  0.158 ms  0/16 (0%)
[  5]  31.00-32.00  sec   128 KBytes  1.05 Mbits/sec  0.119 ms  0/16 (0%)
[  5]  32.00-33.00  sec   128 KBytes  1.05 Mbits/sec  0.132 ms  0/16 (0%)
[  5]  33.00-34.00  sec   128 KBytes  1.05 Mbits/sec  0.196 ms  0/16 (0%)
[  5]  34.00-35.00  sec   128 KBytes  1.05 Mbits/sec  0.148 ms  0/16 (0%)
[  5]  35.00-36.00  sec   128 KBytes  1.05 Mbits/sec  0.208 ms  0/16 (0%)
[  5]  36.00-37.00  sec   128 KBytes  1.05 Mbits/sec  0.155 ms  0/16 (0%)
[  5]  37.00-38.00  sec   128 KBytes  1.05 Mbits/sec  0.165 ms  0/16 (0%)
[  5]  38.00-39.00  sec   128 KBytes  1.05 Mbits/sec  0.222 ms  0/16 (0%)
[  5]  39.00-40.00  sec   128 KBytes  1.05 Mbits/sec  0.213 ms  0/16 (0%)
[  5]  40.00-41.00  sec   128 KBytes  1.05 Mbits/sec  0.178 ms  0/16 (0%)
[  5]  41.00-42.00  sec   128 KBytes  1.05 Mbits/sec  0.160 ms  0/16 (0%)
[  5]  42.00-43.00  sec   128 KBytes  1.05 Mbits/sec  0.133 ms  0/16 (0%)
[  5]  43.00-44.00  sec   128 KBytes  1.05 Mbits/sec  0.133 ms  0/16 (0%)
[  5]  44.00-45.00  sec   128 KBytes  1.05 Mbits/sec  0.123 ms  0/16 (0%)
[  5]  45.00-46.00  sec   128 KBytes  1.05 Mbits/sec  0.139 ms  0/16 (0%)
[  5]  46.00-47.00  sec   128 KBytes  1.05 Mbits/sec  0.198 ms  0/16 (0%)
[  5]  47.00-48.00  sec   128 KBytes  1.05 Mbits/sec  0.149 ms  0/16 (0%)
[  5]  48.00-49.00  sec   128 KBytes  1.05 Mbits/sec  0.172 ms  0/16 (0%)
[  5]  49.00-50.00  sec   128 KBytes  1.05 Mbits/sec  0.153 ms  0/16 (0%)
[  5]  50.00-51.00  sec   128 KBytes  1.05 Mbits/sec  0.170 ms  0/16 (0%)
[  5]  51.00-52.00  sec   128 KBytes  1.05 Mbits/sec  0.144 ms  0/16 (0%)
[  5]  52.00-53.00  sec   128 KBytes  1.05 Mbits/sec  0.125 ms  0/16 (0%)
[  5]  53.00-54.00  sec   128 KBytes  1.05 Mbits/sec  0.120 ms  0/16 (0%)
[  5]  54.00-55.00  sec   128 KBytes  1.05 Mbits/sec  0.178 ms  0/16 (0%)
[  5]  55.00-56.00  sec   128 KBytes  1.05 Mbits/sec  0.119 ms  0/16 (0%)
[  5]  56.00-57.00  sec   128 KBytes  1.05 Mbits/sec  0.147 ms  0/16 (0%)
[  5]  57.00-58.00  sec   128 KBytes  1.05 Mbits/sec  0.118 ms  0/16 (0%)
[  5]  58.00-59.00  sec   128 KBytes  1.05 Mbits/sec  0.150 ms  0/16 (0%)
[  5]  59.00-60.00  sec   128 KBytes  1.05 Mbits/sec  0.132 ms  0/16 (0%)
[  5]  60.00-60.04  sec  0.00 Bytes  0.00 bits/sec  0.132 ms  0/0 (-nan%)
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
[  5]   0.00-60.04  sec  7.49 MBytes  1.05 Mbits/sec  0.132 ms  0/959 (0%)
-----------------------------------------------------------

Client

$ iperf3 -u -c x.x.x.122 -p 5201 -t 60
Connecting to host x.x.x.122, port 5201
[  4] local x.x.x.123 port 35733 connected to x.x.x.122 port 5201
[ ID] Interval           Transfer     Bandwidth       Total Datagrams
[  4]   0.00-1.00   sec   120 KBytes   983 Kbits/sec  15
[  4]   1.00-2.00   sec   128 KBytes  1.05 Mbits/sec  16
[  4]   2.00-3.00   sec   128 KBytes  1.05 Mbits/sec  16
[  4]   3.00-4.00   sec   128 KBytes  1.05 Mbits/sec  16
[  4]   4.00-5.00   sec   128 KBytes  1.05 Mbits/sec  16
[  4]   5.00-6.00   sec   128 KBytes  1.05 Mbits/sec  16
[  4]   6.00-7.00   sec   128 KBytes  1.05 Mbits/sec  16
[  4]   7.00-8.00   sec   128 KBytes  1.05 Mbits/sec  16
[  4]   8.00-9.00   sec   128 KBytes  1.05 Mbits/sec  16
[  4]   9.00-10.00  sec   128 KBytes  1.05 Mbits/sec  16
[  4]  10.00-11.00  sec   128 KBytes  1.05 Mbits/sec  16
[  4]  11.00-12.00  sec   128 KBytes  1.05 Mbits/sec  16
[  4]  12.00-13.00  sec   128 KBytes  1.05 Mbits/sec  16
[  4]  13.00-14.00  sec   128 KBytes  1.05 Mbits/sec  16
[  4]  14.00-15.00  sec   128 KBytes  1.05 Mbits/sec  16
[  4]  15.00-16.00  sec   128 KBytes  1.05 Mbits/sec  16
[  4]  16.00-17.00  sec   128 KBytes  1.05 Mbits/sec  16
[  4]  17.00-18.00  sec   128 KBytes  1.05 Mbits/sec  16
[  4]  18.00-19.00  sec   128 KBytes  1.05 Mbits/sec  16
[  4]  19.00-20.00  sec   128 KBytes  1.05 Mbits/sec  16
[  4]  20.00-21.00  sec   128 KBytes  1.05 Mbits/sec  16
[  4]  21.00-22.00  sec   128 KBytes  1.05 Mbits/sec  16
[  4]  22.00-23.00  sec   128 KBytes  1.05 Mbits/sec  16
[  4]  23.00-24.00  sec   128 KBytes  1.05 Mbits/sec  16
[  4]  24.00-25.00  sec   128 KBytes  1.05 Mbits/sec  16
[  4]  25.00-26.00  sec   128 KBytes  1.05 Mbits/sec  16
[  4]  26.00-27.00  sec   128 KBytes  1.05 Mbits/sec  16
[  4]  27.00-28.00  sec   128 KBytes  1.05 Mbits/sec  16
[  4]  28.00-29.00  sec   128 KBytes  1.05 Mbits/sec  16
[  4]  29.00-30.00  sec   128 KBytes  1.05 Mbits/sec  16
[  4]  30.00-31.00  sec   128 KBytes  1.05 Mbits/sec  16
[  4]  31.00-32.00  sec   128 KBytes  1.05 Mbits/sec  16
[  4]  32.00-33.00  sec   128 KBytes  1.05 Mbits/sec  16
[  4]  33.00-34.00  sec   128 KBytes  1.05 Mbits/sec  16
[  4]  34.00-35.00  sec   128 KBytes  1.05 Mbits/sec  16
[  4]  35.00-36.00  sec   128 KBytes  1.05 Mbits/sec  16
[  4]  36.00-37.00  sec   128 KBytes  1.05 Mbits/sec  16
[  4]  37.00-38.00  sec   128 KBytes  1.05 Mbits/sec  16
[  4]  38.00-39.00  sec   128 KBytes  1.05 Mbits/sec  16
[  4]  39.00-40.00  sec   128 KBytes  1.05 Mbits/sec  16
[  4]  40.00-41.00  sec   128 KBytes  1.05 Mbits/sec  16
[  4]  41.00-42.00  sec   128 KBytes  1.05 Mbits/sec  16
[  4]  42.00-43.00  sec   128 KBytes  1.05 Mbits/sec  16
[  4]  43.00-44.00  sec   128 KBytes  1.05 Mbits/sec  16
[  4]  44.00-45.00  sec   128 KBytes  1.05 Mbits/sec  16
[  4]  45.00-46.00  sec   128 KBytes  1.05 Mbits/sec  16
[  4]  46.00-47.00  sec   128 KBytes  1.05 Mbits/sec  16
[  4]  47.00-48.00  sec   128 KBytes  1.05 Mbits/sec  16
[  4]  48.00-49.00  sec   128 KBytes  1.05 Mbits/sec  16
[  4]  49.00-50.00  sec   128 KBytes  1.05 Mbits/sec  16
[  4]  50.00-51.00  sec   128 KBytes  1.05 Mbits/sec  16
[  4]  51.00-52.00  sec   128 KBytes  1.05 Mbits/sec  16
[  4]  52.00-53.00  sec   128 KBytes  1.05 Mbits/sec  16
[  4]  53.00-54.00  sec   128 KBytes  1.05 Mbits/sec  16
[  4]  54.00-55.00  sec   128 KBytes  1.05 Mbits/sec  16
[  4]  55.00-56.00  sec   128 KBytes  1.05 Mbits/sec  16
[  4]  56.00-57.00  sec   128 KBytes  1.05 Mbits/sec  16
[  4]  57.00-58.00  sec   128 KBytes  1.05 Mbits/sec  16
[  4]  58.00-59.00  sec   128 KBytes  1.05 Mbits/sec  16
[  4]  59.00-60.00  sec   128 KBytes  1.05 Mbits/sec  16
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
[  4]   0.00-60.00  sec  7.49 MBytes  1.05 Mbits/sec  0.132 ms  0/959 (0%)
[  4] Sent 959 datagrams

iperf Done.

If I understand correctly, avg_latency_us in mcrouter is measured in microseconds, so 3000 us equals 3 ms? That doesn't look "slow" per request. Still, I cannot figure out how such a small number of keys and so little data transfer can take so much time.

I've implemented multi-get at the app level to decrease the number of gets; it helped, but I guess it's still far from typical expected numbers (?)
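To double-check the units and what they imply for a page issuing hundreds of sequential gets (a rough sketch with round numbers from the stats above):

```python
# avg_latency_us is in microseconds, so 3000 us is 3 ms per request.
avg_latency_ms = 3000 / 1000
print(avg_latency_ms)  # 3.0

# Even at ~1 ms per round trip, 400 strictly sequential gets cost
# ~0.4 s of wall time, which matches the app logs pretty closely.
print(400 * 1e-3)  # 0.4
```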

Latest app log (mcrouter running on the app servers, memcached running on separate boxes):

Sample 1:
Memcached Object Cache
Command: `get` called 411 times in 0.41412s with total payload of 4.32MB
Command: `delete` called 75 times in 0.04236s with total payload of 0B
Command: `set` called 6 times in 0.15476s with total payload of 4.29MB
Command: `add` called 2 times in 0.05735s with total payload of 2.86MB

Sample 2:
Memcached Object Cache
Command: `get` called 91 times in 0.18246s with total payload of 6.55MB
Command: `set` called 2 times in 0.00163s with total payload of 4.49kB
Command: `delete` called 73 times in 0.04110s with total payload of 0B
Command: `getMulti` called 1 times in 0.01439s with total payload of 390.09kB

Direct connection to a remote memcached from the app server:

Memcached Object Cache
Command: `get` called 385 times in 0.20129s with total payload of 6.55MB
Command: `delete` called 73 times in 0.03970s with total payload of 0B
Command: `set` called 1 times in 0.00020s with total payload of 3.28kB
Command: `getMulti` called 1 times in 0.01336s with total payload of 350.69kB

Direct connection to a local memcached on the app server:

Memcached Object Cache
Command: `get` called 385 times in 0.09051s with total payload of 6.55MB
Command: `delete` called 73 times in 0.00345s with total payload of 0B
Command: `set` called 1 times in 0.00011s with total payload of 3.28kB
Command: `getMulti` called 1 times in 0.00174s with total payload of 219.75kB
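Comparing the three scenarios for the same workload (a sketch; total `get` times copied from the logs above):

```python
# Total `get` time in seconds for each connection mode (from the logs above;
# the mcrouter sample has 411 gets vs 385, so its ratio is slightly inflated).
timings = {
    "mcrouter -> remote memcached": 0.41412,
    "direct remote memcached": 0.20129,
    "direct local memcached": 0.09051,
}
baseline = timings["direct local memcached"]
for mode, total_s in timings.items():
    print(f"{mode}: {total_s / baseline:.1f}x local baseline")
```

So the remote hop roughly doubles the local cost, and going through mcrouter roughly doubles the direct remote cost again.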

pySilver avatar Jan 03 '19 05:01 pySilver

So what you're seeing is one out of four app servers exhibiting different network behavior. Can you compare the app servers' networking configuration?

  • NIC type (`lspci`)
  • Kernel parameters: `sysctl -a | grep -E '^net'`

I would also double-check that the four app servers are running the exact same set of services and background tasks, with similar workloads. Not sure I can help more than that... if you suspect an mcrouter/memcached issue, try to narrow it down to those components.

orishu avatar Jan 03 '19 18:01 orishu

@orishu thanks for the advice, I will check this. It looks like a hardware/network problem to me, so I'd better bother devops with that. What performance improvement should I expect otherwise? 10x, 100x faster retrieval?

pySilver avatar Jan 03 '19 18:01 pySilver

What performance improvement should I expect otherwise? 10x, 100x faster retrieval?

Do you mean by using `getMulti`? I think less than 10x.

orishu avatar Jan 03 '19 18:01 orishu

Nope, I mean by using a series of gets in a row (many of the gets are unpredictable and depend on state, so I can't move them to getMulti).

Current typical performance: Command: `get` called 397 times in 0.24053s with total payload of 6.56MB

The memcached wiki says a single node can handle 200k ops/s on a fast network and a fast server.
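That 200k ops/s figure is server-side throughput, though; strictly sequential gets are bounded by the network round trip, not the server. A rough sketch using the ~0.13 ms ping RTT measured earlier:

```python
# Server-side cost per op at 200k ops/s: only ~5 microseconds.
server_side_us = 1_000_000 / 200_000
print(server_side_us)  # 5.0

# Floor for 397 strictly sequential gets at a ~0.13 ms RTT,
# before any client, mcrouter, or server work at all.
rtt_ms = 0.13
print(f"{397 * rtt_ms:.1f} ms")  # ~51.6 ms
```

The observed 240 ms for 397 gets is ~0.6 ms per call, so most of the time is per-call overhead in the client/proxy path rather than raw network or server capacity.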

pySilver avatar Jan 03 '19 19:01 pySilver

@orishu I think I've found the culprit here. Apparently it's solely the PHP client (php7.2-memcached).

I'll just leave my benchmark here for the sake of people who might face the same issue.

I ran a few tests with the excellent RedisLabs memtier_benchmark utility (see the options explanation) using the following command:

docker run --rm memtier_benchmark -s <TESTED_IP_HERE> -p <TESTED_PORT_HERE> -P memcache_text -c 69 --multi-key-get=1000 -R --data-size-range=128-512

Test 1: Local memcached

[RUN #1 100%,  11 secs]  0 threads:     2760000 ops,  242397 (avg:  232586) ops/sec, 60.92MB/sec (avg: 59.50MB/sec),  1.14 (avg:  1.19) msec latency

4         Threads
69        Connections per thread
10000     Requests per client


ALL STATS
=========================================================================
Type         Ops/sec     Hits/sec   Misses/sec      Latency       KB/sec
-------------------------------------------------------------------------
Sets       116142.68          ---          ---      1.18600     40488.49
Gets       116142.68     18095.03   1143331.76      1.18400     20357.29
Waits           0.00          ---          ---      0.00000          ---
Totals     232285.36     18095.03   1143331.76      1.18500     60845.79

Looks good.

Test 2: Remote memcached

[RUN #1 100%,  44 secs]  0 threads:     2760000 ops,   62722 (avg:   61468) ops/sec, 15.65MB/sec (avg: 15.45MB/sec),  4.40 (avg:  4.49) msec latency

4         Threads
69        Connections per thread
10000     Requests per client


ALL STATS
=========================================================================
Type         Ops/sec     Hits/sec   Misses/sec      Latency       KB/sec
-------------------------------------------------------------------------
Sets        30726.77          ---          ---      4.48900     10711.66
Gets        30726.77        98.33    307169.36      4.48700      5102.47
Waits           0.00          ---          ---      0.00000          ---
Totals      61453.54        98.33    307169.36      4.48800     15814.13

Huge performance drop, apparently due to network round trips.

The next test was made with mcrouter configured as a sharded pool:

sharded mcrouter config:

mcrouter \
    --config-str='{"pools":{"A":{"servers":["**2 remote ip:host here**"]}},
                  "route":"PoolRoute|A"}' \
    -p 5001 \
    --num-proxies=%CPU_CORES% # =>16 in my case

Test 3: Local mcrouter with 2 sharded remote memcached

[RUN #1 100%,  21 secs]  0 threads:     2760000 ops,  153297 (avg:  128963) ops/sec, 38.97MB/sec (avg: 32.42MB/sec),  1.80 (avg:  2.14) msec latency

4         Threads
69        Connections per thread
10000     Requests per client


ALL STATS
=========================================================================
Type         Ops/sec     Hits/sec   Misses/sec      Latency       KB/sec
-------------------------------------------------------------------------
Sets        68848.70          ---          ---      2.06100     24001.34
Gets        68848.70       244.06    688242.92      2.21300     11442.61
Waits           0.00          ---          ---      0.00000          ---
Totals     137697.40       244.06    688242.92      2.13700     35443.95

Latency is roughly double the localhost test, but these are still pretty impressive results (I guess thanks to the thread count; a single thread shows as low as 20k ops/sec).
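Putting the three memtier runs side by side (a sketch; total ops/sec copied from the results above):

```python
# Total ops/sec from the three runs above.
local_direct = 232_285      # test 1: local memcached
remote_direct = 61_453      # test 2: remote memcached
mcrouter_sharded = 137_697  # test 3: mcrouter over 2 remote shards

print(f"remote direct vs local: {remote_direct / local_direct:.2f}x")          # ~0.26x
print(f"sharded mcrouter vs remote direct: {mcrouter_sharded / remote_direct:.2f}x")  # ~2.24x
```

So fanning out over two shards through a local mcrouter recovers a large part of the remote penalty, which points at the client rather than mcrouter as the slow link.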

The last test was made just for fun, with a replicated pool config:

mcrouter \
    --config-str='{"pools":{"A":{"servers":["**2 remote ip:host here**"]}},
                  "route": {"type": "OperationSelectorRoute","operation_policies": {"add": "AllSyncRoute|Pool|A","delete": "AllSyncRoute|Pool|A","get": "LatestRoute|Pool|A", "set": "AllSyncRoute|Pool|A"}}}' \
    -p 5001 \
    --num-proxies=%CPU_CORES%

Test 4: Local mcrouter with 2 replicated remote memcached

[RUN #1 100%,  23 secs]  0 threads:     2760000 ops,  126556 (avg:  116212) ops/sec, 32.03MB/sec (avg: 29.23MB/sec),  2.18 (avg:  2.37) msec latency

4         Threads
69        Connections per thread
10000     Requests per client


ALL STATS
=========================================================================
Type         Ops/sec     Hits/sec   Misses/sec      Latency       KB/sec
-------------------------------------------------------------------------
Sets        60142.08          ---          ---      2.43100     20966.13
Gets        60142.08       264.63    601156.17      2.31200     10012.47
Waits           0.00          ---          ---      0.00000          ---
Totals     120284.16       264.63    601156.17      2.37200     30978.60

As good as sharded.

So either my PHP app code or the php-memcached module is dramatically slow. Since I'm not doing anything complex in the PHP code (which gets compiled anyway), my bet is that the PHP client is the culprit.

pySilver avatar Jan 04 '19 04:01 pySilver