
Setting TCP_QUICKACK reduces trylock latency by 40ms.

Open · madars opened this issue 10 months ago · 1 comment

When running PArSEC on Linux (e.g. Debian bookworm on aarch64), there is a 40ms delay between requesting a lock and obtaining a response.

For example, in a 1-agent, 1-shard localhost configuration, lua_bench with 2 wallets produces the following trace for the agent:

[2024-04-05 02:16:17.990] [INFO ] using sha2:  standard
[2024-04-05 02:16:17.990] [INFO ] Connecting to shards...
[2024-04-05 02:16:17.990] [INFO ] Connected to shards, connecting to ticketer...
[2024-04-05 02:16:17.990] [INFO ] Requesting broker recovery...
[2024-04-05 02:16:17.991] [TRACE] 0xaaaae9d42170 Broker handled get_tickets for shard 0
[2024-04-05 02:16:18.026] [INFO ] Agent running
[2024-04-05 02:16:32.009] [TRACE] received request with m_function  7061790100000000000000  and param size  144
[2024-04-05 02:16:32.010] [TRACE] received request with m_function  7061790000000000000000  and param size  144
[2024-04-05 02:16:32.012] [TRACE] do_start  7061790100000000000000
[2024-04-05 02:16:32.012] [TRACE] do_start  7061790000000000000000
[2024-04-05 02:16:32.056] [TRACE] 0xaaaae9d42170 Broker locked key for 1000
(snipped)

In the shard log the relevant lines are:

[2024-04-05 02:16:32.055] [TRACE] 1000 requesting lock on 7061790100000000000000 0
[2024-04-05 02:16:32.056] [TRACE] Assigning read lock on 7061790100000000000000 to 1000
[2024-04-05 02:16:32.056] [TRACE] 0xffffb4001580 shard handled try_lock for 1000

Note that the agent starts executing the contract at 02:16:32.012, but the shard processes its first trylock a whole 43ms later, at 02:16:32.055, even though all communication happens over the loopback interface. Inspecting the local traffic with tcpdump -i lo shows the following sequence of packets for the same timestamp range (the shard is listening on port 5556):

No.	Time	Source	Destination	Protocol	Length	Info
110	2024-04-05 02:16:32.012472	::1	::1	TCP	94	47948 → 5556 [PSH, ACK] Seq=1 Ack=1 Win=512 Len=8 TSval=956576386 TSecr=956562365
111	2024-04-05 02:16:32.055574	::1	::1	TCP	86	5556 → 47948 [ACK] Seq=1 Ack=9 Win=512 Len=0 TSval=956576429 TSecr=956576386
112	2024-04-05 02:16:32.055611	::1	::1	TCP	186	47948 → 5556 [PSH, ACK] Seq=9 Ack=1 Win=512 Len=100 TSval=956576429 TSecr=956576429

That is, there is an agent->shard TCP packet that only gets ACK'd by the shard 43ms later. However, once that ACK goes out, the shard's trylock response follows almost immediately.

I believe this comes from Linux's delayed ACK feature, where ACKs are briefly held back so they can be coalesced, improving throughput at the expense of latency. According to https://topic.alibabacloud.com/a/linux-tcp-delay-confirmation_1_16_31848681.html, the default delay is exactly 40ms, suspiciously close to the number observed above. This hypothesis is further strengthened by the fact that this PR (which enables TCP_QUICKACK) improves the 5-wallet lua_bench tx submission rate by around 100x (from ~15tps to ~1500tps).
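For reference, a minimal sketch of what enabling TCP_QUICKACK looks like on Linux via setsockopt; the function name is illustrative, not the actual opencbdc-tx code:

#include <netinet/in.h>   // IPPROTO_TCP
#include <netinet/tcp.h>  // TCP_QUICKACK (Linux-specific)
#include <sys/socket.h>   // setsockopt

// Ask the kernel to send ACKs immediately on this connection instead of
// delaying them. The flag is not sticky: the kernel may clear it again
// during normal ACK processing, which is why it must be re-set (see below).
void enable_quickack(int sock_fd) {
    int flag = 1;
    setsockopt(sock_fd, IPPROTO_TCP, TCP_QUICKACK, &flag, sizeof(flag));
}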

As suggested by https://github.com/netty/netty/issues/13610, this needs to be done after each recv; in fact, setting the option only at the end of tcp_socket::receive (and once when setting up the socket) improved the tx submission rate by only ~2x (to ~28tps).
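Concretely, "after each recv" could look like the following sketch; the wrapper name is hypothetical and stands in for wherever tcp_socket actually performs its reads:

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <sys/types.h>

// Re-arm TCP_QUICKACK around every read, since the kernel clears the flag
// as part of normal ACK handling and delayed ACKs come back otherwise.
ssize_t recv_quickack(int sock_fd, void* buf, size_t len) {
    auto ret = ::recv(sock_fd, buf, len, 0);
    int flag = 1;
    // Re-set right after the read so the ACK for the next incoming segment
    // is not held back by the delayed-ACK timer.
    setsockopt(sock_fd, IPPROTO_TCP, TCP_QUICKACK, &flag, sizeof(flag));
    return ret;
}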

I think that in our setting enabling TCP_NODELAY (which disables coalescing on the sending side) also makes sense, but I did not see significant performance changes, so maybe it can be omitted. (The best option for our trylock interface would probably be to send each trylock request and response inside a single writev or with TCP_CORK.) For now, this patch does no error handling and does not gate the setsockopt call to platforms where TCP_QUICKACK is defined (only Linux, I guess). How should this be improved?
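For illustration, one possible shape for the gating and error handling, sketched under the assumption that logging and continuing is acceptable; the helper name and the stderr logging are placeholders, not a concrete proposal for the patch:

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <cerrno>
#include <cstdio>
#include <cstring>

// Best-effort latency tuning: each option is only compiled in where the
// platform defines it, and failures are logged rather than treated as fatal.
void tune_latency(int sock_fd) {
    int flag = 1;
#ifdef TCP_NODELAY
    // Disable Nagle coalescing on the send side.
    if(setsockopt(sock_fd, IPPROTO_TCP, TCP_NODELAY, &flag, sizeof(flag)) != 0) {
        std::fprintf(stderr, "TCP_NODELAY failed: %s\n", std::strerror(errno));
    }
#endif
#ifdef TCP_QUICKACK  // Linux-only
    // Disable delayed ACKs on the receive side (must be re-set after reads).
    if(setsockopt(sock_fd, IPPROTO_TCP, TCP_QUICKACK, &flag, sizeof(flag)) != 0) {
        std::fprintf(stderr, "TCP_QUICKACK failed: %s\n", std::strerror(errno));
    }
#endif
}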

madars · Apr 05 '24 06:04

Two failing CI checks need to be addressed.

HalosGhost · Apr 19 '24 15:04