opencbdc-tx
Setting TCP_QUICKACK reduces trylock latency by 40ms.
When running PArSEC on Linux (e.g. Debian bookworm on aarch64), there is a 40ms delay between requesting a lock and obtaining a response.
For example, in a 1-agent, 1-shard localhost configuration, lua_bench
with 2 wallets produces the following trace for the agent:
[2024-04-05 02:16:17.990] [INFO ] using sha2: standard
[2024-04-05 02:16:17.990] [INFO ] Connecting to shards...
[2024-04-05 02:16:17.990] [INFO ] Connected to shards, connecting to ticketer...
[2024-04-05 02:16:17.990] [INFO ] Requesting broker recovery...
[2024-04-05 02:16:17.991] [TRACE] 0xaaaae9d42170 Broker handled get_tickets for shard 0
[2024-04-05 02:16:18.026] [INFO ] Agent running
[2024-04-05 02:16:32.009] [TRACE] received request with m_function 7061790100000000000000 and param size 144
[2024-04-05 02:16:32.010] [TRACE] received request with m_function 7061790000000000000000 and param size 144
[2024-04-05 02:16:32.012] [TRACE] do_start 7061790100000000000000
[2024-04-05 02:16:32.012] [TRACE] do_start 7061790000000000000000
[2024-04-05 02:16:32.056] [TRACE] 0xaaaae9d42170 Broker locked key for 1000
(snipped)
In the shard log the relevant lines are:
[2024-04-05 02:16:32.055] [TRACE] 1000 requesting lock on 7061790100000000000000 0
[2024-04-05 02:16:32.056] [TRACE] Assigning read lock on 7061790100000000000000 to 1000
[2024-04-05 02:16:32.056] [TRACE] 0xffffb4001580 shard handled try_lock for 1000
Note that the agent starts executing the contract at 02:16:32.012, but the shard processes its first trylock
a whole 43ms later, at 02:16:32.055, even though the communication happens entirely over the loopback interface. Inspecting the local traffic with tcpdump -i lo
shows the following sequence of packets for the same timestamp range (the shard is listening on port 5556):
No. Time Source Destination Protocol Length Info
110 2024-04-05 02:16:32.012472 ::1 ::1 TCP 94 47948 → 5556 [PSH, ACK] Seq=1 Ack=1 Win=512 Len=8 TSval=956576386 TSecr=956562365
111 2024-04-05 02:16:32.055574 ::1 ::1 TCP 86 5556 → 47948 [ACK] Seq=1 Ack=9 Win=512 Len=0 TSval=956576429 TSecr=956576386
112 2024-04-05 02:16:32.055611 ::1 ::1 TCP 186 47948 → 5556 [PSH, ACK] Seq=9 Ack=1 Win=512 Len=100 TSval=956576429 TSecr=956576429
That is, there is an agent->shard TCP packet that only gets ACK'd by the shard 43ms later. However, after that ACK, the shard's trylock
response is almost immediate.
I believe this comes from the delayed ACK feature in Linux, where ACKs are briefly held back so they can be coalesced, improving throughput at the expense of latency. According to https://topic.alibabacloud.com/a/linux-tcp-delay-confirmation_1_16_31848681.html, the default delay is exactly 40ms, suspiciously close to the number observed above. This hypothesis is further strengthened by the fact that this PR (which enables TCP_QUICKACK) improves the 5-wallet lua_bench
tx submission rate by around 100x (from ~15tps to ~1500tps).
As suggested by https://github.com/netty/netty/issues/13610, this needs to be done after each recv;
in fact, setting it only at the end of tcp_socket::receive
(and once when setting up the socket) only improved the tx submission rate by about 2x (to ~28tps). See the sketch below.
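To make the "re-arm after each recv" point concrete, here is a minimal sketch (not the actual opencbdc-tx tcp_socket code; read_exact and sock_fd are illustrative names): the kernel clears TCP_QUICKACK on its own, so the option has to be set again after every recv rather than once at connection setup.

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

#include <cstddef>

// Read exactly `len` bytes, re-enabling TCP_QUICKACK after each recv so the
// next segment from the peer is ACK'd immediately instead of after the
// ~40ms delayed-ACK timer.
auto read_exact(int sock_fd, std::byte* buf, size_t len) -> bool {
    size_t total = 0;
    while(total < len) {
        auto n = recv(sock_fd, buf + total, len - total, 0);
        if(n <= 0) {
            return false; // peer closed the connection or an error occurred
        }
        total += static_cast<size_t>(n);
#ifdef TCP_QUICKACK
        int flag = 1;
        setsockopt(sock_fd, IPPROTO_TCP, TCP_QUICKACK, &flag, sizeof(flag));
#endif
    }
    return true;
}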
I think that in our setting, enabling TCP_NODELAY
(which disables coalescing on the sending side) also makes sense for our protocol, but I did not see significant performance changes, so maybe that can be omitted. (The best option for our trylock
interface would probably be to send each trylock
request and response in a single writev
call, or to use TCP_CORK.)
For now, this patch does not do any error handling and does not gate setsockopt
to platforms where TCP_QUICKACK
is defined (I guess only Linux). How should this be improved? One possible shape is sketched below.
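The following is one possible answer to the gating/error-handling question, offered as an assumption rather than the agreed approach: wrap each option in #ifdef so non-Linux builds compile cleanly, and log (without failing on) setsockopt errors. The function name is hypothetical.

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

#include <cerrno>
#include <cstring>
#include <iostream>

void set_low_latency_opts(int sock_fd) {
    int flag = 1;
#ifdef TCP_NODELAY
    // Disable Nagle's algorithm so small request/response frames are sent
    // immediately instead of being coalesced on the sending side.
    if(setsockopt(sock_fd, IPPROTO_TCP, TCP_NODELAY, &flag, sizeof(flag)) != 0) {
        std::cerr << "TCP_NODELAY failed: " << std::strerror(errno) << "\n";
    }
#endif
#ifdef TCP_QUICKACK
    // Linux-only: ACK immediately rather than waiting for the delayed-ACK
    // timer. Note this must also be re-armed after each recv(), as above.
    if(setsockopt(sock_fd, IPPROTO_TCP, TCP_QUICKACK, &flag, sizeof(flag)) != 0) {
        std::cerr << "TCP_QUICKACK failed: " << std::strerror(errno) << "\n";
    }
#endif
}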
Two failing CI checks need to be addressed.