Updated HTTP/3 benchmark
**h2load**
```
finished in 65.00s, 83685.82 req/s, 94.49MB/s
requests: 5021149 total, 5021149 started, 5021149 done, 5021149 succeeded, 0 failed, 0 errored, 0 timeout
status codes: 5021166 2xx, 0 3xx, 0 4xx, 0 5xx
traffic: 5.54GB (5944777841) total, 798.07MB (836837680) headers (space savings 35.31%), 5.19GB (5568962560) data
UDP datagram: 2993583 sent, 6781646 received
                     min         max         mean         sd        +/- sd
time for request:     2.78ms     41.58ms     10.85ms      4.10ms    75.84%
time for connect:        0us         0us         0us         0us     0.00%
time to 1st byte:        0us         0us         0us         0us     0.00%
req/s           :     484.90     1073.51      836.85      186.05    71.00%
```
**dstat**
```
You did not select any stats, using -cdngy by default.
----total-usage---- -dsk/total- ---net/lo-- -net/total- ---paging-- ---system--
usr sys idl wai stl| read  writ| recv  send: recv  send|  in   out | int   csw
 64   7  28   0   0|  40k 1313k| 532B  532B:7632k   97M|0.20     0 |3491k  311k
 68   7  24   0   0|   0   506k| 532B  532B:8075k  102M|0.20     0 |3597k  318k
 69   7  24   0   0|   0   146k| 552B  552B:8112k  102M|4.60     0 |3698k  323k
 69   7  23   0   0|   0    35M| 532B  532B:8193k  104M|1.60     0 |3709k  321k
 69   7  23   0   0|   0     0 | 592B  592B:8335k  105M|1.00     0 |3759k  323k
 69   7  23   0   0|   0   267k| 532B  532B:8191k  103M|1.00     0 |3700k  322k
 69   7  24   0   0|   0     0 | 532B  532B:8245k  104M|0.40     0 |3720k  320k
 69   7  24   0   0|   0  1638B| 532B  532B:8128k  101M|0.60     0 |3691k  322k
 69   7  24   0   0|2257k   57M| 532B  532B:8177k  103M|11.6     0 |3694k  324k
 69   7  24   0   0|   0    17M| 532B  532B:8196k  102M|1.60     0 |3723k  321k
 69   7  23   0   0|   0     0 | 592B  592B:8264k  104M|1.40     0 |3746k  321k
 69   7  23   0   0|   0   825k| 532B  532B:8211k  103M|1.00     0 |3711k  321k
 69   7  23   0   0|   0   826k| 532B  532B:8257k  104M|1.60     0 |3728k  318k
```
**perf stat**
```
perf: 'stat-p' is not a perf-command. See 'perf --help'.
```
**perf report**
```
# Total Lost Samples: 0
#
# Samples: 1M of event 'cycles:P'
# Event count (approx.): 5437933978065
#
# Overhead  Shared Object      Symbol                                              IPC   [IPC Coverage]
# ........  .................  ..................................................  ....................
#
    45.77%  traffic_server     [.] freelist_new(_InkFreeList*)                     -      -
    10.75%  traffic_server     [.] freelist_free(_InkFreeList*, void*)             -      -
     1.54%  traffic_server     [.] ink_freelist_new(_InkFreeList*)                 -      -
     1.03%  libc.so.6          [.] __memmove_avx_unaligned_erms                    -      -
     1.02%  traffic_server     [.] IOBufferBlock::clear()                          -      -
     0.93%  libc.so.6          [.] _int_malloc                                     -      -
     0.72%  libquiche.so       [.] <alloc::string::String as core::fmt::Write>::w  -      -
     0.64%  libquiche.so       [.] core::fmt::write                                -      -
     0.62%  traffic_server     [.] thread_freeup(FreelistAllocator&, ProxyAllocat  -      -
     0.59%  [vdso]             [.] __vdso_clock_gettime                            -      -
     0.51%  traffic_server     [.] QPACK::_encode_header(MIMEField const&, unsign  -      -
     0.49%  libc.so.6          [.] _int_free                                       -      -
     0.46%  [kernel.kallsyms]  [k] __memcpy                                        -      -
     0.37%  libquiche.so       [.] quiche::Connection::send_single                 -      -
```
About 23% of the CPU is idle, and a lot of memory allocation is happening.
It looks like we are not using proxy allocators for H3:
```
wkaras ~/LOCAL_REPOS/TS
$ grep 'ProxyAllocator.*http[23]' $(findsrc)
./include/iocore/eventsystem/Thread.h:  ProxyAllocator http2ClientSessionAllocator;
./include/iocore/eventsystem/Thread.h:  ProxyAllocator http2ServerSessionAllocator;
./include/iocore/eventsystem/Thread.h:  ProxyAllocator http2StreamAllocator;
wkaras ~/LOCAL_REPOS/TS
$
```
Proxy allocators reduce the number of allocs/frees done through class allocators, which require freelist operations. Are we benchmarking with `-f`, where freelist new and free just call stdlib `malloc` and `free`?
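To make the layering concrete, here is a minimal, self-contained model of the two allocator layers being discussed. This is not the ATS implementation (the real `ink_freelist` is lock-free rather than mutex-guarded, and the real `ClassAllocator`/`ProxyAllocator` are templates with more machinery); all names here are invented for illustration. The point is only why a per-thread cache keeps the hot path away from the shared freelist that dominates the profile above.

```cpp
#include <cstddef>
#include <cstdlib>
#include <mutex>

struct FreeNode {
  FreeNode *next;
};

// Model of a global class allocator backed by a shared freelist. Every
// alloc/free touches shared state; this is the kind of work that shows up
// as freelist_new/freelist_free in the perf report above.
class ClassAllocatorModel {
public:
  explicit ClassAllocatorModel(std::size_t size)
    : size_(size < sizeof(FreeNode) ? sizeof(FreeNode) : size) {}

  void *alloc() {
    {
      std::lock_guard<std::mutex> g(lock_);
      if (FreeNode *n = head_) {
        head_ = n->next;
        return n;
      }
    }
    return std::malloc(size_); // freelist empty: fall back to the heap
  }

  void free(void *p) {
    std::lock_guard<std::mutex> g(lock_);
    auto *n = static_cast<FreeNode *>(p);
    n->next = head_;
    head_ = n;
  }

private:
  std::mutex lock_;
  FreeNode *head_ = nullptr;
  std::size_t size_;
};

// Model of a per-thread proxy allocator: a private cache in front of the
// shared freelist. While the cache is non-empty, alloc/free are plain
// pointer operations with no synchronization at all.
class ProxyAllocatorModel {
public:
  explicit ProxyAllocatorModel(ClassAllocatorModel &backing) : backing_(backing) {}

  void *alloc() {
    if (FreeNode *n = freelist_) {
      freelist_ = n->next;
      return n;
    }
    return backing_.alloc(); // slow path: one trip to the shared freelist
  }

  void free(void *p) {
    auto *n = static_cast<FreeNode *>(p);
    n->next = freelist_;
    freelist_ = n;
  }

private:
  FreeNode *freelist_ = nullptr;
  ClassAllocatorModel &backing_;
};

int main() {
  ClassAllocatorModel shared(64);        // one global allocator per class
  ProxyAllocatorModel perThread(shared); // one of these per event thread
  void *p = perThread.alloc();           // cold path: heap or shared freelist
  perThread.free(p);                     // goes to the per-thread cache
  void *q = perThread.alloc();           // hot path: served locally, q == p
  perThread.free(q);
  return 0;
}
```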
There are class allocators for H3 and QUIC, although some are not used:
```
$ git grep ClassAllocator src/proxy/http3/
src/proxy/http3/Http3Frame.cc:ClassAllocator<Http3Frame> http3FrameAllocator("http3FrameAllocator");
src/proxy/http3/Http3Frame.cc:ClassAllocator<Http3DataFrame> http3DataFrameAllocator("http3DataFrameAllocator");
src/proxy/http3/Http3Frame.cc:ClassAllocator<Http3HeadersFrame> http3HeadersFrameAllocator("http3HeadersFrameAllocator");
src/proxy/http3/Http3Frame.cc:ClassAllocator<Http3SettingsFrame> http3SettingsFrameAllocator("http3SettingsFrameAllocator");
$ git grep ClassAllocator src/iocore/net/ | grep -i quic
src/iocore/net/P_QUICNet.h:extern ClassAllocator<QUICPollEvent> quicPollEventAllocator;
src/iocore/net/P_QUICNetVConnection.h:extern ClassAllocator<QUICNetVConnection> quicNetVCAllocator;
src/iocore/net/QUICNet.cc:ClassAllocator<QUICPollEvent> quicPollEventAllocator("quicPollEvent");
src/iocore/net/QUICNetVConnection.cc:ClassAllocator<QUICNetVConnection> quicNetVCAllocator("quicNetVCAllocator");
```
But I don't think these are the cause.
The heaviest user of the freelist in this benchmark is probably udpPacketAllocator (and the ioBlockAllocator used by the UDPPacket class). With roughly 6.8 million UDP datagrams received (per the h2load output above), every received datagram likely costs at least one UDPPacket alloc/free round trip through the global freelist.
I see, it looks like there are proxy allocators named quic instead of http3:
```
wkaras ~/LOCAL_REPOS/TS
$ grep -F ProxyAllocator $(findsrc) | grep -Fi quic
./include/iocore/eventsystem/Thread.h:  ProxyAllocator quicNetVCAllocator;
./include/iocore/eventsystem/Thread.h:  ProxyAllocator quicClientSessionAllocator;
wkaras ~/LOCAL_REPOS/TS
$
```
Maybe we need a corresponding ProxyAllocator for the udpPacketAllocator class allocator. It would be interesting to compare the benchmark hotspots with and without `-f`; see the toy comparison sketched below.
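As a rough illustration of the difference such a change targets, here is a toy, single-threaded microbenchmark. It is not ATS code and all names are invented; being single-threaded it understates lock contention, and it models only the synchronization cost of the shared freelist, not cache effects.

```cpp
#include <chrono>
#include <cstdio>
#include <mutex>

struct Node {
  Node *next = nullptr;
};

int main() {
  constexpr int N = 10'000'000;
  using clk = std::chrono::steady_clock;
  Node one; // one recycled "packet" is enough to exercise the hot path

  // Shared-freelist model: every alloc/free pair takes a lock twice (the
  // real lock-free freelist pays contended CAS operations instead).
  std::mutex lock;
  Node *shared_head = &one;
  auto t0 = clk::now();
  for (int i = 0; i < N; ++i) {
    Node *n;
    {
      std::lock_guard<std::mutex> g(lock); // "alloc"
      n = shared_head;
      shared_head = n->next;
    }
    std::lock_guard<std::mutex> g(lock); // "free"
    n->next = shared_head;
    shared_head = n;
  }
  double shared_ms = std::chrono::duration<double, std::milli>(clk::now() - t0).count();

  // Per-thread-cache model: plain pointer operations, no synchronization.
  // volatile keeps the optimizer from deleting the unsynchronized loop.
  Node *volatile tl_head = &one;
  auto t1 = clk::now();
  for (int i = 0; i < N; ++i) {
    Node *n = tl_head; // "alloc"
    tl_head = n->next;
    n->next = tl_head; // "free"
    tl_head = n;
  }
  double cached_ms = std::chrono::duration<double, std::milli>(clk::now() - t1).count();

  std::printf("%d alloc/free pairs: shared freelist %.1f ms, per-thread cache %.1f ms\n",
              N, shared_ms, cached_ms);
  return 0;
}
```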
Yes, that was suggested a few years ago, and nobody has tried it, even though that doesn't require any protocol knowledge.
I think we are not sure that proxy and class allocators improve performance when the underlying general-purpose allocator already uses per-thread arenas (jemalloc, for example).
Yup, that's one of the reasons I stopped using them for new H3 and QUIC code (using them requires extra code, and there was an issue where constructors/destructors are not called).
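For context on the constructor/destructor issue: a freelist recycles raw storage, so object lifetimes have to be managed by hand with placement-new and explicit destructor calls. Here is a contrived, self-contained example of the hazard; `RawPool` and `Session` are invented names, not ATS code.

```cpp
#include <cstddef>
#include <new>
#include <string>
#include <vector>

// A recycled-memory pool that, like a freelist, hands back raw storage and
// never runs constructors or destructors for the stored type itself.
struct RawPool {
  std::vector<void *> free_;

  void *alloc(std::size_t n) {
    if (!free_.empty()) {
      void *p = free_.back();
      free_.pop_back();
      return p; // recycled: previous occupant's dtor must already have run
    }
    return ::operator new(n);
  }

  void free(void *p) { free_.push_back(p); } // note: no destructor call here
};

struct Session {
  std::string peer; // owns heap memory; needs its destructor to run
};

int main() {
  RawPool pool;

  // Correct use needs explicit placement-new and destructor calls:
  auto *s = new (pool.alloc(sizeof(Session))) Session{};
  s->peer = "peer.example";
  s->~Session(); // forgetting this leaks peer's heap buffer, and reusing the
  pool.free(s);  // slot without placement-new hands out an invalid object
  return 0;
}
```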
Can you please share the Traffic Server config (records.yaml) used for this benchmark?
And which branch was used for this benchmark? master or 10.0.x?
I did some tests locally on my small NUC. I couldn't get a very similar result, but allocation seems to be a hotspot anyway:
https://gist.github.com/brbzull0/19bcd10f135057d66a9540581c8b54b6