iperf
iperf3 single-stream low bandwidth with small message sizes (1KB, 1500B, 2000B, etc.)
Context
- Version of iperf3: iperf 3.5
- Hardware: Nvidia ConnectX-6 Dx 100GbE, Intel XXV710 25GbE
- Operating system (and distribution, if any): Red Hat Enterprise Linux Server release 7.5 (Maipo), kernel 5.10.0-rc4; CentOS Linux release 8.1.1911 (Core), kernel 5.2.9
Bug Report
We are observing low performance with iperf3 while sending single-stream traffic with small message sizes. Other benchmarks report significantly better results for the same scenario, so we assume it is iperf3 that is limiting the performance we observe.
For example:
Running iperf3 -c fd12:fd07::200:105 -t 10 -l1500B
we see ~7.9Gbps.
Running the same test with 1 stream and 1500B with iperf2 we see 23Gbps, which is what we would expect with the current system.
cmd: iperf -c fd12:fd07::200:105 -V -l1500 -P1
- Expected Behavior
iperf3 should be able to achieve at least the same performance reported by iperf2 for a single stream.
- Actual Behavior
Iperf3 small message size performance is low. Iperf3 is the limiting factor.
- Steps to Reproduce
Send traffic with a message size of 1500B (on a high-performance NIC over 10/25Gbps).
- Possible Solution
I've noted that iperf2 suffers a similar issue when running with the '-i' flag; might this be related to the reporting/results-gathering flow?
@noamsto, please re-run the test with the following options and let us know the results:
- Server verbose and debug (-d -V) options, to see the size of the send/receive buffers.
- If you suspect that the statistics reports may be an issue, run both server and client with no interval statistics: -i 0.
- What is the throughput if you use the default message size, -l 128K?
Hi @davidBar-On,
- Attached is the output from running the server command iperf3 -Vsd: iperf3_sVd.txt
- I did suspect that, but the results are the same with the -i0 flag on both client and server:
- With the default message size of 128K all is good:
Hi @noamsto, this is interesting. It is expected that throughput will decrease when message size is decreased. Therefore, it is not clear what iperf2 is doing to keep the same throughput with small messages.
Apart from overhead in iperf3's internal processing of the TCP messages, there are two directions to investigate:
- Window size: can you try running iperf3 with a larger window size, e.g. -w 512K? If that helps, you can further increase the window size to see how it affects performance.
- Congestion: if increasing the window size doesn't have a significant effect, then for some reason there may be a large delay in receiving the ACKs from the server. That causes the client to retry sending packets, and if that happens, throughput is significantly reduced. Do you have a way to log the network data (preferably using Wireshark) to see whether there are a lot of retries?
One more thing to try (in addition to the window size and Wireshark above) is iperf3 burst. See issue #899. Sending packets in bursts has less iperf3 internal overhead, so this may help to understand whether the difference between iperf2 and iperf3 throughput is related to internal processing.
Can you try running the client with the -b 0/100 option? This will cause the client to send bursts of 100 packets. If that seems to have an impact, try increasing the number of packets in the burst, e.g. -b 0/500.
Hi @davidBar-On,
- Tried changing the window size to 512K; it made a slight improvement, but it's still not good enough:
- I can provide Wireshark output, but as far as iperf3's reports and nstat | grep -i retrans show, we have as little as 1 retransmission for the whole test (10s).
- Running with the burst flag -b 0/N didn't improve the results either.
Another indication that iperf3 might have an issue here is that netperf also reports a much higher BW for the same case:
@noamsto, thanks for the input. As none of the options I suggested helped, maybe the issue is related to CPU usage by iperf3. Can you run both client and server with the --verbose option and send the reported CPU utilization of both (reported by the client at the end of the test)? If the issue is related to iperf3 performance, then the %cpu should be high.
It would also help if you used the latest iperf3 version (3.9). Version 3.5 is from the beginning of 2018, and it would be difficult to evaluate the issue using test inputs from a relatively old version.
Hi @davidBar-On, sorry for the long delay.
I've tested with version 3.9 and the behavior is still similar: 128K -> ~20Gbps, 1500B -> ~2Gbps. Collected some CPU statistics:
Message Size | CPU % (TX) | CPU % (RX) |
---|---|---|
128K | 51% | 64.5% |
1500B | 39% | 58% |
Seems like the CPU is not working harder with 1500B, although we would expect it to. Also, I've gathered the number of interrupts fired on the TX side for both cases:
Message Size | Interrupts # (TX) |
---|---|
128K | >65k |
1500B | ~17K |
Here I would expect smaller message sizes -> more Interrupts.
Maybe Iperf3 is not generating enough work for the CPUs when the message size is small?
> Maybe iperf3 is not generating enough work for the CPUs when the message size is small?
I agree that somehow this is the case. The following two tests may help to get better insight into the issue:
- Use several parallel streams, e.g. -P 10 for 10 parallel streams. This is somewhat similar to increasing the burst size, but if the issue is related to a specific TCP stream (delayed ACKs, re-transmissions, etc.) there will be a difference in the total throughput.
- Try using UDP (-u) and see what the maximum throughput is on the receive side. That throughput may help to understand whether the issue is related to TCP or to other system limitations.
Hi, I recently hit this "issue" and had the chance to do some debugging of the iperf3 implementation.
$ iperf3 -v
iperf 3.11 (cJSON 1.7.13)
Linux wonderland.rsevilla.org 6.2.10-200.fc37.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Apr 6 23:30:41 UTC 2023 x86_64
Optional features available: CPU affinity setting, IPv6 flow label, SCTP, TCP congestion algorithm setting, sendfile / zerocopy, socket pacing, authentication, bind to device, support IPv4 don't fragment
The main problem with iperf3 in small-packet-size scenarios is that iperf3's server implementation performs too many select syscalls, namely one per packet received from the sender:
https://github.com/esnet/iperf/blob/10b1797714c231f30b354cd3335cf1d709bc4904/src/iperf_server_api.c#L530-L534
These syscalls don't come for free, and they have a CPU impact on the process. On the other hand, the client is not that affected by this behavior, since multisend is enabled and configured to 10 when no bandwidth rate is specified:
https://github.com/esnet/iperf/blob/10b1797714c231f30b354cd3335cf1d709bc4904/src/iperf_api.c#L1882-L1888
https://github.com/esnet/iperf/blob/10b1797714c231f30b354cd3335cf1d709bc4904/src/iperf_api.c#L3150
The above means that the client side performs roughly one select per 10 writes, unlike the server side, where the ratio is one select per read (1:1).
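To make the syscall ratios concrete, here is a minimal, simplified sketch (not the actual iperf3 code; the helper names recv_one_per_select and send_burst_per_select are made up for illustration) of the two loop shapes described above:

```c
#include <sys/select.h>
#include <unistd.h>

/* Server-style loop body: one select() per read(), i.e. a 1:1 syscall ratio. */
static ssize_t recv_one_per_select(int fd, char *buf, size_t len)
{
    fd_set rfds;
    struct timeval tv = { .tv_sec = 0, .tv_usec = 100000 }; /* example timeout */

    FD_ZERO(&rfds);
    FD_SET(fd, &rfds);
    if (select(fd + 1, &rfds, NULL, NULL, &tv) <= 0)
        return 0;                      /* timeout or error: nothing read */
    return read(fd, buf, len);         /* exactly one read per select */
}

/* Client-style loop body: up to 'burst' writes per select(), roughly 1:burst. */
static ssize_t send_burst_per_select(int fd, const char *buf, size_t len, int burst)
{
    fd_set wfds;
    ssize_t total = 0;

    FD_ZERO(&wfds);
    FD_SET(fd, &wfds);
    if (select(fd + 1, NULL, &wfds, NULL, NULL) <= 0)
        return -1;
    for (int i = 0; i < burst; i++) {  /* the select cost is amortized over the burst */
        ssize_t n = write(fd, buf, len);
        if (n <= 0)
            break;
        total += n;
    }
    return total;
}
```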
Running a simple test against localhost, we can observe this behavior:
$ iperf3 -l 64B localhost -t 5s -c localhost
Connecting to host localhost, port 5201
[ 5] local ::1 port 44422 connected to ::1 port 5201
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 46.7 MBytes 392 Mbits/sec 0 320 KBytes
[ 5] 1.00-2.00 sec 42.7 MBytes 359 Mbits/sec 0 320 KBytes
[ 5] 2.00-3.00 sec 41.8 MBytes 350 Mbits/sec 0 320 KBytes
[ 5] 3.00-4.00 sec 41.1 MBytes 345 Mbits/sec 0 320 KBytes
[ 5] 4.00-5.00 sec 40.3 MBytes 338 Mbits/sec 0 320 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-5.00 sec 213 MBytes 357 Mbits/sec 0 sender
[ 5] 0.00-5.00 sec 211 MBytes 354 Mbits/sec receiver
iperf Done.
And tracing the server side:
$ sudo /usr/share/bcc/tools/syscount -L -p 1640422
Tracing syscalls, printing top 10... Ctrl+C to quit.
[12:17:37]
SYSCALL COUNT TIME (us)
read 3458349 1864759.911
pselect6 3459046 1759740.940
write 23 336.073
accept 2 63.444
As shown above, the number of pselect6 syscalls is very close to the number of read syscalls, and they add a latency of ~1.76s to this 5s test.
Reducing the number of select syscalls the server side performs should be the way to go to optimize the performance of this scenario.
As a side note, it's possible to improve the server's performance by configuring select's timeout argument to NULL or 0 (which reduces select's minimum polling interval to 0).
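For reference, a minimal sketch of the POSIX select() timeout semantics involved (poll_readable is just an illustrative helper, not iperf3 code):

```c
#include <sys/select.h>

/* select() timeout semantics (POSIX):
 *  - timeout == NULL     -> block until a descriptor becomes ready
 *  - timeout == {0, 0}   -> return immediately, whether or not data is ready
 *  - timeout == {s, us}  -> sleep up to that long if nothing is ready
 * The patch further below effectively replaces the configured timeout with {0, 0}.
 */
static int poll_readable(int fd)
{
    fd_set rfds;
    struct timeval zero = { 0, 0 };

    FD_ZERO(&rfds);
    FD_SET(fd, &rfds);
    return select(fd + 1, &rfds, NULL, NULL, &zero); /* never sleeps */
}
```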
Default values:
$ taskset -c 1 iperf3 -l 64B localhost -t 30s -c localhost
Connecting to host localhost, port 5201
[ 5] local ::1 port 33346 connected to ::1 port 5201
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 60.8 MBytes 510 Mbits/sec 0 320 KBytes
[ 5] 1.00-2.00 sec 55.5 MBytes 466 Mbits/sec 0 320 KBytes
[ 5] 2.00-3.00 sec 54.7 MBytes 459 Mbits/sec 0 320 KBytes
[ 5] 3.00-4.00 sec 53.4 MBytes 448 Mbits/sec 0 320 KBytes
[ 5] 4.00-5.00 sec 52.3 MBytes 439 Mbits/sec 0 320 KBytes
[ 5] 5.00-6.00 sec 51.9 MBytes 435 Mbits/sec 0 320 KBytes
[ 5] 6.00-7.00 sec 51.7 MBytes 433 Mbits/sec 0 320 KBytes
[ 5] 7.00-8.00 sec 50.5 MBytes 424 Mbits/sec 0 320 KBytes
[ 5] 8.00-9.00 sec 51.5 MBytes 432 Mbits/sec 0 320 KBytes
[ 5] 9.00-10.00 sec 51.1 MBytes 429 Mbits/sec 0 320 KBytes
[ 5] 10.00-11.00 sec 51.0 MBytes 428 Mbits/sec 0 320 KBytes
[ 5] 11.00-12.00 sec 48.7 MBytes 408 Mbits/sec 0 320 KBytes
[ 5] 12.00-13.00 sec 46.7 MBytes 392 Mbits/sec 0 320 KBytes
[ 5] 13.00-14.00 sec 48.6 MBytes 408 Mbits/sec 0 320 KBytes
[ 5] 14.00-15.00 sec 49.0 MBytes 411 Mbits/sec 0 320 KBytes
[ 5] 15.00-16.00 sec 48.0 MBytes 403 Mbits/sec 1 320 KBytes
[ 5] 16.00-17.00 sec 49.6 MBytes 416 Mbits/sec 0 320 KBytes
[ 5] 17.00-18.00 sec 46.7 MBytes 392 Mbits/sec 0 320 KBytes
[ 5] 18.00-19.00 sec 49.2 MBytes 413 Mbits/sec 0 320 KBytes
[ 5] 19.00-20.00 sec 48.9 MBytes 410 Mbits/sec 0 320 KBytes
[ 5] 20.00-21.00 sec 48.2 MBytes 404 Mbits/sec 0 320 KBytes
[ 5] 21.00-22.00 sec 46.8 MBytes 393 Mbits/sec 0 320 KBytes
[ 5] 22.00-23.00 sec 46.6 MBytes 391 Mbits/sec 0 320 KBytes
[ 5] 23.00-24.00 sec 47.9 MBytes 402 Mbits/sec 0 320 KBytes
[ 5] 24.00-25.00 sec 48.6 MBytes 407 Mbits/sec 0 320 KBytes
[ 5] 25.00-26.00 sec 47.0 MBytes 394 Mbits/sec 0 320 KBytes
[ 5] 26.00-27.00 sec 47.3 MBytes 397 Mbits/sec 0 320 KBytes
[ 5] 27.00-28.00 sec 47.2 MBytes 396 Mbits/sec 0 320 KBytes
[ 5] 28.00-29.00 sec 44.6 MBytes 374 Mbits/sec 0 320 KBytes
[ 5] 29.00-30.00 sec 46.4 MBytes 389 Mbits/sec 0 320 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-30.00 sec 1.46 GBytes 417 Mbits/sec 1 sender
[ 5] 0.00-30.00 sec 1.45 GBytes 416 Mbits/sec receiver
With this small patch:
$ git diff
diff --git a/src/iperf_server_api.c b/src/iperf_server_api.c
index 18f105d..3c7f637 100644
--- a/src/iperf_server_api.c
+++ b/src/iperf_server_api.c
@@ -516,8 +516,8 @@ iperf_run_server(struct iperf_test *test)
} else if (test->mode != SENDER) { // In non-reverse active mode server ensures data is received
timeout_us = -1;
if (timeout != NULL) {
- used_timeout.tv_sec = timeout->tv_sec;
- used_timeout.tv_usec = timeout->tv_usec;
+ used_timeout.tv_sec = 0;
+ used_timeout.tv_usec = 0;
timeout_us = (timeout->tv_sec * SEC_TO_US) + timeout->tv_usec;
}
if (timeout_us < 0 || timeout_us > rcv_timeout_us) {
Client side:
$ taskset -c 1 iperf3 -l 64B localhost -t 30s -c localhost
Connecting to host localhost, port 5201
[ 5] local ::1 port 33844 connected to ::1 port 5201
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 77.3 MBytes 649 Mbits/sec 0 320 KBytes
[ 5] 1.00-2.00 sec 74.1 MBytes 621 Mbits/sec 0 320 KBytes
[ 5] 2.00-3.00 sec 67.6 MBytes 567 Mbits/sec 0 320 KBytes
[ 5] 3.00-4.00 sec 69.5 MBytes 583 Mbits/sec 0 320 KBytes
[ 5] 4.00-5.00 sec 68.8 MBytes 577 Mbits/sec 0 320 KBytes
[ 5] 5.00-6.00 sec 67.2 MBytes 564 Mbits/sec 0 320 KBytes
[ 5] 6.00-7.00 sec 66.8 MBytes 561 Mbits/sec 0 320 KBytes
[ 5] 7.00-8.00 sec 62.9 MBytes 528 Mbits/sec 0 320 KBytes
[ 5] 8.00-9.00 sec 64.4 MBytes 540 Mbits/sec 0 320 KBytes
[ 5] 9.00-10.00 sec 65.1 MBytes 546 Mbits/sec 0 320 KBytes
[ 5] 10.00-11.00 sec 63.4 MBytes 532 Mbits/sec 0 320 KBytes
[ 5] 11.00-12.00 sec 64.6 MBytes 542 Mbits/sec 0 320 KBytes
[ 5] 12.00-13.00 sec 64.1 MBytes 537 Mbits/sec 0 320 KBytes
[ 5] 13.00-14.00 sec 64.2 MBytes 538 Mbits/sec 0 320 KBytes
[ 5] 14.00-15.00 sec 64.0 MBytes 537 Mbits/sec 0 320 KBytes
[ 5] 15.00-16.00 sec 62.1 MBytes 521 Mbits/sec 0 320 KBytes
[ 5] 16.00-17.00 sec 60.4 MBytes 507 Mbits/sec 0 320 KBytes
[ 5] 17.00-18.00 sec 62.2 MBytes 522 Mbits/sec 0 320 KBytes
[ 5] 18.00-19.00 sec 62.4 MBytes 523 Mbits/sec 0 320 KBytes
[ 5] 19.00-20.00 sec 61.5 MBytes 516 Mbits/sec 0 320 KBytes
[ 5] 20.00-21.00 sec 60.9 MBytes 511 Mbits/sec 0 320 KBytes
[ 5] 21.00-22.00 sec 62.7 MBytes 526 Mbits/sec 0 320 KBytes
[ 5] 22.00-23.00 sec 61.7 MBytes 517 Mbits/sec 0 320 KBytes
[ 5] 23.00-24.00 sec 61.9 MBytes 519 Mbits/sec 0 320 KBytes
[ 5] 24.00-25.00 sec 62.4 MBytes 523 Mbits/sec 0 320 KBytes
[ 5] 25.00-26.00 sec 61.8 MBytes 519 Mbits/sec 0 320 KBytes
[ 5] 26.00-27.00 sec 56.1 MBytes 471 Mbits/sec 0 320 KBytes
[ 5] 27.00-28.00 sec 62.0 MBytes 520 Mbits/sec 0 320 KBytes
[ 5] 28.00-29.00 sec 61.8 MBytes 518 Mbits/sec 0 320 KBytes
[ 5] 29.00-30.00 sec 61.5 MBytes 516 Mbits/sec 0 320 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-30.00 sec 1.88 GBytes 538 Mbits/sec 0 sender
[ 5] 0.00-30.00 sec 1.88 GBytes 538 Mbits/sec receiver
I haven't analyzed the impact this change could have on other workloads, so just take it as an example.
Default | Patched | Delta |
---|---|---|
417 Mbits/sec | 538 Mbits/sec | 29% |
Hi @rsevilla87, very good and useful analysis! I tried your suggested change on my computer, and indeed the throughput is increased dramatically (in my case from 70Mbps to 93Mbps).
From what you found, I think a "receiving burst" option should be added to iperf3, i.e., the select() timeout will be used (non-zero) only every "burst" number of times. Would you like to submit a PR with such proposed changes?
If you submit such a PR, please note the following:
- I suggest adding the receive burst number as a third optional parameter to the -b option (the sending burst number is the second optional parameter), i.e. -b #[KMG][/#][/#]. Note that none of the values should be mandatory, i.e. //5 should be possible for setting only the receive burst to 5. The -b code is here. Help should also be updated here.
- The default of the receive burst should be 1, to keep backward compatibility (see the places where settings->burst is set).
- get/set functions should be added for the new option.
- The used_timeout should not be set to zero every "receive burst" number of reads (with the default of 1, it will never be set to zero); see the sketch after this list.
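A hypothetical sketch of that timeout logic, under the assumptions above (the names burst_timeout, receive_burst, and rcv_count are illustrative and do not exist in iperf3):

```c
#include <sys/select.h>
#include <string.h>

/* Return the timeout to pass to select(): honor the configured timeout only
 * once every 'receive_burst' reads; otherwise use a zero timeout so back-to-back
 * reads are not throttled. With receive_burst == 1 (the default), the configured
 * timeout is always used, preserving the current behavior. */
static struct timeval *burst_timeout(struct timeval *configured,
                                     struct timeval *scratch,
                                     int receive_burst,
                                     unsigned long rcv_count)
{
    if (configured == NULL || receive_burst <= 1 ||
        (rcv_count % (unsigned long)receive_burst) == 0)
        return configured;              /* honor the normal timeout */
    memset(scratch, 0, sizeof(*scratch));
    return scratch;                      /* {0, 0}: poll, don't sleep */
}
```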
There's some interesting and worthy analysis going on here!
I kind of wonder if the multi-threaded iperf3 (on the mt branch, currently in public beta, and eventually planned to be merged to the main codeline) is going to render this moot, or at least sufficiently change the problem (and maybe the solution). So it might be better to hold off on trying to fix this problem inside the current iperf3 implementation you see on the master branch.
To wit:
According to the above, one of the leading factors limiting iperf3 performance is a large number of select(2) calls and their impact on the sending of test data. This comes directly from an early design decision to have iperf3 run as a single thread. Because of this, the iperf3 process can't block in send() or recv() type system calls, because there are multiple sockets that need servicing, as well as various timers. This basically forces the use of select(2) with some timeout values.
The multi-threaded iperf3 assigns a different thread to every I/O stream. Because every stream/connection has its own dedicated thread, that thread can be allowed to block and we no longer need to do select(2) calls inside the threads doing I/O. We only use select(2) in the main thread, which manages the control connection and reporting.
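A rough sketch of that threading model (this is not the actual mt-branch code; stream_receiver is an illustrative name):

```c
#include <pthread.h>
#include <unistd.h>

struct stream { int fd; volatile int done; };

/* Each test stream gets a dedicated receiver thread that may block in read();
 * no select() is needed on the data path. The main thread keeps using select()
 * only for the control connection and reporting. */
static void *stream_receiver(void *arg)
{
    struct stream *s = arg;
    char buf[64 * 1024];

    while (!s->done) {
        ssize_t n = read(s->fd, buf, sizeof(buf)); /* blocks until data arrives */
        if (n <= 0)
            break;
        /* ...update per-stream byte/interval counters here... */
    }
    return NULL;
}

/* Per stream: pthread_create(&tid, NULL, stream_receiver, s); */
```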
Note that in general, small messages will still be less efficient than larger ones. That's generally true for almost all I/O. In fact, there are iperf3 use cases that rely on this behavior to simulate different applications' performance.
@davidBar-On @bmah888 Thanks for your thoughts! To give you some context, the root of this issue is that I have been trying to characterize network throughput/latency performance in different scenarios by comparing the results from different perf tools like netperf, uperf, and iperf3. iperf3 turned out to show much lower performance in small-packet-size scenarios compared with netperf or uperf. As I demonstrate below, the maximum throughput achieved by a single-threaded test is around 4.9 Gbps with both tools; however, I had to increase the packet size up to 8192 bytes on the iperf3 client to achieve similar performance.
Keep in mind that the uperf test was also single-threaded.
I've taken a look at the source code of these tools to find the main differences on the receiver side, and they are not using select to poll the socket fd.
I wonder why iperf3 uses it? I think the server side could avoid that amount of select syscalls, as read() is already a blocking operation that waits for socket data to become available.
I believe I found the root cause of the iperf3 low performance with small message sizes. While iperf3 uses the same send and receive message sizes, iperf2 uses different message lengths for the client and the server. That is, although the iperf2 client sends 1500-byte messages, the server receives 128KB (the default size) messages. I believe netperf's behavior is similar, based on the 13K "Recv socket size bytes" and the 1500 "Send message size bytes" in its report titles.
I tried a version of iperf3 that reads 10 times the message size, i.e. sending 1500-byte messages and receiving 15,000-byte messages. Throughput improved by 35% for a single-stream test and by over 50% for multi-stream tests.
Submitted PR #1691 with a suggested enhancement: each TCP receive reads "burst * message length" bytes.
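A minimal sketch of that idea (illustrative only; this is not the PR's actual code): the receiver reads into a buffer sized "burst * message length", so a single read() can drain several client-side messages at once.

```c
#include <stdlib.h>
#include <unistd.h>

/* Read up to burst * blksize bytes in one call, e.g. 10 * 1500 B = 15000 B. */
static ssize_t recv_burst(int fd, size_t blksize, int burst)
{
    size_t want = blksize * (size_t)burst;
    char *buf = malloc(want);
    ssize_t n;

    if (buf == NULL)
        return -1;
    n = read(fd, buf, want);   /* may return anywhere from 1 byte up to 'want' */
    /* ...account the received bytes toward the test's counters... */
    free(buf);
    return n;
}
```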