iperf
iperf3 single-stream low bandwidth with small message sizes (1KB, 1500B, 2000B, etc.)
Context
- Version of iperf3: iperf 3.5
- Hardware: Nvidia ConnectX-6 Dx 100GbE, Intel XXV710 25GbE
- Operating system (and distribution, if any): Red Hat Enterprise Linux Server release 7.5 (Maipo), kernel 5.10.0-rc4; CentOS Linux release 8.1.1911 (Core), kernel 5.2.9
Bug Report
We are observing low performance with iperf3 while sending single-stream traffic with small message sizes. Other benchmarks report significantly better results for the same scenario, so we assume it is iperf3 that is limiting the performance we observe.
For example:
Running iperf3 -c fd12:fd07::200:105 -t 10 -l1500B
we see ~7.9Gbps.
Running the same test with 1 stream and 1500B with iperf2 we see 23Gbps, which is what we would expect with the current system.
cmd: iperf -c fd12:fd07::200:105 -V -l1500 -P1
- Expected Behavior
iperf3 should be able to achieve at least the same performance reported by iperf2 for a single stream.
- Actual Behavior
Iperf3 small message size performance is low. Iperf3 is the limiting factor.
- Steps to Reproduce
Send traffic with a message size of 1500B (on a high-performance NIC over 10/25Gbps).
- Possible Solution
I've noted that iperf2 suffers a similar issue when running with the '-i' flag; might this be related to the reporting/results-gathering flow?
@noamsto, please re-run the test with the following options and let us know the results:
- Server verbose and debug (-d -V) options, to see the size of the send/receive buffers.
- If you suspect that the statistics reports may be an issue, run both server and client with no interval statistics: -i 0.
- What is the throughput if you use the default message size, -l 128K?
Hi @davidBar-On,
- Attached is the output from running the server command iperf3 -Vsd: iperf3_sVd.txt
- I did suspect that, but the results are the same with the -i0 flag on both client and server:
- With the default message size of 128K all is good:
Hi @noamsto, this is interesting. It is expected that throughput will decrease when message size is decreased. Therefore, it is not clear what iperf2 is doing to keep the same throughput with small messages.
Apart from overhead in iperf3's internal processing of the TCP messages, there are two directions to investigate:
- Window size: can you try running iperf3 with a larger window size, e.g. -w 512K? If that helps, you can further increase the window size to see how it affects performance.
- Congestion: if increasing the window size doesn't have a significant effect, then for some reason there may be a large delay in receiving the ACKs from the server. That causes the client to retry sending packets, and if that happens, throughput is significantly reduced. Do you have a way to log the network data (preferably using Wireshark) to see whether there are a lot of retries?
One more thing to try (in addition to the window size and Wireshark above) is iperf3 burst. See issue #899. Sending packets in bursts has less iperf3 internal overhead, so this may help to understand whether the difference between iperf2 and iperf3 throughput is related to internal processing.
Can you try running the client with the -b 0/100 option? This will cause the client to send bursts of 100 packets. If that seems to have an impact, try increasing the number of packets in the burst, e.g. -b 0/500.
Hi @davidBar-On,
- Tried changing the window size to 512K; it made a slight improvement, but it's still not good enough:
- I can provide Wireshark output, but as far as iperf3's reports and nstat | grep -i retrans show, we have as little as 1 retransmission for the whole test (10s).
- Running with the burst flag -b 0/N didn't improve the results either.
Another indication that iperf3 might have an issue here is that netperf also reports a much higher BW for the same case:
@noamsto, thanks for the input. As none of the options I suggested helped, maybe the issue is related to CPU usage by iperf3. Can you run both client and server with the --verbose option and send the reported CPU utilization of both (reported by the client at the end of the test)? If the issue is related to iperf3 performance, then the %cpu should be high.
It would also help if you used the latest iperf3 version (3.9). Version 3.5 is from the beginning of 2018, and it would be difficult to evaluate the issue using test inputs from a relatively old version.
Hi @davidBar-On, sorry for the long delay.
I've tested with version 3.9 and the behavior is still similar: 128K -> ~20Gbps, 1500B -> ~2Gbps. Collected some CPU statistics:
Message Size | CPU % (TX) | CPU % (RX) |
---|---|---|
128K | 51% | 64.5% |
1500B | 39% | 58% |
Seems like the CPU is not working harder with 1500B, although we would expect it to. Also, I've gathered the number of interrupts fired on the TX side for both cases:
Message Size | Interrupts # (TX) |
---|---|
128K | >65k |
1500B | ~17K |
Here I would expect smaller message sizes -> more Interrupts.
Maybe Iperf3 is not generating enough work for the CPUs when the message size is small?
> Maybe iperf3 is not generating enough work for the CPUs when the message size is small?
I agree that somehow this is the case. The following two tests may help to get better insight into the issue:
- Use several parallel streams, e.g. -P 10 for 10 parallel streams. This is somewhat similar to increasing the burst size, but if the issue is related to a specific TCP stream (delayed ACKs, re-transmissions, etc.) there will be a difference in the total throughput.
- Try using UDP (-u) and see what the maximum throughput is on the receive side. That throughput may help to understand whether the issue is related to TCP or to other system limitations.
Hi, I recently hit this "issue" and had the chance to do some debugging of the iperf3 implementation.
$ iperf3 -v
iperf 3.11 (cJSON 1.7.13)
Linux wonderland.rsevilla.org 6.2.10-200.fc37.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Apr 6 23:30:41 UTC 2023 x86_64
Optional features available: CPU affinity setting, IPv6 flow label, SCTP, TCP congestion algorithm setting, sendfile / zerocopy, socket pacing, authentication, bind to device, support IPv4 don't fragment
The main problem with iperf3 in small-packet-size scenarios is that iperf3's server implementation performs too many select syscalls, namely one per packet received from the sender:
https://github.com/esnet/iperf/blob/10b1797714c231f30b354cd3335cf1d709bc4904/src/iperf_server_api.c#L530-L534
These syscalls don't come for free, and they have a CPU impact on the process. On the other hand, the client is not that affected by this behavior, since multisend is enabled and configured to 10 when no bandwidth rate is specified:
https://github.com/esnet/iperf/blob/10b1797714c231f30b354cd3335cf1d709bc4904/src/iperf_api.c#L1882-L1888
https://github.com/esnet/iperf/blob/10b1797714c231f30b354cd3335cf1d709bc4904/src/iperf_api.c#L3150
The above means that the client side performs roughly one select per 10 writes, unlike the server side, where the ratio is one select per read (1:1).
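To make the syscall ratios concrete, here is a minimal, simplified sketch (not the actual iperf3 code; the helper names recv_one_per_select and send_burst_per_select are made up for illustration) of the two loop shapes described above:

```c
#include <sys/select.h>
#include <unistd.h>

/* Server-style loop body: one select() per read(), i.e. a 1:1 syscall ratio. */
static ssize_t recv_one_per_select(int fd, char *buf, size_t len)
{
    fd_set rfds;
    struct timeval tv = { .tv_sec = 0, .tv_usec = 100000 }; /* example timeout */

    FD_ZERO(&rfds);
    FD_SET(fd, &rfds);
    if (select(fd + 1, &rfds, NULL, NULL, &tv) <= 0)
        return 0;                      /* timeout or error: nothing read */
    return read(fd, buf, len);         /* exactly one read per select */
}

/* Client-style loop body: up to 'burst' writes per select(), roughly 1:burst. */
static ssize_t send_burst_per_select(int fd, const char *buf, size_t len, int burst)
{
    fd_set wfds;
    ssize_t total = 0;

    FD_ZERO(&wfds);
    FD_SET(fd, &wfds);
    if (select(fd + 1, NULL, &wfds, NULL, NULL) <= 0)
        return -1;
    for (int i = 0; i < burst; i++) {  /* the select cost is amortized over the burst */
        ssize_t n = write(fd, buf, len);
        if (n <= 0)
            break;
        total += n;
    }
    return total;
}
```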
Running a simple test against localhost, we can observe this behavior:
$ iperf3 -l 64B localhost -t 5s -c localhost
Connecting to host localhost, port 5201
[ 5] local ::1 port 44422 connected to ::1 port 5201
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 46.7 MBytes 392 Mbits/sec 0 320 KBytes
[ 5] 1.00-2.00 sec 42.7 MBytes 359 Mbits/sec 0 320 KBytes
[ 5] 2.00-3.00 sec 41.8 MBytes 350 Mbits/sec 0 320 KBytes
[ 5] 3.00-4.00 sec 41.1 MBytes 345 Mbits/sec 0 320 KBytes
[ 5] 4.00-5.00 sec 40.3 MBytes 338 Mbits/sec 0 320 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-5.00 sec 213 MBytes 357 Mbits/sec 0 sender
[ 5] 0.00-5.00 sec 211 MBytes 354 Mbits/sec receiver
iperf Done.
And tracing the server side:
$ sudo /usr/share/bcc/tools/syscount -L -p 1640422
Tracing syscalls, printing top 10... Ctrl+C to quit.
[12:17:37]
SYSCALL COUNT TIME (us)
read 3458349 1864759.911
pselect6 3459046 1759740.940
write 23 336.073
accept 2 63.444
As shown above, the number of pselect6 syscalls is very close to the number of read syscalls, and they add a latency of ~1.76s to this 5s test.
Reducing the number of select syscalls the server side performs should be the way to go to optimize the performance of this scenario.
As a side note, it's possible to improve the server's performance by configuring select's timeout argument to NULL or 0 (which reduces select's minimum polling interval to 0).
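For reference, a minimal sketch of the POSIX select() timeout semantics involved (poll_readable is just an illustrative helper, not iperf3 code):

```c
#include <sys/select.h>

/* select() timeout semantics (POSIX):
 *  - timeout == NULL     -> block until a descriptor becomes ready
 *  - timeout == {0, 0}   -> return immediately, whether or not data is ready
 *  - timeout == {s, us}  -> sleep up to that long if nothing is ready
 * The patch further below effectively replaces the configured timeout with {0, 0}.
 */
static int poll_readable(int fd)
{
    fd_set rfds;
    struct timeval zero = { 0, 0 };

    FD_ZERO(&rfds);
    FD_SET(fd, &rfds);
    return select(fd + 1, &rfds, NULL, NULL, &zero); /* never sleeps */
}
```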
Default values:
$ taskset -c 1 iperf3 -l 64B localhost -t 30s -c localhost
Connecting to host localhost, port 5201
[ 5] local ::1 port 33346 connected to ::1 port 5201
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 60.8 MBytes 510 Mbits/sec 0 320 KBytes
[ 5] 1.00-2.00 sec 55.5 MBytes 466 Mbits/sec 0 320 KBytes
[ 5] 2.00-3.00 sec 54.7 MBytes 459 Mbits/sec 0 320 KBytes
[ 5] 3.00-4.00 sec 53.4 MBytes 448 Mbits/sec 0 320 KBytes
[ 5] 4.00-5.00 sec 52.3 MBytes 439 Mbits/sec 0 320 KBytes
[ 5] 5.00-6.00 sec 51.9 MBytes 435 Mbits/sec 0 320 KBytes
[ 5] 6.00-7.00 sec 51.7 MBytes 433 Mbits/sec 0 320 KBytes
[ 5] 7.00-8.00 sec 50.5 MBytes 424 Mbits/sec 0 320 KBytes
[ 5] 8.00-9.00 sec 51.5 MBytes 432 Mbits/sec 0 320 KBytes
[ 5] 9.00-10.00 sec 51.1 MBytes 429 Mbits/sec 0 320 KBytes
[ 5] 10.00-11.00 sec 51.0 MBytes 428 Mbits/sec 0 320 KBytes
[ 5] 11.00-12.00 sec 48.7 MBytes 408 Mbits/sec 0 320 KBytes
[ 5] 12.00-13.00 sec 46.7 MBytes 392 Mbits/sec 0 320 KBytes
[ 5] 13.00-14.00 sec 48.6 MBytes 408 Mbits/sec 0 320 KBytes
[ 5] 14.00-15.00 sec 49.0 MBytes 411 Mbits/sec 0 320 KBytes
[ 5] 15.00-16.00 sec 48.0 MBytes 403 Mbits/sec 1 320 KBytes
[ 5] 16.00-17.00 sec 49.6 MBytes 416 Mbits/sec 0 320 KBytes
[ 5] 17.00-18.00 sec 46.7 MBytes 392 Mbits/sec 0 320 KBytes
[ 5] 18.00-19.00 sec 49.2 MBytes 413 Mbits/sec 0 320 KBytes
[ 5] 19.00-20.00 sec 48.9 MBytes 410 Mbits/sec 0 320 KBytes
[ 5] 20.00-21.00 sec 48.2 MBytes 404 Mbits/sec 0 320 KBytes
[ 5] 21.00-22.00 sec 46.8 MBytes 393 Mbits/sec 0 320 KBytes
[ 5] 22.00-23.00 sec 46.6 MBytes 391 Mbits/sec 0 320 KBytes
[ 5] 23.00-24.00 sec 47.9 MBytes 402 Mbits/sec 0 320 KBytes
[ 5] 24.00-25.00 sec 48.6 MBytes 407 Mbits/sec 0 320 KBytes
[ 5] 25.00-26.00 sec 47.0 MBytes 394 Mbits/sec 0 320 KBytes
[ 5] 26.00-27.00 sec 47.3 MBytes 397 Mbits/sec 0 320 KBytes
[ 5] 27.00-28.00 sec 47.2 MBytes 396 Mbits/sec 0 320 KBytes
[ 5] 28.00-29.00 sec 44.6 MBytes 374 Mbits/sec 0 320 KBytes
[ 5] 29.00-30.00 sec 46.4 MBytes 389 Mbits/sec 0 320 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-30.00 sec 1.46 GBytes 417 Mbits/sec 1 sender
[ 5] 0.00-30.00 sec 1.45 GBytes 416 Mbits/sec receiver
With this small patch:
$ git diff
diff --git a/src/iperf_server_api.c b/src/iperf_server_api.c
index 18f105d..3c7f637 100644
--- a/src/iperf_server_api.c
+++ b/src/iperf_server_api.c
@@ -516,8 +516,8 @@ iperf_run_server(struct iperf_test *test)
} else if (test->mode != SENDER) { // In non-reverse active mode server ensures data is received
timeout_us = -1;
if (timeout != NULL) {
- used_timeout.tv_sec = timeout->tv_sec;
- used_timeout.tv_usec = timeout->tv_usec;
+ used_timeout.tv_sec = 0;
+ used_timeout.tv_usec = 0;
timeout_us = (timeout->tv_sec * SEC_TO_US) + timeout->tv_usec;
}
if (timeout_us < 0 || timeout_us > rcv_timeout_us) {
Client side:
$ taskset -c 1 iperf3 -l 64B localhost -t 30s -c localhost
Connecting to host localhost, port 5201
[ 5] local ::1 port 33844 connected to ::1 port 5201
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 77.3 MBytes 649 Mbits/sec 0 320 KBytes
[ 5] 1.00-2.00 sec 74.1 MBytes 621 Mbits/sec 0 320 KBytes
[ 5] 2.00-3.00 sec 67.6 MBytes 567 Mbits/sec 0 320 KBytes
[ 5] 3.00-4.00 sec 69.5 MBytes 583 Mbits/sec 0 320 KBytes
[ 5] 4.00-5.00 sec 68.8 MBytes 577 Mbits/sec 0 320 KBytes
[ 5] 5.00-6.00 sec 67.2 MBytes 564 Mbits/sec 0 320 KBytes
[ 5] 6.00-7.00 sec 66.8 MBytes 561 Mbits/sec 0 320 KBytes
[ 5] 7.00-8.00 sec 62.9 MBytes 528 Mbits/sec 0 320 KBytes
[ 5] 8.00-9.00 sec 64.4 MBytes 540 Mbits/sec 0 320 KBytes
[ 5] 9.00-10.00 sec 65.1 MBytes 546 Mbits/sec 0 320 KBytes
[ 5] 10.00-11.00 sec 63.4 MBytes 532 Mbits/sec 0 320 KBytes
[ 5] 11.00-12.00 sec 64.6 MBytes 542 Mbits/sec 0 320 KBytes
[ 5] 12.00-13.00 sec 64.1 MBytes 537 Mbits/sec 0 320 KBytes
[ 5] 13.00-14.00 sec 64.2 MBytes 538 Mbits/sec 0 320 KBytes
[ 5] 14.00-15.00 sec 64.0 MBytes 537 Mbits/sec 0 320 KBytes
[ 5] 15.00-16.00 sec 62.1 MBytes 521 Mbits/sec 0 320 KBytes
[ 5] 16.00-17.00 sec 60.4 MBytes 507 Mbits/sec 0 320 KBytes
[ 5] 17.00-18.00 sec 62.2 MBytes 522 Mbits/sec 0 320 KBytes
[ 5] 18.00-19.00 sec 62.4 MBytes 523 Mbits/sec 0 320 KBytes
[ 5] 19.00-20.00 sec 61.5 MBytes 516 Mbits/sec 0 320 KBytes
[ 5] 20.00-21.00 sec 60.9 MBytes 511 Mbits/sec 0 320 KBytes
[ 5] 21.00-22.00 sec 62.7 MBytes 526 Mbits/sec 0 320 KBytes
[ 5] 22.00-23.00 sec 61.7 MBytes 517 Mbits/sec 0 320 KBytes
[ 5] 23.00-24.00 sec 61.9 MBytes 519 Mbits/sec 0 320 KBytes
[ 5] 24.00-25.00 sec 62.4 MBytes 523 Mbits/sec 0 320 KBytes
[ 5] 25.00-26.00 sec 61.8 MBytes 519 Mbits/sec 0 320 KBytes
[ 5] 26.00-27.00 sec 56.1 MBytes 471 Mbits/sec 0 320 KBytes
[ 5] 27.00-28.00 sec 62.0 MBytes 520 Mbits/sec 0 320 KBytes
[ 5] 28.00-29.00 sec 61.8 MBytes 518 Mbits/sec 0 320 KBytes
[ 5] 29.00-30.00 sec 61.5 MBytes 516 Mbits/sec 0 320 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-30.00 sec 1.88 GBytes 538 Mbits/sec 0 sender
[ 5] 0.00-30.00 sec 1.88 GBytes 538 Mbits/sec receiver
I haven't analyzed the impact this change could have on other workloads, so just take it as an example.
Default | Patched | Delta |
---|---|---|
417 Mbits/sec | 538 Mbits/sec | 29% |
Hi @rsevilla87, very good and useful analysis! I tried your suggested change on my computer, and indeed the throughput is increased dramatically (in my case from 70Mbps to 93Mbps).
From what you found, I think a "receiving burst" option should be added to iperf3, i.e., the select() timeout will be used (non-zero) only every "burst" number of times. Would you like to submit a PR with such proposed changes?
If you submit such a PR, please note the following:
- I suggest adding the receive burst number as a third optional parameter to the -b option (the sending burst number is the second optional parameter), i.e. -b #[KMG][/#][/#]. Note that none of the values should be mandatory, i.e. //5 should be possible for setting only the receive burst to 5. The -b code is here. Help should also be updated here.
- The default of the receive burst should be 1, to keep backward compatibility (see the places where settings->burst is set).
- get/set functions should be added for the new option.
- The used_timeout should not be set to zero every "receive burst" number of reads (with the default of 1, it will never be set to zero); see the sketch after this list.
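A hypothetical sketch of that timeout logic, under the assumptions above (the names burst_timeout, receive_burst, and rcv_count are illustrative and do not exist in iperf3):

```c
#include <sys/select.h>
#include <string.h>

/* Return the timeout to pass to select(): honor the configured timeout only
 * once every 'receive_burst' reads; otherwise use a zero timeout so back-to-back
 * reads are not throttled. With receive_burst == 1 (the default), the configured
 * timeout is always used, preserving the current behavior. */
static struct timeval *burst_timeout(struct timeval *configured,
                                     struct timeval *scratch,
                                     int receive_burst,
                                     unsigned long rcv_count)
{
    if (configured == NULL || receive_burst <= 1 ||
        (rcv_count % (unsigned long)receive_burst) == 0)
        return configured;              /* honor the normal timeout */
    memset(scratch, 0, sizeof(*scratch));
    return scratch;                      /* {0, 0}: poll, don't sleep */
}
```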
There's some interesting and worthy analysis going on here!
I kind of wonder if the multi-threaded iperf3 (on the mt branch, currently in public beta, and eventually planned to be merged to the main codeline) is going to render this moot, or at least sufficiently change the problem (and maybe the solution). So it might be better to hold off on trying to fix this problem inside the current iperf3 implementation you see on the master branch.
To wit:
According to the above, one of the leading factors limiting iperf3 performance is a large number of select(2) calls and their impact on the sending of test data. This comes directly from an early design decision to have iperf3 run as a single thread. Because of this, the iperf3 process can't block in send() or recv() type system calls, because there are multiple sockets that need servicing, as well as various timers. This basically forces the use of select(2) with some timeout values.
The multi-threaded iperf3 assigns a different thread to every I/O stream. Because every stream/connection has its own dedicated thread, that thread can be allowed to block and we no longer need to do select(2) calls inside the threads doing I/O. We only use select(2) in the main thread, which manages the control connection and reporting.
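A rough sketch of that threading model (this is not the actual mt-branch code; stream_receiver is an illustrative name):

```c
#include <pthread.h>
#include <unistd.h>

struct stream { int fd; volatile int done; };

/* Each test stream gets a dedicated receiver thread that may block in read();
 * no select() is needed on the data path. The main thread keeps using select()
 * only for the control connection and reporting. */
static void *stream_receiver(void *arg)
{
    struct stream *s = arg;
    char buf[64 * 1024];

    while (!s->done) {
        ssize_t n = read(s->fd, buf, sizeof(buf)); /* blocks until data arrives */
        if (n <= 0)
            break;
        /* ...update per-stream byte/interval counters here... */
    }
    return NULL;
}

/* Per stream: pthread_create(&tid, NULL, stream_receiver, s); */
```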
Note that in general, small messages will still be less efficient than larger ones. That's generally true for almost all I/O. In fact, there are iperf3 use cases that rely on this behavior to simulate different applications' performance.
@davidBar-On @bmah888 Thanks for your thoughts! To give you some context, the root of this issue is that I have been trying to characterize network throughput/latency performance in different scenarios by comparing the results from different perf tools like netperf, uperf, and iperf3. iperf3 turned out to show much lower performance in small-packet-size scenarios compared with netperf or uperf. As I demonstrate below, the maximum throughput achieved by a single-threaded test is around 4.9 Gbps with both tools; however, I had to increase the packet size up to 8192 bytes on the iperf3 client to achieve similar performance.
Keep in mind that the uperf test was also single-threaded.
I've taken a look at the source code of these tools to find the main differences on the receiver side, and they are not using select to poll the socket fd.
I wonder why iperf3 uses it? I think the server side could avoid that amount of select syscalls, as read() is already a blocking operation that waits for socket data to become available.
I believe I found the root cause of the iperf3 low performance with small message sizes. While iperf3 uses the same send and receive message sizes, iperf2 uses different message lengths for the client and the server. That is, although the iperf2 client sends 1500-byte messages, the server receives 128KB (the default size) messages. I believe netperf's behavior is similar, based on the 13K "Recv socket size bytes" and the 1500 "Send message size bytes" in its report titles.
I tried a version of iperf3 that reads 10 times the message size, i.e. sending 1500-byte messages and receiving 15,000-byte messages. Throughput improved by 35% for a single-stream test and by over 50% for multi-stream tests.
Submitted PR #1691 with a suggested enhancement: each TCP receive reads "burst * message length" bytes.
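A minimal sketch of that idea (illustrative only; this is not the PR's actual code): the receiver reads into a buffer sized "burst * message length", so a single read() can drain several client-side messages at once.

```c
#include <stdlib.h>
#include <unistd.h>

/* Read up to burst * blksize bytes in one call, e.g. 10 * 1500 B = 15000 B. */
static ssize_t recv_burst(int fd, size_t blksize, int burst)
{
    size_t want = blksize * (size_t)burst;
    char *buf = malloc(want);
    ssize_t n;

    if (buf == NULL)
        return -1;
    n = read(fd, buf, want);   /* may return anywhere from 1 byte up to 'want' */
    /* ...account the received bytes toward the test's counters... */
    free(buf);
    return n;
}
```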