iperf
Add support for SKIP-RX-COPY and SO_ZEROCOPY/MSG_ZEROCOPY
- Version of iperf3 (or development branch, such as master or 3.1-STABLE) to which this pull request applies: master
- Issues fixed (if any): #1678
- Brief description of code changes (suitable for use as a commit message):
Add support for SKIP-RX-COPY (using MSG_TRUNC) and SO_ZEROCOPY/MSG_ZEROCOPY. Although it is not clear that all of the added functionality improves performance and throughput, support for all these options and their combinations was added to allow testing all of them. The assumption is that different environments may have different levels of support for the different options and their combinations.
(Note that running bootstrap.sh; configure is required before make to support the new features.)
The added options are:
1. --skip-rx-copy: when used, for both TCP and UDP, recv(..., MSG_TRUNC) is used instead of read().
2. Support for MSG_ZEROCOPY. When used, the socket option SO_ZEROCOPY is set and send(..., MSG_ZEROCOPY) is used instead of write(). MSG_ZEROCOPY is used in the following cases:
   2.1 UDP: when the -Z/--zerocopy option is set.
   2.2 TCP: when --zerocopy=z is set. Otherwise, sendfile() continues to be used for TCP zero copy.
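As a rough illustration of the first option for TCP, the sketch below sends 1000 filler bytes followed by a 4-byte marker over a loopback connection, drops the filler with recv(..., MSG_TRUNC), and then reads the marker normally. It is not code from the patch: `tcp_pair()`, `discard_then_read()`, `SKIP_LEN` and the "tail" marker are hypothetical names chosen for this example, and it assumes Linux with a working loopback interface.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

#define SKIP_LEN 1000   /* bytes to discard; arbitrary value for this demo */

/* Hypothetical helper: build a connected TCP pair over 127.0.0.1. */
static int tcp_pair(int *tx, int *rx)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);
    int lfd = socket(AF_INET, SOCK_STREAM, 0);

    if (lfd < 0)
        return -1;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);   /* port 0: kernel picks one */
    if (bind(lfd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(lfd, 1) < 0 ||
        getsockname(lfd, (struct sockaddr *)&addr, &len) < 0) {
        close(lfd);
        return -1;
    }
    *tx = socket(AF_INET, SOCK_STREAM, 0);
    if (*tx < 0 || connect(*tx, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        if (*tx >= 0)
            close(*tx);
        close(lfd);
        return -1;
    }
    *rx = accept(lfd, NULL, NULL);
    close(lfd);
    if (*rx < 0) {
        close(*tx);
        return -1;
    }
    return 0;
}

/* Returns the byte count reported for the discarded part, or -1 on failure.
 * *buf_untouched is set to 1 if the receive buffer was left unmodified. */
ssize_t discard_then_read(char marker_out[4], int *buf_untouched)
{
    char out[SKIP_LEN + 4], in[SKIP_LEN];
    int tx, rx;
    ssize_t r = -1;

    if (tcp_pair(&tx, &rx) < 0)
        return -1;
    memset(out, 'x', SKIP_LEN);
    memcpy(out + SKIP_LEN, "tail", 4);
    memset(in, 0, sizeof(in));
    if (send(tx, out, sizeof(out), 0) == (ssize_t)sizeof(out)) {
        /* TCP + MSG_TRUNC: the bytes are consumed but not copied into `in`. */
        r = recv(rx, in, sizeof(in), MSG_TRUNC | MSG_WAITALL);
        *buf_untouched = (in[0] == 0 && in[SKIP_LEN - 1] == 0);
        /* A normal read continues right after the discarded bytes. */
        if (recv(rx, marker_out, 4, MSG_WAITALL) != 4)
            r = -1;
    }
    close(tx);
    close(rx);
    return r;
}
```

On a Linux host, `discard_then_read()` is expected to report 1000 discarded bytes, leave the receive buffer zeroed, and deliver "tail" to the second read.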
Thanks David, this looks very promising.
Here are my results from a 100G back-to-back test environment:
Server command:
sudo taskset --cpu-list 1 ./src/iperf3 -s -i 1

Client commands:
sudo taskset --cpu-list 1 ./src/iperf3 -c 192.168.1.18 -i 1 -t 30 -V
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-30.00 sec 126 GBytes 36.0 Gbits/sec 8 sender
[ 5] 0.00-30.00 sec 126 GBytes 36.0 Gbits/sec receiver
CPU Utilization: local/sender 40.2% (0.9%u/39.3%s), remote/receiver 99.7% (1.7%u/98.0%s)

sudo taskset --cpu-list 1 ./src/iperf3 -c 192.168.1.18 -i 1 -t 30 -V --skip-rx-copy
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-30.00 sec 239 GBytes 68.4 Gbits/sec 0 sender
[ 5] 0.00-30.00 sec 239 GBytes 68.4 Gbits/sec receiver
CPU Utilization: local/sender 79.5% (1.8%u/77.8%s), remote/receiver 90.5% (4.4%u/86.0%s)

sudo taskset --cpu-list 1 ./src/iperf3 -c 192.168.1.18 -i 1 -t 30 -V --skip-rx-copy --zerocopy=z
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-30.00 sec 344 GBytes 98.5 Gbits/sec 0 sender
[ 5] 0.00-30.00 sec 344 GBytes 98.5 Gbits/sec receiver
CPU Utilization: local/sender 33.2% (2.1%u/31.1%s), remote/receiver 99.9% (6.4%u/93.5%s)

sudo taskset --cpu-list 1 ./src/iperf3 -c 192.168.1.18 -i 1 -t 30 -V --skip-rx-copy -Z
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-30.00 sec 346 GBytes 99.0 Gbits/sec 0 sender
[ 5] 0.00-30.00 sec 346 GBytes 99.0 Gbits/sec receiver
CPU Utilization: local/sender 52.5% (2.2%u/50.3%s), remote/receiver 100.0% (6.5%u/93.4%s)
This is fantastic! Thanks! I get 2x throughput on a 100G path.
Question: Shouldn't --skip-rx-copy be a server side option instead of a client side? (or both if using the -R option)
Question: Shouldn't --skip-rx-copy be a server side option instead of a client side? (or both if using the -R option)
The client sends this option (like several of the other options) to the server, so setting it for a test applies to both normal and reverse (-R) modes. Making it a server option would mean that there is no way to set the option per test when the server is the receiver.
Ah, that makes sense. Thanks. I do see a use case where I might want to force it on the server side, but passing the option from the client is more useful.
I now understand why you asked to have the option on the server side; it is a use case that I didn't think about. I am not sure whether it is better to add such an enhancement to this PR, or to wait until this (or another similar) PR is merged and then suggest the enhancement in an additional PR.
UPDATE: for now, I will not add skip-rx-copy to the server options, unless the reviewers recommend otherwise. This is because such a setting would behave differently for the server as a receiver and for the client as a receiver.
On Fri, May 3, 2024 at 8:10 PM Brian Tierney @.***> wrote:
Ah, that makes sense. Thanks. I do see a use case where I might want to force it on the server side, but passing the option from the client is more useful.
Thanks for the pull request! We're gonna need to study this a bit.
Having read through the patch (haven't run anything yet) and associated Linux documentation:
- MSG_TRUNC seems to affect TCP and UDP differently. You have things set up in a way that I would expect only TCP throughput to be improved. Has anyone seen an improvement in UDP throughput?
- This iteration of changes is probably incompatible with the --file flag (specifically MSG_ZEROCOPY). To be compatible we would need to wait for the kernel to notify us that it is done with the shared buffer.
https://man7.org/linux/man-pages/man2/recv.2.html https://man7.org/linux/man-pages/man7/tcp.7.html https://docs.kernel.org/networking/msg_zerocopy.html
MSG_TRUNC seems to affect TCP and UDP differently. You have things set up in a way that I would expect only TCP throughput to be improved. Has anyone seen an improvement in UDP throughput?
Why "things set up in a way that I would expect only TCP throughput to be improved"? Although in practice that may be the case, what is wrong in the implementation regarding UDP? (Both UDP and TCP use Nrecv() with MSG_TRUNC as the socket option.)
This iteration of changes is probably incompatible with the --file flag (specifically MSG_ZEROCOPY). To be compatible we would need to wait for the kernel to notify us that it is done with the shared buffer.
I forgot to take --file into account. Without this option, the sent buffer is fixed, so there is no need to handle the kernel notifications. To simplify the initial solution, I have now added a check that disallows the use of MSG_ZEROCOPY when --file is set.
Why "things set-up in a way that I would expect only TCP throughput to be improved"? Although practically that may be the case, what is wrong in the implementation regarding UDP? (Both UDP and TCP use Nrecv() with MSG_TRUNC as socket-option.)
If I understand the Linux kernel documentation correctly, recv(fd, buf, nleft, MSG_TRUNC) will discard different parts of the kernel's buffer depending on whether it is UDP or TCP.
With TCP it will discard nleft bytes from the kernel buffer.
With UDP it will discard everything after the first nleft bytes.
This means that we only have to read the first part of the UDP packet to get the UDP stats the server sticks in there:
--- a/src/iperf_udp.c
+++ b/src/iperf_udp.c
@@ -69,6 +69,9 @@ iperf_udp_recv(struct iperf_stream *sp)
sock_opt = 0;
#endif /* HAVE_MSG_TRUNC */
+ if (sock_opt)
+ size = sizeof(sec) + sizeof(usec) + sizeof(pcount);
+
r = Nrecv(sp->socket, sp->buffer, size, Pudp, sock_opt);
/*
--- a/src/net.c
+++ b/src/net.c
@@ -397,8 +397,8 @@ Nread(int fd, char *buf, size_t count, int prot)
int
Nrecv(int fd, char *buf, size_t count, int prot, int sock_opt)
{
- register ssize_t r;
- register size_t nleft = count;
+ register ssize_t r, total=0;
+ register ssize_t nleft = count;
struct iperf_time ftimeout = { 0, 0 };
fd_set rfdset;
@@ -441,6 +441,7 @@ Nrecv(int fd, char *buf, size_t count, int prot, int sock_opt)
} else if (r == 0)
break;
+ total += r;
nleft -= r;
buf += r;
@@ -477,7 +478,7 @@ Nrecv(int fd, char *buf, size_t count, int prot, int sock_opt)
}
}
}
- return count - nleft;
+ return total;
}
Doing a quick loopback test on my system shows about a 20% increase in throughput for large UDP packets.
On a quick tangent, it looks like with this change UDP tests are pretty much dominated by the overhead of select.
Below is not using --skip-rx-copy:
Below is using --skip-rx-copy:
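The datagram half of the MSG_TRUNC behaviour discussed above can be checked in isolation. The sketch below is not iperf3 code. It uses an AF_UNIX datagram socketpair as a stand-in for UDP (recv(2) documents the same MSG_TRUNC semantics for both), and `HDR_LEN`, `DGRAM_LEN` and `recv_header_only()` are hypothetical names. The 12-byte header size assumes three 32-bit fields for sec, usec and pcount.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

#define HDR_LEN   12     /* assumed: 32-bit sec + usec + pcount */
#define DGRAM_LEN 1400   /* arbitrary datagram size for this demo */

/* Sends one DGRAM_LEN-byte datagram and receives it into a HDR_LEN-byte
 * buffer with MSG_TRUNC. Returns what recv() reports, or -1 on failure. */
ssize_t recv_header_only(unsigned char hdr_out[HDR_LEN])
{
    unsigned char dgram[DGRAM_LEN];
    int sv[2];
    ssize_t r = -1;

    if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) < 0)
        return -1;
    memset(dgram, 0xAB, sizeof(dgram));   /* payload */
    memset(dgram, 0x01, HDR_LEN);         /* stand-in for the stats header */
    if (send(sv[0], dgram, sizeof(dgram), 0) == (ssize_t)sizeof(dgram)) {
        /* Only HDR_LEN bytes are copied out; the rest of the datagram is
         * dropped, yet the return value is the full datagram length. */
        r = recv(sv[1], hdr_out, HDR_LEN, MSG_TRUNC);
    }
    close(sv[0]);
    close(sv[1]);
    return r;
}
```

Here `recv_header_only()` should return 1400 while filling only the 12 header bytes, which is why the byte accounting still works when just the header is requested.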
Forgot to take --file into account. Without this option, the sent buffer is fixed, so there is no need to handle the kernel notifications. To simplify the initial solution, I now added a check to not allow the use of MSG_ZEROCOPY when --file is set.
I wasn't aware of this in my original comment, but since iperf likes to insert UDP stats at the start of the buffer used for UDP packets, using MSG_ZEROCOPY will create a similar race condition for UDP tests. This will cause iperf to misreport lost datagrams.
If I understand the Linux kernel documentation correctly, recv(fd, buf, nleft, MSG_TRUNC) will discard different parts of the kernel's buffer depending on if it is UDP or TCP. ... With UDP it will discard everything after nleft. This means that we only have to read the first part of UDP packet to get the UDP stats the server sticks in there:
Thanks a lot! I completely missed that point. I have now added your suggested change to iperf_udp_recv(). I didn't make the suggested changes to Nrecv(), since I understand that they are just cosmetic, and my usual approach is to make the minimum changes required, as this seems safer (less risk of introducing bugs, portability issues, etc.).
The new commit is rebased, and I forgot to mention previously that the changes include the PR #1708 fix, to reduce the server's CPU overhead.
... UDP stats at the start of the buffer used for UDP packets, using MSG_ZEROCOPY will create a similar race condition for UDP tests.
As you probably understood, I somehow overlooked the dynamic prefix of a UDP packet ... For the receiving side you solved the issue with the above. For the sending side it seems that the only solution is using the MSG_ZEROCOPY notifications, but I don't want to add this complexity at this point. It seems that initially it is better to just not implement zero copy for UDP. Before I do that, do you agree? Do you have another suggestion?
I agree that MSG_ZEROCOPY won't work with UDP without notifications.
The math for Nread/Nrecv took me a second to reason through (because of the double negative). I.e., with UDP_MAX reads you get something like:
count = 16;
nleft = count;
...
nleft -= 65507; // nleft = (-65491)
...
return count - nleft; // 16 - (-65491) = 65507
(which does account for all the bits lol)
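The same arithmetic can be run as plain C. In this sketch (not the iperf3 source; `bytes_via_nleft()`, `bytes_via_total()` and `UDP_MAX_PAYLOAD` are hypothetical names) the first function mirrors the original bookkeeping, where the diff shows nleft declared as size_t, so the "negative" value is really an unsigned wraparound that cancels out again. The second mirrors the suggested running total.

```c
#include <stddef.h>
#include <sys/types.h>

#define UDP_MAX_PAYLOAD 65507   /* 65535 - 8 (UDP header) - 20 (IPv4 header) */

/* Original style: track what is left and return count - nleft. */
size_t bytes_via_nleft(size_t count, ssize_t r)
{
    size_t nleft = count;

    nleft -= (size_t)r;     /* r > count: wraps around instead of going negative */
    return count - nleft;   /* wraps back, so the result equals r */
}

/* Suggested style: accumulate what recv() actually reported. */
ssize_t bytes_via_total(ssize_t r)
{
    ssize_t total = 0;

    total += r;
    return total;
}
```

With count = 16 and a single 65507-byte datagram, both functions yield 65507; the running total simply avoids relying on the wraparound.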
The changes to support SKIP-RX-COPY were moved to PR #1717, and this PR will be used only for the support of zero copy using SO_ZEROCOPY/MSG_ZEROCOPY.
The math for Nread/Nrecv took me a second to reason through (because of the double negative). ....
I am not sure I realized that nleft becomes negative, so for clarity I did make the suggested changes in PR #1717.
I agree that MSG_ZEROCOPY won't work with UDP without notifications.
Will add support for the notifications in this PR.
Added a separate PR #1720 for the support of MSG_ZEROCOPY/SO_ZEROCOPY, now with notifications support.
Closing this PR as its functionality is now split between PRs #1717 and #1720.