Import fast crc32 from Stephan Brumme
Description
Profiling UDPspeeder with perf shows that the CRC32 function accounts for 10~20% of CPU usage. Replacing it with a faster open-source implementation yields a significant performance improvement.
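For context, the imported library exposes slicing-by-N functions of the form `uint32_t crc32_4bytes(const void* data, size_t length, uint32_t previousCrc32)`. Below is a minimal sketch of the intended kind of call-site swap; the header path and the crc32h prototype are assumptions for illustration, not the exact code in this PR.

```cpp
// Sketch only; the real call sites are in UDPspeeder's packet path.
#include <cstddef>
#include <cstdint>
#include "crc32/Crc32.h"   // assumed header location for the imported Crc32.cpp

uint32_t crc32h(unsigned char *message, int n_bytes);  // existing byte-at-a-time CRC (assumed prototype)

// Before: one table lookup per byte.
uint32_t checksum_old(const unsigned char *pkt, size_t len)
{
    return crc32h(const_cast<unsigned char *>(pkt), (int)len);
}

// After: same CRC-32 polynomial, but four bytes per iteration (Slicing-by-4).
uint32_t checksum_new(const unsigned char *pkt, size_t len)
{
    return crc32_4bytes(pkt, len, 0);
}
```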
Performance Test
Setup
- UDPspeeder client: runs on test machine A, a 1-core 1 GiB VM running Debian 12 on my Proxmox VE NAS, which is mostly idle during the test and has a Ryzen 3 4350G CPU.
- UDPspeeder server and iperf3 server: run on server B, a Debian 12 VM hosted on my Windows Hyper-V PC.
- iperf3 client: runs directly on my PC, which also hosts server B.
Test machine A (the UDPspeeder client) runs UDPspeeder binaries built with "make" from the crc32 and branch_libev branches; server B (the UDPspeeder server) runs a binary downloaded directly from GitHub, to ensure compatibility.
Script used
Simulating delay and loss
tc qdisc del dev eth0 root
tc qdisc add dev eth0 root handle 1: prio
tc qdisc add dev eth0 parent 1:3 handle 30: netem delay 1ms loss 0%
tc filter add dev eth0 protocol ip parent 1:0 prio 3 u32 match ip dst 192.168.1.133 flowid 1:3
tc filter add dev eth0 protocol ip parent 1:0 prio 3 u32 match ip dst 192.168.1.132 flowid 1:3
UDPspeeder command lines
./speederv2_amd64 -s -l0.0.0.0:5301 -r127.0.0.1:5201
time ./speederv2_branch_libev -c -l0.0.0.0:5201 -r192.168.1.133:5301
time ./speederv2_crc -c -l0.0.0.0:5201 -r192.168.1.133:5301
iperf command lines
.\iperf3.exe -c 192.168.1.132 -p 5201 -u -b 100M -t 60 --length 1400
.\iperf3.exe -c 192.168.1.132 -p 5201 -u -b 100M -t 60 -R --length 1400
Test results
"real time" is the time before speederv2 client is ran and Ctrl+C is pressed, not meaningful in the comparision.
| ping=delay*2 | Comments | Send: branch_libev | Send: crc32 PR | Receive (-R): branch_libev | Receive (-R): crc32 PR |
|---|---|---|---|---|---|
| delay0/loss0 100M/length1400 | Ideal LAN or same city | real 1m4.002s user 0m8.940s sys 0m7.169s | real 1m12.985s user 0m5.239s sys 0m6.836s | real 1m4.562s user 0m5.966s sys 0m4.461s | real 1m10.889s user 0m2.646s sys 0m4.357s |
| delay10/loss1 100M/length1400 | ping=20 with 1% loss; a pretty good China-HK/KR/JP network | real 1m7.122s user 0m9.165s sys 0m4.124s | real 1m10.473s user 0m5.827s sys 0m3.216s | real 1m2.657s user 0m5.850s sys 0m4.625s | real 1m2.515s user 0m2.633s sys 0m4.435s |
| delay10/loss1 50M/length200 | Testing small packets | real 1m14.906s user 0m7.325s sys 0m3.939s | real 1m3.129s user 0m5.751s sys 0m3.155s | real 1m6.626s user 0m3.138s sys 0m9.732s | real 1m2.153s user 0m2.458s sys 0m9.062s |
| delay30/loss5 100M/length1400 | ping=60 with 5% loss; a less ideal network within Asia | real 1m5.962s user 0m9.277s sys 0m7.061s | real 1m3.250s user 0m5.494s sys 0m4.875s | real 1m7.642s user 0m6.329s sys 0m4.158s | real 1m3.010s user 0m2.457s sys 0m4.704s |
| delay30/loss5 50M/length200 | Testing small packets | real 1m33.371s user 0m7.033s sys 0m5.237s | real 1m7.538s user 0m5.097s sys 0m3.929s | real 1m3.274s user 0m2.006s sys 0m11.125s | real 1m3.346s user 0m1.585s sys 0m9.966s |
| delay80/loss10 50M/length1400 | ping=160 with 10% loss; a pretty bad network across the Pacific, usually aiming at low cost and web browsing | real 1m5.898s user 0m4.867s sys 0m2.093s | real 1m4.362s user 0m3.054s sys 0m1.760s | real 1m9.226s user 0m3.642s sys 0m1.938s | real 1m4.579s user 0m1.189s sys 0m2.671s |
BIG ENDIAN Validation
TODO
Flame Graph
TODO
@wangyu- I was measuring CPU usage over the Internet while running iperf3; however, that may not be trustworthy enough for submitting PRs. Do you have any suggestions on what dataset to use and how to evaluate the performance?
@lsylsy2 Hi, thanks for the PR.
For performance measuring:
the best way is probably a flame graph, here is an example for udp2raw I did previously:
you can send packets at the same rate with iperf3, then generate the flame graph before and after the change.
IMO the current bottleneck is in the FEC library. This PR might improve the crc32 speed a lot, but might not be able to improve the overall speed a lot.
(I previously made some comments on improving the speed in https://github.com/wangyu-/UDPspeeder/issues/326)
the best way is probably a flame graph:
If you cannot get the flame graph working, you can consider making a simple benchmark between crc32h and crc32fast. If the performance difference is big, it's still convincing enough that this is a useful PR.
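For example, a standalone microbenchmark along these lines could work. This is only a sketch: it assumes crc32/Crc32.h is the imported header and that crc32h has the prototype declared below; adjust both to match the actual tree.

```cpp
// Minimal standalone benchmark sketch (not part of the PR).
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <vector>

#include "crc32/Crc32.h"   // crc32_4bytes(const void* data, size_t length, uint32_t previousCrc32)

uint32_t crc32h(unsigned char *message, int n_bytes);   // existing implementation (assumed prototype)

int main()
{
    std::vector<unsigned char> packet(1400, 0xAB);       // payload size used in the iperf3 tests
    const int iterations = 1000000;
    volatile uint32_t sink = 0;                          // prevents the loops from being optimized away

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; i++)
        sink ^= crc32h(packet.data(), (int)packet.size());
    auto t1 = std::chrono::steady_clock::now();

    for (int i = 0; i < iterations; i++)
        sink ^= crc32_4bytes(packet.data(), packet.size(), 0);
    auto t2 = std::chrono::steady_clock::now();

    printf("crc32h:       %.1f ms\n", std::chrono::duration<double, std::milli>(t1 - t0).count());
    printf("crc32_4bytes: %.1f ms\n", std::chrono::duration<double, std::milli>(t2 - t1).count());
    printf("(checksum sink: %08x)\n", (unsigned)sink);
    return 0;
}
```

Linking this against both the old CRC source and the imported Crc32.cpp and comparing the two times gives a per-buffer comparison without involving the network stack.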
From the source code, it looks like the author has already considered the case of BIG ENDIAN systems.
Have you or the author of the library actually tested crc32fast on BIG ENDIAN systems?
@lsylsy2 Hi, thanks for the PR.
For performance measuring:
the best way is probably a flame graph, here is an example for udp2raw I did previously:
you can send packets at the same rate with iperf3, then generate the flame graph before and after the change.
IMO the current bottleneck is in the FEC library. This PR might improve the crc32 speed a lot, but might not be able to improve the overall speed a lot.
(I previously made some comments on improving the speed in https://github.com/wangyu-/UDPspeeder/issues/326)
I found the performance issue using a flame graph: in higher-throughput scenarios (iperf with large UDP packets), crc32h was costing about 20% of the time. However, my test ran over a WAN with an unstable underlying link, so I was asking whether there is a standard way to measure performance. I will try running it between two machines on a LAN with stable, injected packet loss. BTW, optimizing the XOR encryption by using 64-bit operations can also improve performance by 3~10%, but I wrote it myself and it has not been tested on multiple platforms (it also needs changes to support 32-bit systems, etc.), so I won't submit it very soon. https://github.com/wangyu-/UDPspeeder/compare/branch_libev...lsylsy2:UDPspeeder:2406_optimization
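For reference, a minimal sketch of the 64-bit XOR idea, not the actual code in the linked branch: the repeating 8-byte key here is an assumption for illustration, and memcpy keeps the loads/stores unaligned-safe so the result matches the byte-wise loop on any endianness.

```cpp
// Sketch: XOR `len` bytes of `buf` in place with a repeating 8-byte key.
#include <cstddef>
#include <cstdint>
#include <cstring>

void xor_with_key(unsigned char *buf, size_t len, const unsigned char key[8])
{
    uint64_t key64;
    memcpy(&key64, key, 8);                  // key as one 64-bit word, in memory order

    size_t i = 0;
    for (; i + 8 <= len; i += 8) {           // bulk: one 64-bit XOR instead of eight 8-bit XORs
        uint64_t word;
        memcpy(&word, buf + i, 8);           // unaligned-safe load
        word ^= key64;
        memcpy(buf + i, &word, 8);
    }
    for (; i < len; ++i)                     // tail: remaining 0..7 bytes
        buf[i] ^= key[i % 8];
}
```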
From the source code, it looks like the author has already considered the case of BIG ENDIAN systems.
Have you or the author of the library actually tested crc32fast on BIG ENDIAN systems?
I myself have not, but the library itself supports BIG ENDIAN and has been tested (and bug-fixed); https://github.com/stbrumme/crc32/pull/8 could serve as an example. Do you know how I can get some virtual machines running in BIG ENDIAN to test it? It seems even the latest ARM Macs use little endian.
This is one flame graph I ran over branch_libev, using iperf to send and receive UDP packets. Although it was run over a WAN and the packet drop rate was unstable, it still shows crc32h costing a lot.
However, my test ran over a WAN with an unstable underlying link, so I was asking whether there is a standard way to measure performance. I will try running it between two machines on a LAN with stable, injected packet loss.
I think your idea works.
Personally, for convenience I would do it in VMs with virtualized LANs (I use Proxmox). Simulate packet loss with iptables or something else. Send packets at a fixed rate with iperf3.
Do you know how I can get some virtual machines running in BIG ENDIAN to test it? It seems even the latest ARM Macs use little endian.
Bochs can simulate BIG ENDIAN systems on a PC. The most commonly seen BIG ENDIAN systems nowadays are (BigEndian) MIPS. A simple verification on (BigEndian) MIPS with Bochs is sufficient IMO.
This is one flame graph I ran over branch_libev, using iperf to send and receive UDP packets. Although it was run over a WAN and the packet drop rate was unstable, it still shows crc32h costing a lot.
Interesting. Is this on the sending end or the receiving end?
If it's the receiving end and the packet loss is very tiny, then it's possible the FEC library doesn't need to do any calculation, and the bottleneck becomes crc32.
IMO the current bottleneck is in the FEC library. This PR might improve the crc32 speed a lot, but might not be able to improve the overall speed a lot.
(I previously made some comments on improving the speed in https://github.com/wangyu-/UDPspeeder/issues/326)
FEC is more resource-consuming on the sender side; in a "server is a cloud virtual server, client is a consumer router, download from server to client is usually much larger than upload" scenario, FEC may play a less important role. I'll try to run more tests and send the results later.
This is one flame graph I ran over branch_libev, using iperf to send and receive UDP packets. Although it was run over a WAN and the packet drop rate was unstable, it still shows crc32h costing a lot.
Interesting. Is this on the sending end or the receiving end?
If it's the receiving end and the packet loss is very tiny, then it's possible the FEC library doesn't need to do any calculation, and the bottleneck becomes crc32.
Server: Oracle ARM VPS in Osaka.
Client (running perf and generating this graph): a single-core virtual machine on a mostly idle AMD Ryzen 3 Pro 4350G, whose single-thread performance should be similar to a Ryzen 5 3600 or i3-10100.
The test ran sending and receiving for 60 seconds each, with 100 Mbit/s throughput; however, I forget whether I set the packet size to 200/400 or the default ~1400.
Personally, for convenience I would do it in VMs with virtualized LANs (I use Proxmox). Simulate packet loss with iptables or something else. Send packets at a fixed rate with iperf3.
Forgot to say: tc and netem are actually easier for simulating packet loss.
Here is an example code piece:
DEV=ens5
# turn driver optimizations off
sudo ethtool -K $DEV gro off
sudo ethtool -K $DEV tso off
sudo ethtool -K $DEV gso off
sudo tc qdisc del dev $DEV root
sudo tc qdisc add dev $DEV root netem loss 5.5%
(It's copied from a more complex file I wrote. It might work perfectly, or it might have a typo.)
Hi, I've updated some performance tests. Overall the PR brings performance improvements in all scenarios tested (at least on amd64). I'll add a flame graph comparison and BIG ENDIAN validation later.
Hi, thanks for this PR. I have not done proper benchmarks, but I saw my throughput (with 100% CPU server-side) jump from ~14 Mbps to almost 90 Mbps on amd64 (Debian 12) running a simple speedtest.net test (via their Linux CLI client).
Note that there is a small change needed to fix compilation with CMake (also note I switched to -O3 in my tree, but that's unrelated):
diff --git a/CMakeLists.txt b/CMakeLists.txt
index d6b11ef..ca34fe0 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -23,6 +23,7 @@ set(SOURCE_FILES
tunnel_client.cpp
tunnel_server.cpp
my_ev.cpp
+ crc32/Crc32.cpp
)
set(CMAKE_CXX_FLAGS "-Wall -Wextra -Wno-unused-variable -Wno-unused-parameter -Wno-missing-field-initializers -O3 -g -fsanitize=address,undefined")
Edit: disabling the -fsanitize=address flag, which is enabled by default, further improves performance. Apparently it adds about 2x runtime overhead.
Thank you for the response and the improvement. I was busy with private work and haven't done the BIG ENDIAN test yet. Let me see if I can try it and include the other improvements.
You could use the isa-l library to implement fast FEC and CRC32; in my tests the limiting factor with this library was my RAM speed.
From the source code, it looks like the author has already considered the case of BIG ENDIAN systems.
Have you or the author of the library actually tested crc32fast on BIG ENDIAN systems?
By installing qemu-user-static (as simple as "apt install qemu-user-static" on Debian), all Linux builds can easily be run on an amd64 PC. Let me do the test soon.
Edit: use qemu-user, not the older qemu-user-static.
You could use the isa-l library to implement fast FEC and CRC32; in my tests the limiting factor with this library was my RAM speed.
I did some research on this; it looks like its performance is not optimized for non-x86_64 environments?
You could use the isa-l library to implement fast FEC and CRC32; in my tests the limiting factor with this library was my RAM speed.
I did some research on this; it looks like its performance is not optimized for non-x86_64 environments?
They heavily optimized it for x86 environments, but there is some basic optimization that benefits all environments. For x86, because there are many SIMD extensions, they tried to take full advantage of them.
However, my test ran over a WAN with an unstable underlying link, so I was asking whether there is a standard way to measure performance. I will try running it between two machines on a LAN with stable, injected packet loss.
I think your idea works.
Personally, for convenience I would do it in VMs with virtualized LANs (I use Proxmox). Simulate packet loss with iptables or something else. Send packets at a fixed rate with iperf3.
Do you know how I can get some virtual machines running in BIG ENDIAN to test it? It seems even the latest ARM Macs use little endian.
Bochs can simulate BIG ENDIAN systems on a PC. The most commonly seen BIG ENDIAN systems nowadays are (BigEndian) MIPS. A simple verification on (BigEndian) MIPS with Bochs is sufficient IMO.
Before further tests: the lib has a GitHub mirror, and it has BIG ENDIAN support (https://github.com/stbrumme/crc32/commit/389309c1dc434444dedb67bc0674ef35c4769ce5); let me write a quick test to validate the lib.
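For example, a quick standalone check could look like the sketch below (assuming the imported header is crc32/Crc32.h). 0xCBF43926 is the standard CRC-32 check value for the ASCII string "123456789", so a correct build should return it on both little-endian and big-endian hosts.

```cpp
// Quick sanity test for the imported CRC32 implementation.
#include <cstdint>
#include <cstdio>
#include <cstring>

#include "crc32/Crc32.h"   // assumed header location; provides crc32_4bytes(data, length, previousCrc32)

int main()
{
    const char *msg = "123456789";
    uint32_t crc = crc32_4bytes(msg, strlen(msg), 0);
    printf("crc32(\"123456789\") = %08x (expected cbf43926)\n", (unsigned)crc);
    return crc == 0xCBF43926u ? 0 : 1;   // non-zero exit if the port is broken on this platform
}
```

Running it natively and then under big-endian emulation, and getting the same value both times, should cover the endian path.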
The code does have explicit Big Endian support:
uint32_t crc32_4bytes(const void* data, size_t length, uint32_t previousCrc32)
{
  uint32_t crc = ~previousCrc32; // same as previousCrc32 ^ 0xFFFFFFFF
  const uint32_t* current = (const uint32_t*) data;

  // process four bytes at once (Slicing-by-4)
  while (length >= 4)
  {
#if __BYTE_ORDER == __BIG_ENDIAN
    uint32_t one = *current++ ^ swap(crc);
    crc = Crc32Lookup[0][ one & 0xFF] ^
          Crc32Lookup[1][(one>> 8) & 0xFF] ^
          Crc32Lookup[2][(one>>16) & 0xFF] ^
          Crc32Lookup[3][(one>>24) & 0xFF];
#else
    uint32_t one = *current++ ^ crc;
    crc = Crc32Lookup[0][(one>>24) & 0xFF] ^
          Crc32Lookup[1][(one>>16) & 0xFF] ^
          Crc32Lookup[2][(one>> 8) & 0xFF] ^
          Crc32Lookup[3][ one & 0xFF];
#endif
    length -= 4;
  }
The risk that it breaks Big Endian should be small. I just merged it; thanks for the PR.
I am currently outside; I will double-check Big Endian before publishing a new release.
The code does have explicit Big Endian support: (crc32_4bytes source quoted above)
The risk that it breaks Big Endian should be small. I just merged it; thanks for the PR.
I am currently outside; I will double-check Big Endian before publishing a new release.
Hey @wangyu-, when I tested this, it needed CMakeLists.txt edited to add the new file as well. Also, I saw that the address sanitizer was enabled by default, which causes a big performance hit. See https://github.com/wangyu-/UDPspeeder/pull/327#issuecomment-2405784993. Thanks!
I will add the crc32/Crc32.cpp to CMakeLists.txt.
the CMakeLists.txt is not suggested for use, see:
#note: experimental
# currently only used for generating `compile_commands.json` for clangd
# to build this project, it's suggested to use `makefile` instead
cmake_minimum_required(VERSION 3.7)
project(speederv2)
Oh got it, yes I missed that comment. That explains the sanitizer setting as well.
@wangyu- Is it possible to release a new version so this optimization can be included in distros?