UDPspeeder icon indicating copy to clipboard operation
UDPspeeder copied to clipboard

Import fast crc32 from Stephan Brumme

Open lsylsy2 opened this issue 1 year ago • 15 comments

Description

After using perf to analyze the performance of UDPspeeder, the CRC32 function is costing 10~20% of CPU usage. Replacing it with an open source faster implementation can make significant improvement in performance.

Performance Test

Setup

  1. UDPspeeder client: running on test machine A is a 1Core 1GiB VM running Debian 12 on my Proxmox VE NAS, which is mostly idling during the test, and have a Ryzen 3 4350G CPU.
  2. UDPspeeder server and iperf3 server: running on server B is running Debian 12 in a VM running on my Windows Hyper-V PC.
  3. ipserf3 client: running on my PC directly which also hosts server B.

Test machine A (UDPspeeder client) is running UDPspeeder binary from running "make" on crc32 and branch_libev branches, server B (UDPspeeder server) is running binary directly downloaded from github, to ensure compability.

Script used

Simulating delay and loss

tc qdisc del dev eth0 root
tc qdisc add dev eth0 root handle 1: prio
tc qdisc add dev eth0 parent 1:3 handle 30: netem delay 1ms loss 0%
tc filter add dev eth0 protocol ip parent 1:0 prio 3 u32 match ip dst 192.168.1.133 flowid 1:3
tc filter add dev eth0 protocol ip parent 1:0 prio 3 u32 match ip dst 192.168.1.132 flowid 1:3

UDPspeeder command lines

./speederv2_amd64 -s -l0.0.0.0:5301 -r127.0.0.1:5201
time ./speederv2_branch_libev -c -l0.0.0.0:5201 -r192.168.1.133:5301
time ./speederv2_crc -c -l0.0.0.0:5201 -r192.168.1.133:5301

iperf command lines

.\iperf3.exe -c 192.168.1.132 -p 5201 -u -b 100M -t 60 --length 1400
.\iperf3.exe -c 192.168.1.132 -p 5201 -u -b 100M -t 60 -R --length 1400

Test results

"real time" is the time before speederv2 client is ran and Ctrl+C is pressed, not meaningful in the comparision.

    branch_libev crc32 PR branch_libev crc32 PR
ping=delay*2 Comments Send   Receive(-R)  
delay0/loss0 100M/length1400 Ideal LAN or same city real    1m4.002s user    0m8.940s sys     0m7.169s real    1m12.985s user    0m5.239s sys     0m6.836s real    1m4.562s user    0m5.966s sys     0m4.461s real    1m10.889s user    0m2.646s sys     0m4.357s
delay10/loss1 100M/length1400 ping=20 with 1% loss A pretty good China-HK/KR/JP network real    1m7.122s user    0m9.165s sys     0m4.124s real    1m10.473s user    0m5.827s sys     0m3.216s real    1m2.657s user    0m5.850s sys     0m4.625s real    1m2.515s user    0m2.633s sys     0m4.435s
delay10/loss1 50M/length200 Testing small packets real    1m14.906s user    0m7.325s sys     0m3.939s real    1m3.129s user    0m5.751s sys     0m3.155s real    1m6.626s user    0m3.138s sys     0m9.732s real    1m2.153s user    0m2.458s sys     0m9.062s
delay30/loss5 100M/length1400 ping=60 with 5% loss A less ideal network within Asia real    1m5.962s user    0m9.277s sys     0m7.061s real    1m3.250s user    0m5.494s sys     0m4.875s real    1m7.642s user    0m6.329s sys     0m4.158s real    1m3.010s user    0m2.457s sys     0m4.704s
delay30/loss5 50M/length200 Testing small packets real    1m33.371s user    0m7.033s sys     0m5.237s real    1m7.538s user    0m5.097s sys     0m3.929s real    1m3.274s user    0m2.006s sys     0m11.125s real    1m3.346s user    0m1.585s sys     0m9.966s
delay80/loss10 50M/length1400 ping=160 with 10% loss Pretty bad network across the Pacific Usually aiming low cost Web browsing real    1m5.898s user    0m4.867s sys     0m2.093s real    1m4.362s user    0m3.054s sys     0m1.760s real    1m9.226s user    0m3.642s sys     0m1.938s real    1m4.579s user    0m1.189s sys     0m2.671s

%9GCR@R$7_U~8NSG{NZ670X

BIG ENDIAN Validation

TODO

Flame Graph

TODO

lsylsy2 avatar Jun 16 '24 14:06 lsylsy2

@wangyu- I was measuring CPU usage in Internet running iperf3, however, that may not be trusty enough for submitting PRs. Do you have any suggestion on the dataset and how to evaluate the performance?

lsylsy2 avatar Jun 16 '24 15:06 lsylsy2

@lsylsy2 Hi, thanks for the PR.

For peformance measuring:

the best way is probably flame graph, here is an example for udp2raw I did previously:

you can send same speed of packets with iperf3, then genertate the flame graph before and after the change.

wangyu- avatar Jun 17 '24 04:06 wangyu-

IMO the current bottleneck is at the FEC library. This PR might improve the crc32 speed a lot, but might not be able to improve the overall speed a lot.

(I preivously made some comments on improving the speed in https://github.com/wangyu-/UDPspeeder/issues/326)

wangyu- avatar Jun 17 '24 04:06 wangyu-

the best way is probably flame graph:

If you cannot make flame graph working. You can consider make a simple benchmark between the crc32h and crc32fast. If the performance difference is big, it's still convincing enough this is a useful PR.

wangyu- avatar Jun 17 '24 05:06 wangyu-

From the source code, looks like the author has already considered the case of BIG ENDIAN systems.

Have you or the author of the library acutally tested crc32fast on BIG ENDIAN systems?

wangyu- avatar Jun 17 '24 05:06 wangyu-

@lsylsy2 Hi, thanks for the PR.

For peformance measuring:

the best way is probably flame graph, here is an example for udp2raw I did previously:

you can send same speed of packets with iperf3, then genertate the flame graph before and after the change.

IMO the current bottleneck is at the FEC library. This PR might improve the crc32 speed a lot, but might not be able to improve the overall speed a lot.

(I preivously made some comments on improving the speed in https://github.com/wangyu-/UDPspeeder/issues/326)

I was finding the performance issue using flamegraph, in larger throughput scenarios (iperf with large udp packets), crc32h was costing 20% of time. However, my test was run over WAN with unstable underlying link, so I was asking for if there is a performance measuring standard. Will try to run over two machines in LAN and introducing stable packet drops. BTW, optimizing the XOR encryption can also improve the performance by 3~10% by utilizing 64bit operations, but it's written by me and have not been tested on multiple platforms (and also need to modify to support 32bit systems, etc.), so I'll not submit it very soon. https://github.com/wangyu-/UDPspeeder/compare/branch_libev...lsylsy2:UDPspeeder:2406_optimization

From the source code, looks like the author has already considered the case of BIG ENDIAN systems.

Have you or the author of the library acutally tested crc32fast on BIG ENDIAN systems?

I myself have not, but the library itself supports BIG ENDIAN and been tested (and bug fixed), this could be an example. https://github.com/stbrumme/crc32/pull/8 Do you know can I have some virtual machines running in BIG ENDIAN and test it? It seems even the latest ARM Mac is using little endian.

lsylsy2 avatar Jun 17 '24 05:06 lsylsy2

branch_libev This is one flame graph I run over branch_libev and used iperf to send and receive UDP packets. However this is run over WAN and packet drop rate was instable, it still shows crc32h is costing a lot.

lsylsy2 avatar Jun 17 '24 05:06 lsylsy2

However, my test was run over WAN with unstable underlying link, so I was asking for if there is a performance measuring standard. Will try to run over two machines in LAN and introducing stable packet drops.

I think you idea works.

Personally for convenience I would do it in VM with virtualize LANs (I personally I use Proxmox). Simulate packet loss with iptables or something else. Send fixed speed of packet with iperf3.

Do you know can I have some virtual machines running in BIG ENDIAN and test it? It seems even the latest ARM Mac is using little endian.

Bochs can similuar BIG ENDIAN systems on PC. The most commonly seen BIG ENDIAN systems now days is (BigEndian) MIPS. A simple verify on (BigEndian) MIPS with Bochs is sufficient IMO.

wangyu- avatar Jun 17 '24 06:06 wangyu-

This is one flame graph I run over branch_libev and used iperf to send and receive UDP packets. However this is run over WAN and packet drop rate was instable, it still shows crc32h is costing a lot.

Intereseting. Is this on the sending end or receiver end?

If it's the receiver end and packet loss is very tiny, then it's possible the FEC library doesn't need to do any calculation, and the bottleneck become crc32.

wangyu- avatar Jun 17 '24 06:06 wangyu-

IMO the current bottleneck is at the FEC library. This PR might improve the crc32 speed a lot, but might not be able to improve the overall speed a lot.

(I preivously made some comments on improving the speed in https://github.com/wangyu-/UDPspeeder/issues/326)

FEC is more resource consuming on the sender side, if used in a "server is a cloud virtual server, client is a consumer router, download from server to client is usually much larger than upload" scenario, FEC may act as a less important role. I'll try to make more tests and send the result later.

lsylsy2 avatar Jun 17 '24 06:06 lsylsy2

This is one flame graph I run over branch_libev and used iperf to send and receive UDP packets. However this is run over WAN and packet drop rate was instable, it still shows crc32h is costing a lot.

Intereseting. Is this on the sending end or receiver end?

If it's the receiver end and packet loss is very tiny, then it's possible the FEC library doesn't need to do any calculation, and the bottleneck become crc32.

Server: Oracle ARM VPS in Osaka Client (running perf and generating this graph): a single-core virtual machine on a mostly idling AMD Ryzen 3 pro 4350g, which single thread performance should be similar to Ryzen 5 3600 or i3-10100. The test was ran that sending and receiving was ran for both 60 seconds, each with 100Mb(it)ps throughput, however I forget if I set the packet size to 200/400 or the default ~1400.

lsylsy2 avatar Jun 17 '24 06:06 lsylsy2

Personally for convenience I would do it in VM with virtualize LANs (I personally I use Proxmox). Simulate packet loss with iptables or something else. Send fixed speed of packet with iperf3.

Forgot to say. tc and netem is acutally easier on simulate packet loss.

Here is some example code piece:

DEV=ens5

# turn driver optimizations off
sudo ethtool -K $DEV gro off
sudo ethtool -K $DEV tso off
sudo ethtool -K $DEV gso off

sudo tc qdisc del dev $DEV root
sudo tc qdisc add dev $DEV root netem loss 5.5%

(it's copied from a more complexed file I wrote. It might work perfectly, or might have some typo)

wangyu- avatar Jun 17 '24 06:06 wangyu-

Hi, I've updated some performance tests. overall it's bringing performance improvements in all scenarios tested (at least in amd64). Later adding flame graph comparison and BIG ENDIAN validation

lsylsy2 avatar Jun 22 '24 09:06 lsylsy2

hi, thanks for this PR. i have not done proper benchmarks but saw my throughput (with 100% cpu server-side) at ~14mbps jump to almost 90mbps on amd64 (debian 12) running simple speedtest.net (via their linux CLI client) test.

note that there is a small change needed to fix compilation with cmake (oh, also note i switched to -O3 in my tree, but unrelated)

diff --git a/CMakeLists.txt b/CMakeLists.txt
index d6b11ef..ca34fe0 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -23,6 +23,7 @@ set(SOURCE_FILES
         tunnel_client.cpp
         tunnel_server.cpp
         my_ev.cpp
+       crc32/Crc32.cpp
 )
 set(CMAKE_CXX_FLAGS "-Wall -Wextra -Wno-unused-variable -Wno-unused-parameter -Wno-missing-field-initializers -O3 -g -fsanitize=address,undefined")

edit: disabling -fsanitize=address cflag, which is enabled by default, further improves performance. apparently it adds about 2x runtime overhead.

tofurky avatar Oct 10 '24 18:10 tofurky

Thank you for the response and improvement, I was busy on private work and not done the BIG ENDIAN test. Let me see if I can try and include the other improvements

tofurky @.***> 于2024年10月11日周五 02:32写道:

hi, thanks for this PR. i have not done proper benchmarks but saw my throughput (with 100% cpu server-side) at ~14mbps jump to almost 90mbps on amd64 (debian 12) running simple speedtest.net (via their linux CLI client) test.

note that there is a small change needed to fix compilation with cmake :

diff --git a/CMakeLists.txt b/CMakeLists.txt index d6b11ef..ca34fe0 100644 --- a/CMakeLists.txt +++ b/CMakeLists.txt @@ -23,6 +23,7 @@ set(SOURCE_FILES tunnel_client.cpp tunnel_server.cpp my_ev.cpp

  •   crc32/Crc32.cpp
    

) set(CMAKE_CXX_FLAGS "-Wall -Wextra -Wno-unused-variable -Wno-unused-parameter -Wno-missing-field-initializers -O3 -g -fsanitize=address,undefined")

— Reply to this email directly, view it on GitHub https://github.com/wangyu-/UDPspeeder/pull/327#issuecomment-2405784993, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAGUTWIR2HGVOFANR2PHEITZ23B23AVCNFSM6AAAAABJMU6V56VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDIMBVG44DIOJZGM . You are receiving this because you were mentioned.Message ID: @.***>

lsylsy2 avatar Oct 12 '24 04:10 lsylsy2

you could use isa-l library for implementing fast FEC and crc32, in my tests the limiting factor with this library was my ram speed.

xAJx1383 avatar Jan 02 '25 07:01 xAJx1383

From the source code, looks like the author has already considered the case of BIG ENDIAN systems.

Have you or the author of the library acutally tested crc32fast on BIG ENDIAN systems?

By installing qemu-user-static (as simple as apt install qemu-user-static in debian), all linux builds can be easily run on amd64 PC. Let me do the test soon

Edit: user qemu-user, not the old qemu-user-static

lsylsy2 avatar Mar 21 '25 06:03 lsylsy2

you could use isa-l library for implementing fast FEC and crc32, in my tests the limiting factor with this library was my ram speed.

Had some research on this, it looks like it's performance is not optimized for non-x64_64 environments?

lsylsy2 avatar Mar 21 '25 06:03 lsylsy2

you could use isa-l library for implementing fast FEC and crc32, in my tests the limiting factor with this library was my ram speed.

Had some research on this, it looks like it's performance is not optimized for non-x64_64 environments?

they heavily optimized it for x86 environments but there is some basic optimization which affects all environments. for x86 because there is many SMID extension they tried to take full advantage of that

xAJx1383 avatar Mar 23 '25 08:03 xAJx1383

However, my test was run over WAN with unstable underlying link, so I was asking for if there is a performance measuring standard. Will try to run over two machines in LAN and introducing stable packet drops.

I think you idea works.

Personally for convenience I would do it in VM with virtualize LANs (I personally I use Proxmox). Simulate packet loss with iptables or something else. Send fixed speed of packet with iperf3.

Do you know can I have some virtual machines running in BIG ENDIAN and test it? It seems even the latest ARM Mac is using little endian.

Bochs can similuar BIG ENDIAN systems on PC. The most commonly seen BIG ENDIAN systems now days is (BigEndian) MIPS. A simple verify on (BigEndian) MIPS with Bochs is sufficient IMO.

Before further tests: The lib has a github mirror and it has BIG ENDIAN support https://github.com/stbrumme/crc32/commit/389309c1dc434444dedb67bc0674ef35c4769ce5 let me write a quick test to validate the lib

shouyangli avatar Apr 12 '25 08:04 shouyangli

The code does have explicit Big Endian support:

uint32_t crc32_4bytes(const void* data, size_t length, uint32_t previousCrc32)
{
  uint32_t  crc = ~previousCrc32; // same as previousCrc32 ^ 0xFFFFFFFF
  const uint32_t* current = (const uint32_t*) data;

  // process four bytes at once (Slicing-by-4)
  while (length >= 4)
  {
#if __BYTE_ORDER == __BIG_ENDIAN
    uint32_t one = *current++ ^ swap(crc);
    crc = Crc32Lookup[0][ one      & 0xFF] ^
          Crc32Lookup[1][(one>> 8) & 0xFF] ^
          Crc32Lookup[2][(one>>16) & 0xFF] ^
          Crc32Lookup[3][(one>>24) & 0xFF];
#else
    uint32_t one = *current++ ^ crc;
    crc = Crc32Lookup[0][(one>>24) & 0xFF] ^
          Crc32Lookup[1][(one>>16) & 0xFF] ^
          Crc32Lookup[2][(one>> 8) & 0xFF] ^
          Crc32Lookup[3][ one      & 0xFF];
#endif

    length -= 4;
  }

The risk it breaks Big Endian should be small. I just merged, thanks for the PR.

I am currently outside, I will make a double check of Big Endian before publishing a new release.

wangyu- avatar May 03 '25 20:05 wangyu-

The code does have explicit Big Endian support:

uint32_t crc32_4bytes(const void* data, size_t length, uint32_t previousCrc32)
{
  uint32_t  crc = ~previousCrc32; // same as previousCrc32 ^ 0xFFFFFFFF
  const uint32_t* current = (const uint32_t*) data;

  // process four bytes at once (Slicing-by-4)
  while (length >= 4)
  {
#if __BYTE_ORDER == __BIG_ENDIAN
    uint32_t one = *current++ ^ swap(crc);
    crc = Crc32Lookup[0][ one      & 0xFF] ^
          Crc32Lookup[1][(one>> 8) & 0xFF] ^
          Crc32Lookup[2][(one>>16) & 0xFF] ^
          Crc32Lookup[3][(one>>24) & 0xFF];
#else
    uint32_t one = *current++ ^ crc;
    crc = Crc32Lookup[0][(one>>24) & 0xFF] ^
          Crc32Lookup[1][(one>>16) & 0xFF] ^
          Crc32Lookup[2][(one>> 8) & 0xFF] ^
          Crc32Lookup[3][ one      & 0xFF];
#endif

    length -= 4;
  }

The risk it breaks Big Endian should be small. I just merged, thanks for the PR.

I am currently outside, I will make a double check of Big Endian before publishing a new release.

Hey @wangyu- , when I tested this it needs CMakeLists.txt edited to add the new file as well. Also, I saw that address sanitizer was enabled by default which causes a big performance hit. See https://github.com/wangyu-/UDPspeeder/pull/327#issuecomment-2405784993 thanks!

tofurky avatar May 03 '25 21:05 tofurky

I will add the crc32/Crc32.cpp to CMakeLists.txt.

the CMakeLists.txt is not suggested to be used, see:

#note: experimental
#      currently only used for generating `compile_commands.json` for clangd
#      to build this project, it's suggested to use `makefile` instead

cmake_minimum_required(VERSION 3.7)
project(speederv2)

wangyu- avatar May 03 '25 21:05 wangyu-

Oh got it, yes I missed that comment. That explains the sanitizer setting as well.

tofurky avatar May 03 '25 21:05 tofurky

@wangyu- Is it possible to release a new version so this optimization can be included in distros?

skylar-byte avatar Jun 15 '25 17:06 skylar-byte