Import fast crc32 from Stephan Brumme
Description
Profiling UDPspeeder with perf shows that the CRC32 function accounts for 10~20% of CPU usage. Replacing it with a faster open-source implementation yields a significant performance improvement.
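For context, the imported library exposes slicing-by-N functions of the form `uint32_t crc32_4bytes(const void* data, size_t length, uint32_t previousCrc32)`. Below is a minimal sketch of the intended kind of call-site swap; the header path and the crc32h prototype are assumptions for illustration, not the exact code in this PR.

```cpp
// Sketch only; the real call sites are in UDPspeeder's packet path.
#include <cstddef>
#include <cstdint>
#include "crc32/Crc32.h"   // assumed header location for the imported Crc32.cpp

uint32_t crc32h(unsigned char *message, int n_bytes);  // existing byte-at-a-time CRC (assumed prototype)

// Before: one table lookup per byte.
uint32_t checksum_old(const unsigned char *pkt, size_t len)
{
    return crc32h(const_cast<unsigned char *>(pkt), (int)len);
}

// After: same CRC-32 polynomial, but four bytes per iteration (Slicing-by-4).
uint32_t checksum_new(const unsigned char *pkt, size_t len)
{
    return crc32_4bytes(pkt, len, 0);
}
```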
Performance Test
Setup
- UDPspeeder client: runs on test machine A, a 1-core 1 GiB VM running Debian 12 on my Proxmox VE NAS, which is mostly idle during the test and has a Ryzen 3 4350G CPU.
- UDPspeeder server and iperf3 server: run on server B, a Debian 12 VM hosted on my Windows Hyper-V PC.
- iperf3 client: runs directly on my PC, which also hosts server B.
Test machine A (the UDPspeeder client) runs UDPspeeder binaries built with "make" from the crc32 and branch_libev branches; server B (the UDPspeeder server) runs a binary downloaded directly from GitHub, to ensure compatibility.
Script used
Simulating delay and loss
tc qdisc del dev eth0 root
tc qdisc add dev eth0 root handle 1: prio
tc qdisc add dev eth0 parent 1:3 handle 30: netem delay 1ms loss 0%
tc filter add dev eth0 protocol ip parent 1:0 prio 3 u32 match ip dst 192.168.1.133 flowid 1:3
tc filter add dev eth0 protocol ip parent 1:0 prio 3 u32 match ip dst 192.168.1.132 flowid 1:3
UDPspeeder command lines
./speederv2_amd64 -s -l0.0.0.0:5301 -r127.0.0.1:5201
time ./speederv2_branch_libev -c -l0.0.0.0:5201 -r192.168.1.133:5301
time ./speederv2_crc -c -l0.0.0.0:5201 -r192.168.1.133:5301
iperf command lines
.\iperf3.exe -c 192.168.1.132 -p 5201 -u -b 100M -t 60 --length 1400
.\iperf3.exe -c 192.168.1.132 -p 5201 -u -b 100M -t 60 -R --length 1400
Test results
"real time" is the time before speederv2 client is ran and Ctrl+C is pressed, not meaningful in the comparision.
| ping=delay*2 | Comments | Send: branch_libev | Send: crc32 PR | Receive (-R): branch_libev | Receive (-R): crc32 PR |
|---|---|---|---|---|---|
| delay0/loss0 100M/length1400 | Ideal LAN or same city | real 1m4.002s user 0m8.940s sys 0m7.169s | real 1m12.985s user 0m5.239s sys 0m6.836s | real 1m4.562s user 0m5.966s sys 0m4.461s | real 1m10.889s user 0m2.646s sys 0m4.357s |
| delay10/loss1 100M/length1400 | ping=20 with 1% loss; a pretty good China-HK/KR/JP network | real 1m7.122s user 0m9.165s sys 0m4.124s | real 1m10.473s user 0m5.827s sys 0m3.216s | real 1m2.657s user 0m5.850s sys 0m4.625s | real 1m2.515s user 0m2.633s sys 0m4.435s |
| delay10/loss1 50M/length200 | Testing small packets | real 1m14.906s user 0m7.325s sys 0m3.939s | real 1m3.129s user 0m5.751s sys 0m3.155s | real 1m6.626s user 0m3.138s sys 0m9.732s | real 1m2.153s user 0m2.458s sys 0m9.062s |
| delay30/loss5 100M/length1400 | ping=60 with 5% loss; a less ideal network within Asia | real 1m5.962s user 0m9.277s sys 0m7.061s | real 1m3.250s user 0m5.494s sys 0m4.875s | real 1m7.642s user 0m6.329s sys 0m4.158s | real 1m3.010s user 0m2.457s sys 0m4.704s |
| delay30/loss5 50M/length200 | Testing small packets | real 1m33.371s user 0m7.033s sys 0m5.237s | real 1m7.538s user 0m5.097s sys 0m3.929s | real 1m3.274s user 0m2.006s sys 0m11.125s | real 1m3.346s user 0m1.585s sys 0m9.966s |
| delay80/loss10 50M/length1400 | ping=160 with 10% loss; a pretty bad network across the Pacific, usually aiming at low cost and web browsing | real 1m5.898s user 0m4.867s sys 0m2.093s | real 1m4.362s user 0m3.054s sys 0m1.760s | real 1m9.226s user 0m3.642s sys 0m1.938s | real 1m4.579s user 0m1.189s sys 0m2.671s |
BIG ENDIAN Validation
TODO
Flame Graph
TODO
@wangyu- I was measuring CPU usage over the Internet while running iperf3; however, that may not be trustworthy enough for submitting PRs. Do you have any suggestions on what dataset to use and how to evaluate the performance?
@lsylsy2 Hi, thanks for the PR.
For performance measuring:
the best way is probably a flame graph, here is an example for udp2raw I did previously:
you can send packets at the same rate with iperf3, then generate the flame graph before and after the change.
IMO the current bottleneck is in the FEC library. This PR might improve the crc32 speed a lot, but might not be able to improve the overall speed a lot.
(I previously made some comments on improving the speed in https://github.com/wangyu-/UDPspeeder/issues/326)
the best way is probably a flame graph:
If you cannot get the flame graph working, you can consider making a simple benchmark between crc32h and crc32fast. If the performance difference is big, it's still convincing enough that this is a useful PR.
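For example, a standalone microbenchmark along these lines could work. This is only a sketch: it assumes crc32/Crc32.h is the imported header and that crc32h has the prototype declared below; adjust both to match the actual tree.

```cpp
// Minimal standalone benchmark sketch (not part of the PR).
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <vector>

#include "crc32/Crc32.h"   // crc32_4bytes(const void* data, size_t length, uint32_t previousCrc32)

uint32_t crc32h(unsigned char *message, int n_bytes);   // existing implementation (assumed prototype)

int main()
{
    std::vector<unsigned char> packet(1400, 0xAB);       // payload size used in the iperf3 tests
    const int iterations = 1000000;
    volatile uint32_t sink = 0;                          // prevents the loops from being optimized away

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; i++)
        sink ^= crc32h(packet.data(), (int)packet.size());
    auto t1 = std::chrono::steady_clock::now();

    for (int i = 0; i < iterations; i++)
        sink ^= crc32_4bytes(packet.data(), packet.size(), 0);
    auto t2 = std::chrono::steady_clock::now();

    printf("crc32h:       %.1f ms\n", std::chrono::duration<double, std::milli>(t1 - t0).count());
    printf("crc32_4bytes: %.1f ms\n", std::chrono::duration<double, std::milli>(t2 - t1).count());
    printf("(checksum sink: %08x)\n", (unsigned)sink);
    return 0;
}
```

Linking this against both the old CRC source and the imported Crc32.cpp and comparing the two times gives a per-buffer comparison without involving the network stack.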
From the source code, it looks like the author has already considered the case of BIG ENDIAN systems.
Have you or the author of the library actually tested crc32fast on BIG ENDIAN systems?
@lsylsy2 Hi, thanks for the PR.
For performance measuring:
the best way is probably a flame graph, here is an example for udp2raw I did previously:
you can send packets at the same rate with iperf3, then generate the flame graph before and after the change.
IMO the current bottleneck is in the FEC library. This PR might improve the crc32 speed a lot, but might not be able to improve the overall speed a lot.
(I previously made some comments on improving the speed in https://github.com/wangyu-/UDPspeeder/issues/326)
I found the performance issue using a flame graph: in higher-throughput scenarios (iperf with large UDP packets), crc32h was costing about 20% of the time. However, my test ran over a WAN with an unstable underlying link, so I was asking whether there is a standard way to measure performance. I will try running it between two machines on a LAN with stable, injected packet loss. BTW, optimizing the XOR encryption by using 64-bit operations can also improve performance by 3~10%, but I wrote it myself and it has not been tested on multiple platforms (it also needs changes to support 32-bit systems, etc.), so I won't submit it very soon. https://github.com/wangyu-/UDPspeeder/compare/branch_libev...lsylsy2:UDPspeeder:2406_optimization
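For reference, a minimal sketch of the 64-bit XOR idea, not the actual code in the linked branch: the repeating 8-byte key here is an assumption for illustration, and memcpy keeps the loads/stores unaligned-safe so the result matches the byte-wise loop on any endianness.

```cpp
// Sketch: XOR `len` bytes of `buf` in place with a repeating 8-byte key.
#include <cstddef>
#include <cstdint>
#include <cstring>

void xor_with_key(unsigned char *buf, size_t len, const unsigned char key[8])
{
    uint64_t key64;
    memcpy(&key64, key, 8);                  // key as one 64-bit word, in memory order

    size_t i = 0;
    for (; i + 8 <= len; i += 8) {           // bulk: one 64-bit XOR instead of eight 8-bit XORs
        uint64_t word;
        memcpy(&word, buf + i, 8);           // unaligned-safe load
        word ^= key64;
        memcpy(buf + i, &word, 8);
    }
    for (; i < len; ++i)                     // tail: remaining 0..7 bytes
        buf[i] ^= key[i % 8];
}
```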
From the source code, it looks like the author has already considered the case of BIG ENDIAN systems.
Have you or the author of the library actually tested crc32fast on BIG ENDIAN systems?
I myself have not, but the library itself supports BIG ENDIAN and has been tested (and bug-fixed); https://github.com/stbrumme/crc32/pull/8 could serve as an example. Do you know how I can get some virtual machines running in BIG ENDIAN to test it? It seems even the latest ARM Macs use little endian.
This is one flame graph I ran over branch_libev, using iperf to send and receive UDP packets. Although it was run over a WAN and the packet drop rate was unstable, it still shows crc32h costing a lot.
However, my test ran over a WAN with an unstable underlying link, so I was asking whether there is a standard way to measure performance. I will try running it between two machines on a LAN with stable, injected packet loss.
I think your idea works.
Personally, for convenience I would do it in VMs with virtualized LANs (I use Proxmox). Simulate packet loss with iptables or something else. Send packets at a fixed rate with iperf3.
Do you know how I can get some virtual machines running in BIG ENDIAN to test it? It seems even the latest ARM Macs use little endian.
Bochs can simulate BIG ENDIAN systems on a PC. The most commonly seen BIG ENDIAN systems nowadays are (BigEndian) MIPS. A simple verification on (BigEndian) MIPS with Bochs is sufficient IMO.
This is one flame graph I ran over branch_libev, using iperf to send and receive UDP packets. Although it was run over a WAN and the packet drop rate was unstable, it still shows crc32h costing a lot.
Interesting. Is this on the sending end or the receiving end?
If it's the receiving end and the packet loss is very tiny, then it's possible the FEC library doesn't need to do any calculation, and the bottleneck becomes crc32.
IMO the current bottleneck is in the FEC library. This PR might improve the crc32 speed a lot, but might not be able to improve the overall speed a lot.
(I previously made some comments on improving the speed in https://github.com/wangyu-/UDPspeeder/issues/326)
FEC is more resource-consuming on the sender side; in a "server is a cloud virtual server, client is a consumer router, download from server to client is usually much larger than upload" scenario, FEC may play a less important role. I'll try to run more tests and send the results later.
This is one flame graph I ran over branch_libev, using iperf to send and receive UDP packets. Although it was run over a WAN and the packet drop rate was unstable, it still shows crc32h costing a lot.
Interesting. Is this on the sending end or the receiving end?
If it's the receiving end and the packet loss is very tiny, then it's possible the FEC library doesn't need to do any calculation, and the bottleneck becomes crc32.
Server: Oracle ARM VPS in Osaka.
Client (running perf and generating this graph): a single-core virtual machine on a mostly idle AMD Ryzen 3 Pro 4350G, whose single-thread performance should be similar to a Ryzen 5 3600 or i3-10100.
The test ran sending and receiving for 60 seconds each, with 100 Mbit/s throughput; however, I forget whether I set the packet size to 200/400 or the default ~1400.
Personally, for convenience I would do it in VMs with virtualized LANs (I use Proxmox). Simulate packet loss with iptables or something else. Send packets at a fixed rate with iperf3.
Forgot to say: tc and netem are actually easier for simulating packet loss.
Here is an example code piece:
DEV=ens5
# turn driver optimizations off
sudo ethtool -K $DEV gro off
sudo ethtool -K $DEV tso off
sudo ethtool -K $DEV gso off
sudo tc qdisc del dev $DEV root
sudo tc qdisc add dev $DEV root netem loss 5.5%
(It's copied from a more complex file I wrote. It might work perfectly, or it might have a typo.)
Hi, I've updated some performance tests. Overall the PR brings performance improvements in all scenarios tested (at least on amd64). I'll add a flame graph comparison and BIG ENDIAN validation later.
Hi, thanks for this PR. I have not done proper benchmarks, but I saw my throughput (with 100% CPU server-side) jump from ~14 Mbps to almost 90 Mbps on amd64 (Debian 12) running a simple speedtest.net test (via their Linux CLI client).
Note that there is a small change needed to fix compilation with CMake (also note I switched to -O3 in my tree, but that's unrelated):
diff --git a/CMakeLists.txt b/CMakeLists.txt
index d6b11ef..ca34fe0 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -23,6 +23,7 @@ set(SOURCE_FILES
tunnel_client.cpp
tunnel_server.cpp
my_ev.cpp
+ crc32/Crc32.cpp
)
set(CMAKE_CXX_FLAGS "-Wall -Wextra -Wno-unused-variable -Wno-unused-parameter -Wno-missing-field-initializers -O3 -g -fsanitize=address,undefined")
Edit: disabling the -fsanitize=address flag, which is enabled by default, further improves performance. Apparently it adds about 2x runtime overhead.
Thank you for the response and the improvement. I was busy with private work and haven't done the BIG ENDIAN test yet. Let me see if I can try it and include the other improvements.
You could use the isa-l library to implement fast FEC and CRC32; in my tests the limiting factor with this library was my RAM speed.
From the source code, it looks like the author has already considered the case of BIG ENDIAN systems.
Have you or the author of the library actually tested crc32fast on BIG ENDIAN systems?
By installing qemu-user-static (as simple as "apt install qemu-user-static" on Debian), all Linux builds can easily be run on an amd64 PC. Let me do the test soon.
Edit: use qemu-user, not the older qemu-user-static.
You could use the isa-l library to implement fast FEC and CRC32; in my tests the limiting factor with this library was my RAM speed.
I did some research on this; it looks like its performance is not optimized for non-x86_64 environments?
You could use the isa-l library to implement fast FEC and CRC32; in my tests the limiting factor with this library was my RAM speed.
I did some research on this; it looks like its performance is not optimized for non-x86_64 environments?
They heavily optimized it for x86 environments, but there is some basic optimization that benefits all environments. For x86, because there are many SIMD extensions, they tried to take full advantage of them.
However, my test ran over a WAN with an unstable underlying link, so I was asking whether there is a standard way to measure performance. I will try running it between two machines on a LAN with stable, injected packet loss.
I think your idea works.
Personally, for convenience I would do it in VMs with virtualized LANs (I use Proxmox). Simulate packet loss with iptables or something else. Send packets at a fixed rate with iperf3.
Do you know how I can get some virtual machines running in BIG ENDIAN to test it? It seems even the latest ARM Macs use little endian.
Bochs can simulate BIG ENDIAN systems on a PC. The most commonly seen BIG ENDIAN systems nowadays are (BigEndian) MIPS. A simple verification on (BigEndian) MIPS with Bochs is sufficient IMO.
Before further tests: the lib has a GitHub mirror, and it has BIG ENDIAN support (https://github.com/stbrumme/crc32/commit/389309c1dc434444dedb67bc0674ef35c4769ce5); let me write a quick test to validate the lib.
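For example, a quick standalone check could look like the sketch below (assuming the imported header is crc32/Crc32.h). 0xCBF43926 is the standard CRC-32 check value for the ASCII string "123456789", so a correct build should return it on both little-endian and big-endian hosts.

```cpp
// Quick sanity test for the imported CRC32 implementation.
#include <cstdint>
#include <cstdio>
#include <cstring>

#include "crc32/Crc32.h"   // assumed header location; provides crc32_4bytes(data, length, previousCrc32)

int main()
{
    const char *msg = "123456789";
    uint32_t crc = crc32_4bytes(msg, strlen(msg), 0);
    printf("crc32(\"123456789\") = %08x (expected cbf43926)\n", (unsigned)crc);
    return crc == 0xCBF43926u ? 0 : 1;   // non-zero exit if the port is broken on this platform
}
```

Running it natively and then under big-endian emulation, and getting the same value both times, should cover the endian path.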
The code does have explicit Big Endian support:
uint32_t crc32_4bytes(const void* data, size_t length, uint32_t previousCrc32)
{
  uint32_t crc = ~previousCrc32; // same as previousCrc32 ^ 0xFFFFFFFF
  const uint32_t* current = (const uint32_t*) data;

  // process four bytes at once (Slicing-by-4)
  while (length >= 4)
  {
#if __BYTE_ORDER == __BIG_ENDIAN
    uint32_t one = *current++ ^ swap(crc);
    crc = Crc32Lookup[0][ one & 0xFF] ^
          Crc32Lookup[1][(one>> 8) & 0xFF] ^
          Crc32Lookup[2][(one>>16) & 0xFF] ^
          Crc32Lookup[3][(one>>24) & 0xFF];
#else
    uint32_t one = *current++ ^ crc;
    crc = Crc32Lookup[0][(one>>24) & 0xFF] ^
          Crc32Lookup[1][(one>>16) & 0xFF] ^
          Crc32Lookup[2][(one>> 8) & 0xFF] ^
          Crc32Lookup[3][ one & 0xFF];
#endif
    length -= 4;
  }
The risk that it breaks Big Endian should be small. I just merged it; thanks for the PR.
I am currently outside; I will double-check Big Endian before publishing a new release.
The code does have explicit Big Endian support: (crc32_4bytes source quoted above)
The risk that it breaks Big Endian should be small. I just merged it; thanks for the PR.
I am currently outside; I will double-check Big Endian before publishing a new release.
Hey @wangyu-, when I tested this, it needed CMakeLists.txt edited to add the new file as well. Also, I saw that the address sanitizer was enabled by default, which causes a big performance hit. See https://github.com/wangyu-/UDPspeeder/pull/327#issuecomment-2405784993. Thanks!
I will add the crc32/Crc32.cpp to CMakeLists.txt.
the CMakeLists.txt is not suggested for use, see:
#note: experimental
# currently only used for generating `compile_commands.json` for clangd
# to build this project, it's suggested to use `makefile` instead
cmake_minimum_required(VERSION 3.7)
project(speederv2)
Oh got it, yes I missed that comment. That explains the sanitizer setting as well.
@wangyu- Is it possible to release a new version so this optimization can be included in distros?