
hsbench shows lower performance with avx512 than with sse3

Open chunshengxiao opened this issue 5 years ago • 8 comments

My CPU is an Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz. I built Hyperscan on this machine for both the avx512 and sse3 instruction sets, then used hsbench with the officially downloaded data to test performance.

The commands I used are as follows:

for avx512:

cmake -DBUILD_AVX512=on -DCMAKE_C_FLAGS="-march=native" -DCMAKE_CXX_FLAGS="-march=native"  -DFAT_RUNTIME=0 ..
make -j80
# run command
taskset 1 hsbench -e pcre/snort_literals -c corpora/alexa200.db -V

for sse3 (on the same CPU):

cmake -DCMAKE_C_FLAGS="-march=core2" -DCMAKE_CXX_FLAGS="-march=core2" -DFAT_RUNTIME=0 ..
make -j80
# run command
taskset 1 hsbench -e pcre/snort_literals -c corpora/alexa200.db -V

My gcc version is 7.3.0, and the operating system is Ubuntu 18.04.

sse3 runs nearly 10% faster than avx512. Is this result reasonable?

These are my results (the avx512 build first, then the sse3 build):

*** Snort literals against HTTP traffic, block mode.

Signatures:        pcre/snort_literals
Hyperscan info:    Version: 5.2.1 Features: AVX512 Mode: VECTORED
Expression count:  3,116
Bytecode size:     695,608 bytes
Database CRC:      0xe4f2719
Scratch size:      5,479 bytes
Compile time:      0.083 seconds
Peak heap usage:   192,765,952 bytes

Time spent scanning:       7.906 seconds
Corpus size:               177,087,567 bytes (130,957 blocks in 5,400 vectors)
Matches per iteration:     81,963 (0.474 matches/kilobyte)
Overall block rate:        331,268.29 blocks/sec
Mean throughput (overall): 3,583.68 Mbit/sec
Max throughput (per core): 3,767.96 Mbit/sec

*** Snort literals against HTTP traffic, block mode.

Signatures:        pcre/snort_literals
Hyperscan info:    Version: 5.2.1 Features:  Mode: VECTORED
Expression count:  3,116
Bytecode size:     695,608 bytes
Database CRC:      0xe4f2719
Scratch size:      5,479 bytes
Compile time:      0.085 seconds
Peak heap usage:   193,003,520 bytes

Time spent scanning:       6.730 seconds
Corpus size:               177,087,567 bytes (130,957 blocks in 5,400 vectors)
Matches per iteration:     81,963 (0.474 matches/kilobyte)
Overall block rate:        389,196.47 blocks/sec
Mean throughput (overall): 4,210.35 Mbit/sec
Max throughput (per core): 4,438.81 Mbit/sec

chunshengxiao avatar Jan 20 '20 02:01 chunshengxiao

Hi, your AVX512 result shows a drop of nearly 15% against SSSE3, which seems too large to me. This case only exercises the large-scale multi-literal matching part of Hyperscan, which currently has no AVX2/AVX512 optimizations. As you can see, the bytecode sizes and CRCs are exactly the same: both builds produce and run the same engines with the same runtime implementation, so the same performance is expected. AVX512 may cause a small performance drop because of the frequency drop, but 15% is too much. I ran your commands on my server and saw a 0.7% drop for AVX512 against SSSE3, which seems reasonable to me. I suggest you run the test again.
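
As a quick check of the 15% figure, the gap can be recomputed from the two "Mean throughput (overall)" values in the original report. This is only a sketch of the arithmetic; the helper names are illustrative and not part of hsbench.

```python
# Mean throughput (overall) in Mbit/sec, copied from the two runs in the report.
AVX512_MBPS = 3583.68
SSE3_MBPS = 4210.35


def relative_drop(slow: float, fast: float) -> float:
    """Percentage by which `slow` falls short of `fast`."""
    return (fast - slow) / fast * 100.0


def relative_gain(slow: float, fast: float) -> float:
    """Percentage by which `fast` exceeds `slow`."""
    return (fast - slow) / slow * 100.0


if __name__ == "__main__":
    drop = relative_drop(AVX512_MBPS, SSE3_MBPS)
    gain = relative_gain(AVX512_MBPS, SSE3_MBPS)
    print(f"AVX512 build is {drop:.1f}% slower than the SSE3 build")
    print(f"SSE3 build is {gain:.1f}% faster than the AVX512 build")
```

The two percentages differ because they use different baselines, so it matters which build is taken as the reference when quoting a single number.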

fatchanghao avatar Jan 20 '20 04:01 fatchanghao

Hello, I retested, but I cannot reproduce your difference of about 0.7%; my final result is still around 15%.

My code was downloaded from the master branch, and the test case comes from data

Are we using the same source code and data set?

chunshengxiao avatar Jan 20 '20 08:01 chunshengxiao

I believe we're using the same code, rules and corpus, because the bytecode CRC, corpus size and match rate are all the same. My result under AVX512 is as follows:

Signatures:        ../signatures/HSBench/pcre/snort_literals
Hyperscan info:    Version: 5.2.1 Features: AVX512 Mode: VECTORED
Expression count:  3,116
Bytecode size:     695,608 bytes
Database CRC:      0xe4f2719
Scratch size:      5,479 bytes
Compile time:      0.075 seconds
Peak heap usage:   192,999,424 bytes

Time spent scanning:       5.645 seconds
Corpus size:               177,087,567 bytes (130,957 blocks in 5,400 vectors)
Matches per iteration:     81,963 (0.474 matches/kilobyte)
Overall block rate:        463,951.77 blocks/sec
Mean throughput (overall): 5,019.06 Mbit/sec
Max throughput (per core): 5,148.03 Mbit/sec
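
The "same code, rules and corpus" check can also be done mechanically on saved hsbench output. The sketch below assumes the output has been captured as text with one "Key: value" field per line; the function names and the choice of fields to compare are illustrative, not something hsbench provides.

```python
# Fields that identify the compiled database and the corpus (illustrative choice).
IDENTITY_FIELDS = (
    "Bytecode size",
    "Database CRC",
    "Corpus size",
    "Matches per iteration",
)


def parse_hsbench(text: str) -> dict:
    """Collect 'Key: value' lines from saved hsbench output into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip():
            fields[key.strip()] = value.strip()
    return fields


def same_workload(run_a: str, run_b: str) -> bool:
    """True if both runs report identical values for every identity field."""
    a, b = parse_hsbench(run_a), parse_hsbench(run_b)
    return all(f in a and a[f] == b.get(f) for f in IDENTITY_FIELDS)


# Excerpts of the two AVX512 runs quoted in this thread.
REPORTER_RUN = """\
Bytecode size:     695,608 bytes
Database CRC:      0xe4f2719
Corpus size:               177,087,567 bytes (130,957 blocks in 5,400 vectors)
Matches per iteration:     81,963 (0.474 matches/kilobyte)
Mean throughput (overall): 3,583.68 Mbit/sec
"""
MAINTAINER_RUN = """\
Bytecode size:     695,608 bytes
Database CRC:      0xe4f2719
Corpus size:               177,087,567 bytes (130,957 blocks in 5,400 vectors)
Matches per iteration:     81,963 (0.474 matches/kilobyte)
Mean throughput (overall): 5,019.06 Mbit/sec
"""

if __name__ == "__main__":
    print("same workload:", same_workload(REPORTER_RUN, MAINTAINER_RUN))
```

Throughput is deliberately left out of the comparison, since it is the quantity under investigation rather than part of the workload's identity.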

At the moment I have no idea how this could happen; we will investigate further. A 15% drop looks like the overhead from assertions, but that shouldn't apply here.

fatchanghao avatar Jan 20 '20 08:01 fatchanghao

I don't think it's a matter of assertions,
because we did not enable the debug option. In addition, when I trace my program with gdb, the assert statements are not executed.

chunshengxiao avatar Jan 20 '20 09:01 chunshengxiao

OK, so the problem should be elsewhere.

fatchanghao avatar Jan 20 '20 09:01 fatchanghao

@fatchanghao I would be very grateful if you could share the CPU, gcc and other related details of your test setup. I want to check whether some other factor is interfering.

chunshengxiao avatar Jan 20 '20 12:01 chunshengxiao

Intel(R) Xeon(R) Platinum 8160 CPU @ 2.10GHz, gcc 7.2.0, Ubuntu 17.10, Hyperscan 5.2.1. The build commands, rules and corpus are the same as yours.

fatchanghao avatar Jan 21 '20 00:01 fatchanghao

I found an Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz machine with gcc 7.4.0 and Ubuntu 18.04. The benchmark on this platform showed a 1.8% performance drop with AVX512.

fatchanghao avatar Jan 21 '20 03:01 fatchanghao