clang performance
On x86_64 I benchmarked gcc 9 and clang 8/9, and the clang builds perform noticeably worse. Is this known behaviour? Are there any options I should add when building with clang?
```
clang++-9 -std=c++11 -Wpedantic -Wall -DNDEBUG -O3 -g bench.cpp ../tests/common/simplethread.cpp systemtime.cpp -o benchmarks -pthread -Wl,--no-as-needed -lrt
$ ./benchmarks
|---------------- Min -----------------|----------------- Max -----------------|----------------- Avg -----------------|
Benchmark | RWQ | BRWCB | SPSC | Folly | RWQ | BRWCB | SPSC | Folly | RWQ | BRWCB | SPSC | Folly | xSPSC | xFolly
------------------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+-------+-------
Raw add | 0.0005s | 0.0011s | 0.0002s | 0.0001s | 0.0005s | 0.0011s | 0.0002s | 0.0001s | 0.0005s | 0.0011s | 0.0002s | 0.0001s | 0.43x | 0.17x
Raw remove | 0.0002s | 0.0010s | 0.0003s | 0.0001s | 0.0002s | 0.0010s | 0.0003s | 0.0001s | 0.0002s | 0.0010s | 0.0003s | 0.0001s | 1.97x | 0.41x
Raw empty remove | 0.0027s | 0.0011s | 0.0016s | 0.0015s | 0.0027s | 0.0012s | 0.0017s | 0.0016s | 0.0027s | 0.0012s | 0.0017s | 0.0016s | 0.62x | 0.59x
Single-threaded | 0.0043s | 0.0047s | 0.0039s | 0.0038s | 0.0043s | 0.0047s | 0.0039s | 0.0039s | 0.0043s | 0.0047s | 0.0039s | 0.0039s | 0.91x | 0.90x
Mostly add | 0.0067s | 0.0176s | 0.0053s | 0.0056s | 0.0068s | 0.0195s | 0.0060s | 0.0057s | 0.0068s | 0.0187s | 0.0058s | 0.0057s | 0.85x | 0.84x
Mostly remove | 0.0041s | 0.0059s | 0.0038s | 0.0043s | 0.0042s | 0.0060s | 0.0040s | 0.0044s | 0.0042s | 0.0059s | 0.0039s | 0.0043s | 0.93x | 1.03x
Heavy concurrent | 0.0092s | 0.0171s | 0.0046s | 0.0044s | 0.0263s | 0.0721s | 0.0047s | 0.0079s | 0.0179s | 0.0557s | 0.0047s | 0.0069s | 0.26x | 0.39x
Random concurrent | 0.0103s | 0.0133s | 0.0101s | 0.0103s | 0.0103s | 0.0135s | 0.0101s | 0.0104s | 0.0103s | 0.0134s | 0.0101s | 0.0103s | 0.98x | 1.00x
Average ops/s:
ReaderWriterQueue: 260.27 million
BlockingReaderWriterCircularBuffer: 275.78 million
SPSC queue: 295.60 million
Folly queue: 562.96 million
g++ -std=c++11 -Wpedantic -Wall -DNDEBUG -O3 -g bench.cpp ../tests/common/simplethread.cpp systemtime.cpp -o benchmarks -pthread -Wl,--no-as-needed -lrt
$ ./benchmarks
|---------------- Min -----------------|----------------- Max -----------------|----------------- Avg -----------------|
Benchmark | RWQ | BRWCB | SPSC | Folly | RWQ | BRWCB | SPSC | Folly | RWQ | BRWCB | SPSC | Folly | xSPSC | xFolly
------------------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+-------+-------
Raw add | 0.0001s | 0.0013s | 0.0002s | 0.0002s | 0.0001s | 0.0013s | 0.0003s | 0.0002s | 0.0001s | 0.0013s | 0.0003s | 0.0002s | 1.76x | 1.10x
Raw remove | 0.0002s | 0.0010s | 0.0002s | 0.0002s | 0.0002s | 0.0010s | 0.0003s | 0.0002s | 0.0002s | 0.0010s | 0.0003s | 0.0002s | 1.63x | 1.24x
Raw empty remove | 0.0022s | 0.0009s | 0.0016s | 0.0011s | 0.0022s | 0.0009s | 0.0017s | 0.0011s | 0.0022s | 0.0009s | 0.0017s | 0.0011s | 0.76x | 0.49x
Single-threaded | 0.0046s | 0.0054s | 0.0045s | 0.0045s | 0.0046s | 0.0054s | 0.0046s | 0.0045s | 0.0046s | 0.0054s | 0.0045s | 0.0045s | 0.99x | 0.99x
Mostly add | 0.0022s | 0.0170s | 0.0046s | 0.0048s | 0.0023s | 0.0170s | 0.0055s | 0.0049s | 0.0023s | 0.0170s | 0.0050s | 0.0049s | 2.23x | 2.16x
Mostly remove | 0.0042s | 0.0046s | 0.0041s | 0.0033s | 0.0042s | 0.0053s | 0.0044s | 0.0034s | 0.0042s | 0.0048s | 0.0043s | 0.0034s | 1.02x | 0.80x
Heavy concurrent | 0.0018s | 0.0150s | 0.0048s | 0.0115s | 0.0019s | 0.0256s | 0.0050s | 0.0168s | 0.0018s | 0.0190s | 0.0049s | 0.0149s | 2.68x | 8.09x
Random concurrent | 0.0127s | 0.0158s | 0.0130s | 0.0130s | 0.0128s | 0.0161s | 0.0130s | 0.0131s | 0.0128s | 0.0160s | 0.0130s | 0.0130s | 1.02x | 1.02x
Average ops/s:
ReaderWriterQueue: 504.36 million
BlockingReaderWriterCircularBuffer: 330.05 million
SPSC queue: 293.11 million
Folly queue: 452.94 million
```
Interesting. Maybe try -Os -fomit-frame-pointer instead of -O3? I'd have to look at the disassembly to see what's different.
No difference with -Os -fomit-frame-pointer. I looked a little more into it, but it's hard to tell exactly what's going on in the context of the full benchmark. In isolation, a simple "raw add" test performs very similarly between clang and gcc. The overall benchmark seems to vary between runs as well, although that might be because I'm running on a VM in the cloud.
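Since the numbers vary between runs, one way to judge whether a clang/gcc gap is real is to repeat the isolated test several times and compare the spread with the gap. The following is only a minimal sketch: `TrialStats`, `run_trials` and `relative_spread` are hypothetical helper names, not part of the benchmark suite, and the workload passed in is whatever isolated test you want to measure.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Summary of repeated timings in seconds (same idea as the Min/Max/Avg
// columns in the benchmark output). Hypothetical helper type.
struct TrialStats {
    double min_s;
    double max_s;
    double avg_s;
};

// Runs `work` `trials` times and records the wall-clock duration of each run.
// `trials` must be at least 1.
template <typename Work>
TrialStats run_trials(std::size_t trials, Work work) {
    assert(trials > 0);
    std::vector<double> secs;
    secs.reserve(trials);
    for (std::size_t i = 0; i < trials; ++i) {
        auto start = std::chrono::steady_clock::now();
        work();
        auto stop = std::chrono::steady_clock::now();
        secs.push_back(std::chrono::duration<double>(stop - start).count());
    }
    double total = 0.0;
    for (double s : secs) total += s;
    return TrialStats{
        *std::min_element(secs.begin(), secs.end()),
        *std::max_element(secs.begin(), secs.end()),
        total / static_cast<double>(secs.size())
    };
}

// Spread between the slowest and fastest trial, relative to the fastest.
// Returns 0 when the fastest trial was too short to measure.
double relative_spread(const TrialStats& s) {
    return s.min_s > 0.0 ? (s.max_s - s.min_s) / s.min_s : 0.0;
}
```

If the relative spread within one compiler's runs is about as large as the difference between the two compilers, the difference is probably noise from the machine (a cloud VM, for instance) rather than from code generation.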
I would suggest benchmarking a mock-up of your particular use case with both clang and gcc to see if in your particular case there's a stark difference in performance or not. In my experience, clang's optimizations tend to be hit-or-miss, and can vary depending on the surrounding context.
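As a rough illustration of what such a mock-up could look like, here is a sketch of a one-producer, one-consumer workload written against a queue type that exposes `try_enqueue` and `try_dequeue`. `StandInQueue` is a hypothetical mutex-guarded placeholder so the example is self-contained; it is not one of the benchmarked queues and should be replaced with the queue you actually use. `produce_consume` is likewise a made-up name.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>
#include <queue>
#include <thread>

// Hypothetical stand-in: a mutex-guarded std::queue with a try_enqueue /
// try_dequeue interface. Swap in the real queue under test.
template <typename T>
class StandInQueue {
public:
    bool try_enqueue(const T& value) {
        std::lock_guard<std::mutex> lock(mutex_);
        items_.push(value);
        return true;
    }

    bool try_dequeue(T& out) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (items_.empty()) return false;
        out = items_.front();
        items_.pop();
        return true;
    }

private:
    std::mutex mutex_;
    std::queue<T> items_;
};

// A producer thread pushes 0..count-1 while the calling thread consumes
// until it has received `count` items. The sum is returned so the work
// cannot be optimized away and the result can be checked.
template <typename Queue>
std::uint64_t produce_consume(Queue& q, std::uint64_t count) {
    std::thread producer([&q, count] {
        for (std::uint64_t i = 0; i < count; ++i) {
            while (!q.try_enqueue(i)) std::this_thread::yield();
        }
    });

    std::uint64_t sum = 0;
    std::uint64_t received = 0;
    std::uint64_t item = 0;
    while (received < count) {
        if (q.try_dequeue(item)) {
            sum += item;
            ++received;
        } else {
            std::this_thread::yield();
        }
    }
    producer.join();
    assert(received == count);
    return sum;
}
```

Building the same file with `g++` and `clang++` using the flags from the original report (including `-pthread`), then timing a call such as `produce_consume(q, N)` with an item count and payload type that match your application, should show whether the gap matters for your case.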
Thanks for the reply. I can't investigate at the moment; I'll try later.