Speed drop when running oneDNN in a subthread
Summary
Speed drop when running oneDNN in a subthread.
Version
oneDNN 3.4.2 with GNU OpenMP (4.5)
Environment
- CPU: Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz
- OS version: 3.10.0-1160.88.1.el7.x86_64 #1 SMP Tue Mar 7 15:41:52 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
- Compiler version: 11.3.0
Steps to reproduce
See attached main.cpp (main.cpp.txt).
Observed behavior
We observe a ~13% slowdown when we perform a oneDNN matmul in a subthread.
Shape                   | matmul (ms) | matmul async (ms)
(30000, 512) * (512, 2) | 0.4086      | 0.4505
Expected behavior
We would expect a similar speed when running in a subthread.
Hi @matiaslin. From the reproducer shared, it seems the observations are based on a single time measurement. Have you tried performing multiple runs to stabilize the performance numbers?
I would suggest reusing this example, building an asynchronous execution, and verifying with the proposed methodology. Thanks.
Thanks for the recommendation, @dzarukin. I believe I do execute the matmul primitive multiple times to obtain the results posted above. In the TIME macro, we repeat the expression REPEAT times. Is this what you are referring to?
It is, yes. It seems I filtered the macro-style variable out... A couple more questions: why do you benchmark creation + execution, is that intended? And what would prove that the unintended timing comes from the library and not from std::async overhead, which would spend time submitting the code before executing it?
Just to be sure: is this a single-threaded version?
why do you benchmark creation + execution, is it intended?
We don't: the TIME() macro only measures the execution time.
What would be the proof that unintended timing is coming from the library and not from std::async overhead which would spend time to submit the code and then execute it?
Because the TIME macro only repeats and measures the primitive execution. Best
Please attach the ONEDNN_VERBOSE=1 logs with 20 iterations for both modes.
I ran with ONEDNN_VERBOSE=1 and REPEAT set to 20 for both modes. See attached verbose.log. Thank you!
As the oneDNN execution doesn't change between the two modes, and given there are 8 threads, I would expect the async mode to introduce a resource issue, eventually over-subscription or something along those lines, since std threading is OMP-unaware.
I'd expect those numbers to be aligned once oneDNN is built with a sequential runtime (with some potential delta due to async overhead).
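For reference, the CPU runtime is selected when building oneDNN via the DNNL_CPU_RUNTIME CMake option; a sequential build would look roughly like this (paths and generator flags are illustrative):

```shell
# Configure and build oneDNN with the sequential CPU runtime
# (no OpenMP threading), from a oneDNN source checkout.
cmake -B build -DDNNL_CPU_RUNTIME=SEQ .
cmake --build build -j
```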
Thanks @dzarukin.
We will test with the SEQ build of oneDNN, but our goal is still to get normal speed when using oneDNN with OMP in a subthread.
Could you please run that main.cpp via Intel VTune and upload the results somewhere? We would do it ourselves, but these Xeon(R) Platinum 8375C CPUs are in the cloud, under a hypervisor, which prevents VTune from producing meaningful reports.
Best
@onednnsupporttriage, would you be able to help with this request?
@matiaslin Thanks for reaching out. We did some experiments with your sample code. These are our findings:
- Increasing the number of threads helps alleviate the performance gap with the subthread.
- With 8 threads, you can use the following OMP configuration:
export OMP_PROC_BIND=spread
export OMP_PLACES=threads
export OMP_NUM_THREADS=8
numactl --membind 0 --cpunodebind 0 ./main
I have tried running on Xeon 4th generation CPUs and saw similar performance (non-async vs async) with the above-mentioned setup.
Closing as stale. Feel free to reopen with additional data.