Speed drop when running oneDNN in a subthread
Summary
Speed drop when running oneDNN in a subthread.
Version
oneDNN 3.4.2 with GNU OpenMP (4.5)
Environment
- CPU: Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz
- OS version: 3.10.0-1160.88.1.el7.x86_64 #1 SMP Tue Mar 7 15:41:52 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
- Compiler version: 11.3.0
Steps to reproduce
See attached main.cpp (main.cpp.txt).
Observed behavior
We observe a ~13% slowdown when we perform a oneDNN matmul in a subthread.
Shape                   | matmul (ms) | matmul async (ms)
(30000, 512) * (512, 2) | 0.4086      | 0.4505
Expected behavior
We would expect a similar speed when running in a subthread.
Hi @matiaslin. From the reproducer shared, it seems the observations are based on a single time measurement. Have you tried performing multiple runs to stabilize the performance numbers?
I would suggest reusing this example, building an asynchronous execution, and verifying with the proposed methodology. Thanks.
Thanks for the recommendation, @dzarukin. I believe I do execute the matmul primitive multiple times to obtain the results posted above. In the TIME macro, we repeat the expression REPEAT times. Is this what you are referring to?
It is, yes. It seems I filtered the macro-style variable out... A couple more questions: why do you benchmark creation + execution, is that intended? And what would prove that the unintended timing comes from the library and not from std::async overhead, which would spend time submitting the code before executing it?
Just to be sure: is this a single-threaded version?
why do you benchmark creation + execution, is it intended?
We don't: the TIME() macro only measures the execution time.
What would be the proof that unintended timing is coming from the library and not from std::async overhead which would spend time to submit the code and then execute it?
Because the TIME macro only repeats and measures the primitive execution. Best
Please attach the ONEDNN_VERBOSE=1 logs with 20 iterations for both modes.
I ran with ONEDNN_VERBOSE=1 and REPEAT set to 20 for both modes. See attached verbose.log. Thank you!
As the oneDNN execution doesn't change between the two modes, and given there are 8 threads, I would expect the async mode to introduce a resource issue, eventually over-subscription or something along those lines, since std threading is OMP-unaware.
I'd expect those numbers to be aligned once oneDNN is built with a sequential runtime (with some potential delta due to async overhead).
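For reference, the CPU runtime is selected when building oneDNN via the DNNL_CPU_RUNTIME CMake option; a sequential build would look roughly like this (paths and generator flags are illustrative):

```shell
# Configure and build oneDNN with the sequential CPU runtime
# (no OpenMP threading), from a oneDNN source checkout.
cmake -B build -DDNNL_CPU_RUNTIME=SEQ .
cmake --build build -j
```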
Thanks @dzarukin.
We will test with the SEQ build of oneDNN, but our goal is still to get normal speed when using oneDNN with OMP in a subthread.
Could you please run that main.cpp via Intel VTune and upload the results somewhere? We would do it ourselves, but these Xeon(R) Platinum 8375C CPUs are in the cloud, under a hypervisor, which prevents VTune from producing meaningful reports.
Best
@onednnsupporttriage, would you be able to help with this request?
@matiaslin Thanks for reaching out. We did some experiments with your sample code. These are our findings:
- Increasing the number of threads helps alleviate the performance gap with the subthread.
- With 8 threads, you can use the following OMP configuration:
export OMP_PROC_BIND=spread
export OMP_PLACES=threads
export OMP_NUM_THREADS=8
numactl --membind 0 --cpunodebind 0 ./main
I have tried running on Xeon 4th generation CPUs and saw similar performance (non-async vs async) with the above-mentioned setup.
Closing as stale. Feel free to reopen with additional data.