
Speed drop when running oneDNN in a subthread

Open matiaslin opened this issue 1 year ago • 10 comments

Summary

Speed drop when running oneDNN in a subthread.

Version

oneDNN 3.4.2 with GNU OpenMP (4.5)

Environment

  • CPU: Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz
  • OS version: 3.10.0-1160.88.1.el7.x86_64 #1 SMP Tue Mar 7 15:41:52 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
  • Compiler version: 11.3.0

Steps to reproduce

See attached main.cpp. main.cpp.txt
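
For orientation, below is a minimal sketch of what such a reproducer might look like (illustrative only: the matmul shape and the repeat-and-time convention follow this thread, while helper names such as run_matmul and the untimed buffer setup are assumptions; the attached main.cpp is authoritative).

// Illustrative sketch (hypothetical run_matmul helper; not the attached file):
// build an f32 matmul of the reported shape, then time only its execution,
// once on the main thread and once inside a std::async subthread.
#include <oneapi/dnnl/dnnl.hpp>
#include <chrono>
#include <future>
#include <iostream>

using namespace dnnl;

static double run_matmul(engine &eng, stream &strm, int repeat) {
    const memory::dim M = 30000, K = 512, N = 2;
    auto src_md = memory::desc({M, K}, memory::data_type::f32, memory::format_tag::ab);
    auto wei_md = memory::desc({K, N}, memory::data_type::f32, memory::format_tag::ab);
    auto dst_md = memory::desc({M, N}, memory::data_type::f32, memory::format_tag::ab);

    auto pd = matmul::primitive_desc(eng, src_md, wei_md, dst_md);
    auto prim = matmul(pd);

    // Buffers are left uninitialized; the values do not matter for timing.
    memory src(src_md, eng), wei(wei_md, eng), dst(dst_md, eng);

    // Time only the execution, averaged over `repeat` runs
    // (this loop stands in for the TIME(...) macro from the report).
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < repeat; ++i) {
        prim.execute(strm, {{DNNL_ARG_SRC, src}, {DNNL_ARG_WEIGHTS, wei}, {DNNL_ARG_DST, dst}});
        strm.wait();
    }
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count() / repeat;
}

int main() {
    engine eng(engine::kind::cpu, 0);
    stream strm(eng);
    const int repeat = 20;

    std::cout << "matmul (ms):       " << run_matmul(eng, strm, repeat) << "\n";

    // The same work dispatched to a subthread.
    auto fut = std::async(std::launch::async, [&] { return run_matmul(eng, strm, repeat); });
    std::cout << "matmul async (ms): " << fut.get() << "\n";
    return 0;
}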

Observed behavior

We observe a ~13% speed drop when we perform a oneDNN matmul in a subthread.

                                  matmul (ms)      matmul async (ms)
(30000, 512) * (512, 2)            0.4086            0.4505

Expected behavior

We would expect a similar speed when running in a subthread.

matiaslin avatar Jul 16 '24 23:07 matiaslin

Hi @matiaslin. From the reproducer shared, it seems the observations are based on a single time measurement. Have you tried performing multiple runs to stabilize the performance numbers?

I can suggest reusing this example, building an asynchronous execution on top of it, and verifying with the proposed methodology. Thanks.

dzarukin avatar Jul 16 '24 23:07 dzarukin

Thanks for the recommendation, @dzarukin. I believe I do execute the matmul primitive multiple times to obtain the results posted above. In the TIME macro, we repeat the expression REPEAT times. Is this what you are referring to?
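
For reference, a macro of the kind described might look roughly like this (a sketch, not the macro from the attached main.cpp): it evaluates an expression REPEAT times and prints the average wall-clock time in milliseconds.

// Hypothetical sketch of a TIME(expr)-style macro as described above: evaluate
// the expression REPEAT times and print the average wall-clock time in ms.
#include <chrono>
#include <cstdio>

#define REPEAT 20
#define TIME(expr)                                                            \
    do {                                                                      \
        auto t0_ = std::chrono::steady_clock::now();                          \
        for (int i_ = 0; i_ < REPEAT; ++i_) { expr; }                         \
        auto t1_ = std::chrono::steady_clock::now();                          \
        std::printf("%s: %.4f ms\n", #expr,                                   \
                std::chrono::duration<double, std::milli>(t1_ - t0_).count()  \
                        / REPEAT);                                            \
    } while (0)

// Usage: TIME(run_matmul()); // where run_matmul() wraps only prim.execute + wait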

matiaslin avatar Jul 16 '24 23:07 matiaslin

It is, yes. It seems I filtered the macro-style variable out... A couple more questions: why do you benchmark creation + execution, is that intended? And what would prove that the unintended timing comes from the library and not from std::async overhead, which spends time submitting the work before executing it?

Just to be sure: is this a single-threaded version?

dzarukin avatar Jul 16 '24 23:07 dzarukin

why do you benchmark creation + execution, is it intended?

We don't: note that the TIME() macro only measures the execution time.

What would be the proof that unintended timing is coming from the library and not from std::async overhead which would spend time to submit the code and then execute it?

Because the TIME macro only repeats and measures the primitive execution. Best

WilliamTambellini avatar Jul 17 '24 05:07 WilliamTambellini

Please attach ONEDNN_VERBOSE=1 logs with 20 iterations for both modes.

dzarukin avatar Jul 17 '24 05:07 dzarukin

I ran with ONEDNN_VERBOSE=1 and REPEAT set to 20 for both modes. See the attached verbose.log. Thank you! verbose.log

matiaslin avatar Jul 17 '24 15:07 matiaslin

Since the oneDNN execution doesn't change between the two modes, and given there are 8 threads, I would expect the async mode to introduce a resource issue, likely over-subscription or something along those lines, because std threading is OMP-unaware.

I'd expect those numbers to be aligned once oneDNN is built with a sequential runtime (with some potential delta due to async overhead).
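
A minimal sketch of how to check this hypothesis, assuming oneDNN is linked against GNU OpenMP: query the OpenMP runtime from the main thread and from a std::async subthread and compare the two reports.

// Sketch: compare the OpenMP configuration seen by the main thread and by a
// std::async subthread. With libgomp, the subthread gets its own OpenMP
// thread team, so two teams can coexist and compete for the same cores.
#include <future>
#include <iostream>
#include <omp.h>

static void report(const char *who) {
    std::cout << who << ": omp_get_max_threads() = "
              << omp_get_max_threads() << std::endl;
}

int main() {
    report("main thread");
    std::async(std::launch::async, [] { report("std::async subthread"); }).get();
    return 0;
}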

dzarukin avatar Jul 17 '24 16:07 dzarukin

Thanks @dzarukin,
We'll test with the SEQ build of oneDNN, but our goal is still to get normal speed when using the oneDNN OMP build in a subthread. Could you please run that main.cpp through Intel VTune and upload the results somewhere? We would do it ourselves, but these Xeon(R) Platinum 8375C CPUs are in the cloud, i.e. under a hypervisor, which prevents getting meaningful reports from VTune. Best

WilliamTambellini avatar Jul 17 '24 17:07 WilliamTambellini

@onednnsupporttriage, would you be able to help with this request?

vpirogov avatar Jul 18 '24 20:07 vpirogov

@matiaslin Thanks for reaching out. We did some experiments with your sample code. These are our findings:

  • Increasing the number of threads helps alleviate the performance gap with the subthread.
  • With 8 threads, you can use the following OMP configuration:

export OMP_PROC_BIND=spread
export OMP_PLACES=threads
export OMP_NUM_THREADS=8
numactl --membind 0 --cpunodebind 0 ./main
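
As a quick sanity check, the effect of these settings can be printed from inside the process with standard OpenMP queries (a small sketch, not part of the sample code):

// Sketch: print what the OMP_PROC_BIND / OMP_PLACES / OMP_NUM_THREADS
// settings above resolve to at runtime, using standard OpenMP 4.x queries.
#include <cstdio>
#include <omp.h>

int main() {
    std::printf("max threads: %d\n", omp_get_max_threads());
    std::printf("proc bind (0=false,1=true,2=master,3=close,4=spread): %d\n",
                (int)omp_get_proc_bind());
    std::printf("places: %d\n", omp_get_num_places());
    return 0;
}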

I tried running on 4th generation Xeon CPUs and saw similar performance (non-async vs. async) with the above setup.

rupakroyintel avatar Jul 23 '24 14:07 rupakroyintel

Closing as stale. Feel free to reopen with additional data.

vpirogov avatar Aug 28 '24 21:08 vpirogov