oneDNN
Performance regression from v1.4 to v2.6
Summary
I have benchmarked various standard deep learning networks such as AlexNet, GoogleNet, ResNet50, and MobileNet-V2, and I have observed that oneDNN v2.6 performs slower than v1.4.
Version
v2.6
Environment
oneDNN includes hardware-specific optimizations and may behave differently depending on the compiler and build environment. Include the following information to help reproduce the issue:
- CPU :
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 46 bits physical, 48 bits virtual
CPU(s): 12
On-line CPU(s) list: 0-11
Thread(s) per core: 2
Core(s) per socket: 6
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) W-2133 CPU @ 3.60GHz
Stepping: 4
CPU MHz: 1200.451
CPU max MHz: 3900.0000
CPU min MHz: 1200.0000
BogoMIPS: 7200.00
Virtualization: VT-x
L1d cache: 192 KiB
L1i cache: 192 KiB
L2 cache: 6 MiB
L3 cache: 8.3 MiB
NUMA node0 CPU(s): 0-11
Vulnerability Itlb multihit: KVM: Mitigation: VMX disabled
Vulnerability L1tf: Mitigation; PTE Inversion; VMX conditional cache flushes, SMT vulnerable
Vulnerability Mds: Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Meltdown: Mitigation; PTI
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Mitigation; Clear CPU buffers; SMT vulnerable
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti intel_ppin ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req md_clear flush_l1d arch_capabilities
- OS version :
Linux <username> 5.10.0-14-amd64 #1 SMP Debian 5.10.113-1 (2022-04-29) x86_64 GNU/Linux
- Compiler version:
gcc (Debian 10.2.1-6) 10.2.1 20210110
- CMake version:
3.18.4
Steps to reproduce
Benchmarked GoogleNet for v1.4 against v2.6. I have saved the verbose output for both versions. Please find the verbose output in the log files below: v1_4.txt v2_6.txt
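The verbose output comes from oneDNN's verbose mode. For reference, a minimal sketch of one way to enable it, either via the `DNNL_VERBOSE=1` environment variable or programmatically as shown below (the actual benchmark harness is not included here):

```cpp
#include "dnnl.hpp"

int main() {
    // Same effect as setting DNNL_VERBOSE=1 in the environment: each primitive
    // execution prints one line to stdout with the implementation name, memory
    // formats, problem descriptor, and execution time in milliseconds.
    dnnl::set_verbose(1);

    // ... create and execute the network's primitives here ...
    return 0;
}
```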
Observed behavior

Hi,
I have analyzed further using the verbose logs. Please find the summary of the primitives created using the verbose logs from this link.
For v1.4, we always choose blocked layouts such as nChw8c and nChw16c as the optimal formats. oneDNN v1.4 also always chooses these blocked layouts (nChw8c/nChw16c) as the default optimal formats for convolution based on the architecture (nChw8c for avx2 and nChw16c for avx512). So, for some of the oneDNN layers (lrn, pool, concat, etc.), if the input memory coming from the previous layer is not in the blocked layout (nChw8c/nChw16c), I reorder the input memory to the blocked layout (nChw8c/nChw16c) based on the architecture. This strategy helped improve performance.
In oneDNN v2.6, nhwc is chosen as the optimal format on the AVX512 architecture. I have observed that the convolution operation also chooses nhwc instead of the blocked layout (nChw16c). But I still assumed the blocked layout (nChw16c) was the optimal format, and this assumption caused input reorders (nChw16c-->nhwc) in convolution in v2.6.
Below are my questions:
- How does oneDNN choose the optimal format for operations like convolution? For v2.6, is it always nhwc on avx512?
- Why nhwc instead of the blocked layout (nChw16c) for convolution on the AVX512 architecture? Did you see any performance gain using nhwc? In my observation, the blocked layout (nChw8c) performed better compared with nhwc. You can see the results in the screenshot attached in the above comment.
Thanks, Hari
Hi,
+Adding another observation: I did the benchmarking on an avx2 machine with oneDNN version 2.6. I found that on the avx2 machine, v2.6 convolution chooses 'nChw8c' as the optimal output format, which is the same behavior as in older versions.
Hi @Hari-MathWorks ,
Thanks for the question.
- First of all, the observed behavior is expected. The problem with a blocked format like `nChw8c` is that it is not always possible to integrate format propagation into a framework/application. As a result, we are trying to provide optimized versions of oneDNN primitives using the `nhwc` format for data (src, dst). This applies more to the latest architectures (AVX512+) because they have higher priority, which is why you still see `nChw8c` on the AVX2 machine. Weights should always be kept in format `any` so the implementation will choose the best format even if data is in `nhwc` format.
- The underlying format is an implementation detail when format `any` is used, so oneDNN can't guarantee that it will stay the same. We don't recommend that applications/frameworks hard-code particular blocked formats for oneDNN primitives, because new implementations might use different formats and that would introduce unnecessary reorders.

Overall, if possible, use `any` for all convolution/ip/matmul memory descriptors so oneDNN can choose the best implementation. If format propagation is not possible for data, keep weights in format `any`. The rest of the primitives (lrn, pool, concat, etc.) should work on whatever format comes from the previous layer (conv/ip/matmul). If we optimize a conv for a particular format (nhwc or nChw8c), we make sure that other primitives work fast on that memory format to provide model-level speed-up.
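To make that concrete, here is a minimal sketch of the recommended flow with the v2.x C++ API. The shapes and names are illustrative only, not taken from your code:

```cpp
#include "dnnl.hpp"
using namespace dnnl;

// Let the convolution pick its own layouts via format_tag::any, then reorder
// the application's data only when the chosen layout actually differs.
void conv_with_format_any(engine &eng, stream &strm, memory &user_src_mem) {
    // user_src_mem holds the application's data in a plain layout (e.g. nchw).
    const memory::dims src_dims = {1, 64, 56, 56};   // N, IC, H, W
    const memory::dims wei_dims = {128, 64, 3, 3};   // OC, IC, KH, KW
    const memory::dims dst_dims = {1, 128, 56, 56};  // N, OC, H, W

    // format_tag::any lets the implementation pick the best layout.
    memory::desc conv_src_md(src_dims, memory::data_type::f32, memory::format_tag::any);
    memory::desc conv_wei_md(wei_dims, memory::data_type::f32, memory::format_tag::any);
    memory::desc conv_dst_md(dst_dims, memory::data_type::f32, memory::format_tag::any);

    convolution_forward::desc conv_d(prop_kind::forward_inference,
            algorithm::convolution_direct, conv_src_md, conv_wei_md,
            conv_dst_md, /*strides*/ {1, 1}, /*padding_l*/ {1, 1},
            /*padding_r*/ {1, 1});
    convolution_forward::primitive_desc conv_pd(conv_d, eng);

    // The chosen layout (nChw16c, nhwc, ...) is an implementation detail;
    // reorder the user's data only if the descriptors differ.
    memory conv_src_mem = user_src_mem;
    if (conv_pd.src_desc() != user_src_mem.get_desc()) {
        conv_src_mem = memory(conv_pd.src_desc(), eng);
        reorder(user_src_mem, conv_src_mem)
                .execute(strm, user_src_mem, conv_src_mem);
    }
    // Downstream primitives (pooling, lrn, ...) can then be created directly
    // on conv_pd.dst_desc() so they consume whatever layout the conv produces.
}
```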
Regards, Igor
We have an article in oneDNN developer guide on this topic: Memory Format Propagation
Hi @igorsafo and @vpirogov,
Thanks for your response.
I'm not hard-coding the memory format of the destination memory for any operation primitive; I always use `any`.
But I conditionally insert source reorders (reorder the source to the optimized format, i.e., nChw8c on avx2 and nChw16c on avx512) for other primitives (lrn, pool, concat). Below is one of the reasons for doing this:
Let's say we have a graph architecture like `ConvLayer-->customLayer-->maxPoolLayer`. Here `customLayer` is not a oneDNN layer, and it always operates in `nchw` format.
- For `ConvLayer`, I use `any` as the destination memory format, and let's say oneDNN chooses `nChw16c` as the destination memory format.
- As `customLayer` operates in `nchw`, it reorders the source from `nChw16c` to `nchw`, and its destination memory is also in the `nchw` layout.
- In `maxPoolLayer`, the src memory layout will be `nchw`, the pool operation is performed on this layout, and the destination memory is also in `nchw` format. I have observed that the pooling operation performs slower with `nchw` source memory compared with blocked-layout (nChw8c/nChw16c) source memory; below are the results. Because of this, if the source is not in a blocked format, we manually insert a source reorder (nchw-->blocked layout, i.e., nChw8c on avx2 and nChw16c on avx512) in layers like maxpool, lrn, concat, etc. (a rough sketch of this logic follows this list).
- In `maxPoolLayer`, I insert the src reorder only because the src memory is not in the optimal layout. Suppose we have `ConvLayer-->maxPoolLayer`; then no src reorder is inserted in the `maxPoolLayer` layer.
- So, the problem for us is that I always insert a src reorder to the blocked layout in max pool and some other layers if the source is not in a blocked layout. For `v1.4` there is no issue because oneDNN also always chooses a blocked layout as the optimal format. For `v2.6`, more reorders are inserted because oneDNN chooses 'nhwc' as the optimal format rather than a blocked layout, while our assumption is still that the blocked layout is the optimal format.
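For clarity, here is a rough sketch of the kind of framework-side logic I am describing. The helper name and structure are illustrative, not our actual code: the source of pool/lrn/concat is forced into a hard-coded blocked tag chosen by the build's ISA, which matched v1.4's choices but now triggers an extra reorder whenever the producing convolution hands over an nhwc tensor.

```cpp
#include "dnnl.hpp"
using namespace dnnl;

// Hypothetical helper: reorder a primitive's source into a hard-coded blocked
// layout (nChw8c on avx2, nChw16c on avx512) if it is not already blocked.
// On v2.6 an nhwc tensor coming out of convolution also lands in the reorder
// branch, which is the source of the extra reorders.
memory reorder_to_blocked_if_needed(memory src, bool has_avx512,
        engine &eng, stream &strm) {
    const memory::desc src_md = src.get_desc();
    const memory::format_tag blocked_tag = has_avx512
            ? memory::format_tag::nChw16c
            : memory::format_tag::nChw8c;
    const memory::desc blocked_md(src_md.dims(), src_md.data_type(), blocked_tag);

    if (src_md == blocked_md) return src; // already in the assumed layout

    memory blocked_mem(blocked_md, eng);  // extra allocation + reorder cost
    reorder(src, blocked_mem).execute(strm, src, blocked_mem);
    return blocked_mem;
}
```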
Moreover, I have observed that for the convolution operation, the `nChw16c` layout performs better compared with the `nhwc` layout.
You can see the convolution layer timings in the table attached in the observed behavior section of the first comment.
@Hari-MathWorks Thank you for the explanation of the problem!
So if I understand correctly there are 2 issues:
- Due to the changed format there are additional reorders. Is it possible to store the tag used by a convolution so the rest of the primitives will use it? So if conv decides to work on `nChw8c` the rest will work on it; if conv decides to work on `nhwc` the rest will work on it as well. This way unnecessary reorders will be removed and all primitives will use an optimized implementation. A sketch of this idea follows below.
- Convolution with `any` in v2.6 is slower than convolution with `any` in v1.4. Could you please provide more details about your configuration (compiler, threading, CPU SKU, etc.) so I can create an internal issue to investigate the regression?
Hi @igorsafo ,
Thanks for your response.
- Regarding additional reorders: We can store the convolution output memory format and reuse it. But the problem is that a deep learning network does not always contain a convolution layer; we can have a network without one. For that reason, for 1.4 we use nChw8c for avx2 and nChw16c for avx512 by default as the optimal formats. To eliminate these reorders, we can follow a new rule for 2.6, i.e., nChw8c for avx2 and nhwc for avx512 as the default optimal formats.
- Regarding regression in v2.6:
There are several convolution layer configurations in GoogleNet that consume more time compared with v1.4. You can find these configurations in the verbose files attached in the first comment.
The graph below (created using the timings from the above verbose output text files) shows the performance of the 39 convolution layers in GoogleNet in order. You can see that v2.6 performs slower than v1.4 overall.
I have also observed a drop in the concat layer. Please find the graph below for more details (created using the timings from the above verbose output text files).
For the other details like compiler, threading, and CPU, please refer to the environment section in the first comment.
Hi @igorsafo , @vpirogov ,
+Adding another observation: I'm facing this regression problem for RNN networks too. I have created a separate issue (https://github.com/oneapi-src/oneDNN/issues/1415) for RNN.
Hi @Hari-MathWorks , thank you. I created an internal tracker to investigate the performance regression of the convolution primitive.
Looping in @aice-support for MathWorks support.
Hi @Hari-MathWorks , I apologize for the delayed response.
Our internal tests showed convolutions in v2.6 (which use the nhwc format) operated faster compared to v1.4 (which uses the blocked format). We were able to revert the convolution kernel for oneDNN v2.6 from ‘brgconv’ back to ‘jit_avx512’. This forces the blocked format layout (nChw16c on avx512), which eliminates the extra reorders and produces results similar to oneDNN v1.4 (see table below).
How to perform kernel modifications:
- Navigate to the cpu_convolution_list.cpp file under oneDNN/src/cpu/
- Open up the file and comment out the following three lines (see the sketch after these steps for how the edited section looks):
  line 84: CPU_INSTANCE_AVX512(brdgmm_dw_convolution_fwd_t)
  line 86: CPU_INSTANCE_AVX512(brgemm_1x1_convolution_fwd_t<avx512_core>)
  line 87: CPU_INSTANCE_AVX512(brgemm_convolution_fwd_t<avx512_core>)
- Build the oneDNN library again
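For reference, after the edit the relevant entries in src/cpu/cpu_convolution_list.cpp would look roughly like this (line positions are as cited above and may shift slightly between v2.6 patch releases; surrounding entries are omitted):

```cpp
// src/cpu/cpu_convolution_list.cpp -- the three brgemm-based forward
// convolution entries commented out, so dispatching falls back to the
// jit_avx512 implementations (and the blocked nChw16c layout):
// CPU_INSTANCE_AVX512(brdgmm_dw_convolution_fwd_t)
// CPU_INSTANCE_AVX512(brgemm_1x1_convolution_fwd_t<avx512_core>)
// CPU_INSTANCE_AVX512(brgemm_convolution_fwd_t<avx512_core>)
```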
Please let us know if these modifications resolve the issue.
Regards, Orel