oneDNN
Performance regression from v1.4 to v2.6
Summary
I have benchmarked various standard deep learning networks such as AlexNet, GoogleNet, ResNet50, and MobileNet-V2, and I have observed that oneDNN v2.6 performs slower than v1.4.
Version
v2.6
Environment
oneDNN includes hardware-specific optimizations and may behave differently depending on the compiler and build environment. Include the following information to help reproduce the issue:
- CPU :
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 46 bits physical, 48 bits virtual
CPU(s): 12
On-line CPU(s) list: 0-11
Thread(s) per core: 2
Core(s) per socket: 6
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) W-2133 CPU @ 3.60GHz
Stepping: 4
CPU MHz: 1200.451
CPU max MHz: 3900.0000
CPU min MHz: 1200.0000
BogoMIPS: 7200.00
Virtualization: VT-x
L1d cache: 192 KiB
L1i cache: 192 KiB
L2 cache: 6 MiB
L3 cache: 8.3 MiB
NUMA node0 CPU(s): 0-11
Vulnerability Itlb multihit: KVM: Mitigation: VMX disabled
Vulnerability L1tf: Mitigation; PTE Inversion; VMX conditional cache flushes, SMT vulnerable
Vulnerability Mds: Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Meltdown: Mitigation; PTI
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Mitigation; Clear CPU buffers; SMT vulnerable
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti intel_ppin ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req md_clear flush_l1d arch_capabilities
- OS version :
Linux <username> 5.10.0-14-amd64 #1 SMP Debian 5.10.113-1 (2022-04-29) x86_64 GNU/Linux
- Compiler version:
gcc (Debian 10.2.1-6) 10.2.1 20210110
- CMake version:
3.18.4
Steps to reproduce
Benchmarked GoogleNet for v1.4 against v2.6. I have saved the verbose output for both versions. Please find the verbose output in the log files below: v1_4.txt v2_6.txt
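The verbose output comes from oneDNN's verbose mode. For reference, a minimal sketch of one way to enable it, either via the `DNNL_VERBOSE=1` environment variable or programmatically as shown below (the actual benchmark harness is not included here):

```cpp
#include "dnnl.hpp"

int main() {
    // Same effect as setting DNNL_VERBOSE=1 in the environment: each primitive
    // execution prints one line to stdout with the implementation name, memory
    // formats, problem descriptor, and execution time in milliseconds.
    dnnl::set_verbose(1);

    // ... create and execute the network's primitives here ...
    return 0;
}
```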
Observed behavior

Hi,
I have analyzed further using the verbose logs. Please find the summary of the primitives created using the verbose logs from this link.
For v1.4, we always choose blocked layouts such as nChw8c and nChw16c as the optimal formats. oneDNN v1.4 also always chooses these blocked layouts (nChw8c/nChw16c) as the default optimal formats for convolution based on the architecture (nChw8c for avx2 and nChw16c for avx512). So, for some of the oneDNN layers (lrn, pool, concat, etc.), if the input memory coming from the previous layer is not in the blocked layout (nChw8c/nChw16c), I reorder the input memory to the blocked layout (nChw8c/nChw16c) based on the architecture. This strategy helped improve performance.
In oneDNN v2.6, nhwc is chosen as the optimal format on the AVX512 architecture. I have observed that the convolution operation also chooses nhwc instead of the blocked layout (nChw16c). But I still assumed the blocked layout (nChw16c) was the optimal format, and this assumption caused input reorders (nChw16c-->nhwc) in convolution in v2.6.
Below are my questions:
- How does oneDNN choose the optimal format for operations like convolution? For v2.6, is it always nhwc on avx512?
- Why nhwc instead of the blocked layout (nChw16c) for convolution on the AVX512 architecture? Did you see any performance gain using nhwc? In my observation, the blocked layout (nChw8c) performed better compared with nhwc. You can see the results in the screenshot attached in the above comment.
Thanks, Hari
Hi,
+Adding another observation: I did the benchmarking on an avx2 machine with oneDNN version 2.6. I found that on the avx2 machine, v2.6 convolution chooses 'nChw8c' as the optimal output format, which is the same behavior as in older versions.
Hi @Hari-MathWorks ,
Thanks for the question.
- First of all, the observed behavior is expected. The problem with a blocked format like `nChw8c` is that it is not always possible to integrate format propagation into a framework/application. As a result, we are trying to provide optimized versions of oneDNN primitives using the `nhwc` format for data (src, dst). This applies more to the latest architectures (AVX512+) because they have higher priority, which is why you still see `nChw8c` on the AVX2 machine. Weights should always be kept in format `any` so the implementation will choose the best format even if data is in `nhwc` format.
- The underlying format is an implementation detail when format `any` is used, so oneDNN can't guarantee that it will stay the same. We don't recommend that applications/frameworks hard-code particular blocked formats for oneDNN primitives, because new implementations might use different formats and that would introduce unnecessary reorders.

Overall, if possible, use `any` for all convolution/ip/matmul memory descriptors so oneDNN can choose the best implementation. If format propagation is not possible for data, keep weights in format `any`. The rest of the primitives (lrn, pool, concat, etc.) should work on whatever format comes from the previous layer (conv/ip/matmul). If we optimize a conv for a particular format (nhwc or nChw8c), we make sure that other primitives work fast on that memory format to provide model-level speed-up.
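To make that concrete, here is a minimal sketch of the recommended flow with the v2.x C++ API. The shapes and names are illustrative only, not taken from your code:

```cpp
#include "dnnl.hpp"
using namespace dnnl;

// Let the convolution pick its own layouts via format_tag::any, then reorder
// the application's data only when the chosen layout actually differs.
void conv_with_format_any(engine &eng, stream &strm, memory &user_src_mem) {
    // user_src_mem holds the application's data in a plain layout (e.g. nchw).
    const memory::dims src_dims = {1, 64, 56, 56};   // N, IC, H, W
    const memory::dims wei_dims = {128, 64, 3, 3};   // OC, IC, KH, KW
    const memory::dims dst_dims = {1, 128, 56, 56};  // N, OC, H, W

    // format_tag::any lets the implementation pick the best layout.
    memory::desc conv_src_md(src_dims, memory::data_type::f32, memory::format_tag::any);
    memory::desc conv_wei_md(wei_dims, memory::data_type::f32, memory::format_tag::any);
    memory::desc conv_dst_md(dst_dims, memory::data_type::f32, memory::format_tag::any);

    convolution_forward::desc conv_d(prop_kind::forward_inference,
            algorithm::convolution_direct, conv_src_md, conv_wei_md,
            conv_dst_md, /*strides*/ {1, 1}, /*padding_l*/ {1, 1},
            /*padding_r*/ {1, 1});
    convolution_forward::primitive_desc conv_pd(conv_d, eng);

    // The chosen layout (nChw16c, nhwc, ...) is an implementation detail;
    // reorder the user's data only if the descriptors differ.
    memory conv_src_mem = user_src_mem;
    if (conv_pd.src_desc() != user_src_mem.get_desc()) {
        conv_src_mem = memory(conv_pd.src_desc(), eng);
        reorder(user_src_mem, conv_src_mem)
                .execute(strm, user_src_mem, conv_src_mem);
    }
    // Downstream primitives (pooling, lrn, ...) can then be created directly
    // on conv_pd.dst_desc() so they consume whatever layout the conv produces.
}
```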
Regards, Igor
We have an article in oneDNN developer guide on this topic: Memory Format Propagation
Hi @igorsafo and @vpirogov,
Thanks for your response.
I'm not hard-coding the memory format of the destination memory for any operation primitive; I always use `any`.
But I conditionally insert source reorders (reorder the source to the optimized format, i.e., nChw8c on avx2 and nChw16c on avx512) for other primitives (lrn, pool, concat). Below is one of the reasons for doing this:
Let's say we have a graph architecture like `ConvLayer-->customLayer-->maxPoolLayer`. Here `customLayer` is not a oneDNN layer, and it always operates in `nchw` format.
- For `ConvLayer`, I use `any` as the destination memory format, and let's say oneDNN chooses `nChw16c` as the destination memory format.
- As `customLayer` operates in `nchw`, it reorders the source from `nChw16c` to `nchw`, and its destination memory is also in the `nchw` layout.
- In `maxPoolLayer`, the src memory layout will be `nchw`, the pool operation is performed on this layout, and the destination memory is also in `nchw` format. I have observed that the pooling operation performs slower with `nchw` source memory compared with blocked-layout (nChw8c/nChw16c) source memory; below are the results. Because of this, if the source is not in a blocked format, we manually insert a source reorder (nchw-->blocked layout, i.e., nChw8c on avx2 and nChw16c on avx512) in layers like maxpool, lrn, concat, etc. (a rough sketch of this logic follows this list).
- In `maxPoolLayer`, I insert the src reorder only because the src memory is not in the optimal layout. Suppose we have `ConvLayer-->maxPoolLayer`; then no src reorder is inserted in the `maxPoolLayer` layer.
- So, the problem for us is that I always insert a src reorder to the blocked layout in max pool and some other layers if the source is not in a blocked layout. For `v1.4` there is no issue because oneDNN also always chooses a blocked layout as the optimal format. For `v2.6`, more reorders are inserted because oneDNN chooses 'nhwc' as the optimal format rather than a blocked layout, while our assumption is still that the blocked layout is the optimal format.
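For clarity, here is a rough sketch of the kind of framework-side logic I am describing. The helper name and structure are illustrative, not our actual code: the source of pool/lrn/concat is forced into a hard-coded blocked tag chosen by the build's ISA, which matched v1.4's choices but now triggers an extra reorder whenever the producing convolution hands over an nhwc tensor.

```cpp
#include "dnnl.hpp"
using namespace dnnl;

// Hypothetical helper: reorder a primitive's source into a hard-coded blocked
// layout (nChw8c on avx2, nChw16c on avx512) if it is not already blocked.
// On v2.6 an nhwc tensor coming out of convolution also lands in the reorder
// branch, which is the source of the extra reorders.
memory reorder_to_blocked_if_needed(memory src, bool has_avx512,
        engine &eng, stream &strm) {
    const memory::desc src_md = src.get_desc();
    const memory::format_tag blocked_tag = has_avx512
            ? memory::format_tag::nChw16c
            : memory::format_tag::nChw8c;
    const memory::desc blocked_md(src_md.dims(), src_md.data_type(), blocked_tag);

    if (src_md == blocked_md) return src; // already in the assumed layout

    memory blocked_mem(blocked_md, eng);  // extra allocation + reorder cost
    reorder(src, blocked_mem).execute(strm, src, blocked_mem);
    return blocked_mem;
}
```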
Moreover, I have observed that for the convolution operation, the `nChw16c` layout performs better compared with the `nhwc` layout.
You can see the convolution layer timings in the table attached in the observed behavior section of the first comment.
@Hari-MathWorks Thank you for the explanation of the problem!
So if I understand correctly there are 2 issues:
- Due to the changed format there are additional reorders. Is it possible to store the tag used by a convolution so the rest of the primitives will use it? So if conv decides to work on `nChw8c` the rest will work on it; if conv decides to work on `nhwc` the rest will work on it as well. This way unnecessary reorders will be removed and all primitives will use an optimized implementation. A sketch of this idea follows below.
- Convolution with `any` in v2.6 is slower than convolution with `any` in v1.4. Could you please provide more details about your configuration (compiler, threading, CPU SKU, etc.) so I can create an internal issue to investigate the regression?
Hi @igorsafo ,
Thanks for your response.
- Regarding additional reorders: We can store the convolution output memory format and reuse it. But the problem is that a deep learning network does not always contain a convolution layer; we can have a network without one. For that reason, for 1.4 we use nChw8c for avx2 and nChw16c for avx512 by default as the optimal formats. To eliminate these reorders, we can follow a new rule for 2.6, i.e., nChw8c for avx2 and nhwc for avx512 as the default optimal formats.
- Regarding regression in v2.6:
There are several convolution layer configurations in GoogleNet that consume more time compared with v1.4. You can find these configurations in the verbose files attached in the first comment.
The graph below (created using the timings from the above verbose output text files) shows the performance of the 39 convolution layers in GoogleNet in order. You can see that v2.6 performs slower than v1.4 overall.
I have also observed a drop in the concat layer. Please find the graph below for more details (created using the timings from the above verbose output text files).
For the other details like compiler, threading, and CPU, please refer to the environment section in the first comment.
Hi @igorsafo , @vpirogov ,
+Adding another observation: I'm facing this regression problem for RNN networks too. I have created a separate issue (https://github.com/oneapi-src/oneDNN/issues/1415) for RNN.
Hi @Hari-MathWorks , thank you. I created an internal tracker to investigate the performance regression of the convolution primitive.
Looping in @aice-support for MathWorks support.
Hi @Hari-MathWorks , I apologize for the delayed response.
Our internal tests showed convolutions in v2.6 (which use the nhwc format) operated faster compared to v1.4 (which uses the blocked format). We were able to revert the convolution kernel for oneDNN v2.6 from ‘brgconv’ back to ‘jit_avx512’. This forces the blocked format layout (nChw16c on avx512), which eliminates the extra reorders and produces results similar to oneDNN v1.4 (see table below).
How to perform kernel modifications:
- Navigate to the cpu_convolution_list.cpp file under oneDNN/src/cpu/
- Open up the file and comment out the following three lines (see the sketch after these steps for how the edited section looks):
  line 84: CPU_INSTANCE_AVX512(brdgmm_dw_convolution_fwd_t)
  line 86: CPU_INSTANCE_AVX512(brgemm_1x1_convolution_fwd_t<avx512_core>)
  line 87: CPU_INSTANCE_AVX512(brgemm_convolution_fwd_t<avx512_core>)
- Build the oneDNN library again
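For reference, after the edit the relevant entries in src/cpu/cpu_convolution_list.cpp would look roughly like this (line positions are as cited above and may shift slightly between v2.6 patch releases; surrounding entries are omitted):

```cpp
// src/cpu/cpu_convolution_list.cpp -- the three brgemm-based forward
// convolution entries commented out, so dispatching falls back to the
// jit_avx512 implementations (and the blocked nChw16c layout):
// CPU_INSTANCE_AVX512(brdgmm_dw_convolution_fwd_t)
// CPU_INSTANCE_AVX512(brgemm_1x1_convolution_fwd_t<avx512_core>)
// CPU_INSTANCE_AVX512(brgemm_convolution_fwd_t<avx512_core>)
```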
Please let us know if these modifications resolve the issue.
Regards, Orel