BWDW JIT 256 error reproducible through gtests, but not through benchdnn
### Summary
Depthwise backward-weights on AArch64 SVE-256 produces incorrect results for strided, padded cases (e.g., C=24, Kh=3, Sh=2, Ph=1). PyTorch test TestConvolutionNN.test_Conv2d_OneDNN fails, while benchdnn does not flag the issue. A regression (gtest) comparing a legacy blocked-oh path vs a new per-row path exposes the defect.
Fix merged in PR #4081.

cc @Sqvid
### Version

- oneDNN v3.9.1 (commit 80a3a8e745d2f0186e674b0af9332fd6e074c94f)
- Also reproduced with oneDNN v3.7.1
### Environment

- CPU: AArch64 SVE (256-bit), Neoverse V1
- oneDNN runtime: OpenMP, nthr=32
- PyTorch (aarch64 build) using the oneDNN backend
- Python 3.10
### Steps to reproduce

#### 1) PyTorch unit test (fails)

```sh
# ONEDNN_VERBOSE=all to capture the chosen impl and commit
export ONEDNN_VERBOSE=all
python pytorch/test/nn/test_convolution.py TestConvolutionNN.test_Conv2d_OneDNN
```
Typical verbose snippet at failure:

```
onednn_verbose,v1,info,oneDNN v3.9.1 (commit 80a3a8e...)
onednn_verbose,v1,primitive,exec,cpu,convolution,jit_dw:sve_256,forward_training,...
onednn_verbose,v1,primitive,exec,cpu,convolution,jit_dw:sve_256,backward_weights,...,g24mb1_ic24oc24_ih6oh3kh3sh2ph1_iw6ow3kw3sw2pw1
```
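As a quick sanity check on these shapes, the descriptor's output sizes follow the standard convolution output-size relation (a sketch using a hypothetical helper `conv_out_size`, not oneDNN code):

```python
def conv_out_size(i, k, s, p, d=1):
    """Standard convolution output size: floor((i + 2p - d(k-1) - 1) / s) + 1."""
    return (i + 2 * p - d * (k - 1) - 1) // s + 1

# Failing PyTorch shape: ih=6, kh=3, sh=2, ph=1 -> oh=3 (matches ih6oh3 above)
print(conv_out_size(6, 3, 2, 1))  # -> 3
# gtest shape below: ih=8, kh=3, sh=2, ph=1 -> oh=4 (matches ih8...oh4)
print(conv_out_size(8, 3, 2, 1))  # -> 4
```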
#### 2) Detailed C++ gtest reproduction steps

Start from oneDNN v3.9.1 (commit 80a3a8e745d2f0186e674b0af9332fd6e074c94f) on AArch64 SVE-256 (Neoverse V1).

Prerequisites:

- Replace `tests/gtests/test_convolution_backward_weights_dw_compare.cpp` and `src/cpu/aarch64/jit_uni_dw_convolution.cpp` with the supplied versions (attachments).
- `tests/gtests/test_convolution_backward_weights_dw_compare.cpp` (attachment) compares the legacy vs new AArch64 DW BWD_W paths (env-switchable):
  - `ONEDNN_AARCH64_DW_BWDW_USE_OLD=1` → legacy path
  - unset → new per-row path
- Descriptor used: `g24mb1_ic24ih8iw8_oc24oh4ow4_kh3kw3_sh2sw2_ph1pw1`
Build configuration:

```sh
# Configure with tests enabled
cmake -S . -B build -DDNNL_BUILD_TESTS=ON
# Rebuild so both the kernel and the gtest pick up the changes
cmake --build build --target all -- -j$(nproc)
```
Run the regression test:

```sh
cd build && ctest -V -R test_convolution_backward_weights_dw_compare
```
Optional benchdnn verification:

```sh
ONEDNN_VERBOSE=all ./build/tests/benchdnn/benchdnn --conv --dir=BWD_W --fast-ref=false g24mb1_ic24ih8iw8_oc24oh4ow4_kh3kw3_sh2sw2_ph1pw1
```
### Logs & diff evidence

- Each run writes `depthwise_bwdw_compare.log` next to the binary (`build/tests/gtests/depthwise_bwdw_compare.log`).
- The log header shows both impl IDs, the benchdnn descriptor, and a replay command (see `tests/gtests/test_convolution_backward_weights_dw_compare.cpp:186-201`).
### Observed behavior

- PyTorch test failure:
  ```
  AssertionError: Tensor-likes are not close!
  Mismatched elements: 72 / 216 (33.3%)
  Greatest absolute difference: 3.0
  ```
- oneDNN chooses `jit_dw:sve_256` for both FWD and BWD_W on the above config.
- The gtest A/B comparison shows the legacy path accumulates extra bottom-row contributions on strided, padded cases (duplicate accumulation at tile boundaries). The new per-row path matches a naïve reference.
- benchdnn did not reproduce the mismatch (even with `--fast-ref=false` and buffer replay).
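A naïve depthwise backward-weights reference of the kind the gtest compares against can be sketched as follows (my own sketch with hypothetical names, not the attached reference code). With all-ones src and diff_dst, as in the failing test, each weight tap's gradient is simply the count of in-bounds (output position, tap) pairs, so duplicated bottom-row accumulation shows up directly in the last kernel row:

```python
# Naive per-channel depthwise BWD_W for all-ones src/diff_dst
# (hypothetical helper, not the attached reference code).
def dw_bwd_weights_ones(ih, iw, kh, kw, sh, sw, ph, pw):
    oh = (ih + 2 * ph - kh) // sh + 1
    ow = (iw + 2 * pw - kw) // sw + 1
    dw = [[0.0] * kw for _ in range(kh)]
    for oy in range(oh):
        for ox in range(ow):
            for ky in range(kh):
                for kx in range(kw):
                    iy = oy * sh - ph + ky
                    ix = ox * sw - pw + kx
                    if 0 <= iy < ih and 0 <= ix < iw:
                        dw[ky][kx] += 1.0  # src * diff_dst == 1 * 1
    return dw

# Failing shape: ih=iw=6, kh=kw=3, stride 2, padding 1 (oh=ow=3)
dw = dw_bwd_weights_ones(6, 6, 3, 3, 2, 2, 1, 1)
for row in dw:
    print(row)
# Any extra bottom-row accumulation would inflate dw[2][*] above these counts.
```

For this shape the per-channel reference gradient is `[[4, 6, 6], [6, 9, 9], [6, 9, 9]]`. Note also that 216 = 24 channels × 3×3 taps and 72 = 24 × 3, i.e. exactly one kernel row per channel is mismatched in the PyTorch failure, consistent with the bottom-row duplication.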
Workaround validated: removing the AArch64 JIT BWD_W (SVE-256) path from the CPU convolution implementation list avoids the failure (the fallback path passes, as it already does on Neoverse N1 and Neoverse V2).
### Expected behavior

Backward-weights results should match the naïve reference (and the mkldnn-disabled PyTorch path) with zero elementwise diffs for these configs.
### Additional notes

- After applying the fix from PR #4081, the PyTorch unit tests and nightly suite pass; the attached gtest shows the new path matches the reference.
- Toggle: `ONEDNN_AARCH64_DW_BWDW_USE_OLD=1` (legacy) vs unset (new).
### Attachments

- `src/cpu/aarch64/jit_uni_dw_convolution.cpp`: kernel version with the legacy/new path toggle
- `tests/gtests/test_convolution_backward_weights_dw_compare.cpp`: gtest repro (includes the env flag to toggle old/new)
### Related PR

- https://github.com/uxlfoundation/oneDNN/pull/4081
Hi Gassan, if you could find the smallest reproducible shape, and the actual values of the input tensors that fail in PyTorch, that would be very helpful.
Hi @Ryo-not-rio, the smallest possible PyTorch failure case is the one above. However, it can also be replicated with C=8, H=W=6, S=2.
Attributes for kernel selection (jit 256 bwdw):
- C a multiple of 8
- H = W >= 6 (and a multiple of 2)
- Dilation = 1
- Stride > 1
The input and weight values are ones in the test.
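Restating the reported selection attributes as a quick predicate (a heuristic sketch of the conditions above with a hypothetical name, not oneDNN's actual dispatch logic):

```python
def hits_jit256_bwdw(c, h, w, stride, dilation=1):
    """Heuristic restatement of the reported selection attributes."""
    return (c % 8 == 0 and h == w and h >= 6 and h % 2 == 0
            and dilation == 1 and stride > 1)

print(hits_jit256_bwdw(8, 6, 6, 2))   # smallest reported failing case -> True
print(hits_jit256_bwdw(24, 6, 6, 2))  # original PyTorch failing case -> True
print(hits_jit256_bwdw(8, 6, 6, 1))   # unit stride -> False
```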
`-DDNNL_ENABLE_CONCURRENT_EXEC=ON`
Is this needed to trigger your bug?
@Sqvid no. I've just removed it.
With C=8, H=W=6, S=2, could you post the output of printing the input and weight tensors in PyTorch? It would be good to get an idea of the dtype we're dealing with: floats, negative numbers, Infs, etc.