BWDW JIT 256 error reproducible through gtests, but not through benchdnn
### Summary
Depthwise backward-weights on AArch64 SVE-256 produces incorrect results for strided, padded cases (e.g., C=24, Kh=3, Sh=2, Ph=1). PyTorch test TestConvolutionNN.test_Conv2d_OneDNN fails, while benchdnn does not flag the issue. A regression (gtest) comparing a legacy blocked-oh path vs a new per-row path exposes the defect.
Fix merged in PR #4081.

cc @Sqvid
### Version

- oneDNN v3.9.1 (commit 80a3a8e745d2f0186e674b0af9332fd6e074c94f)
- Also reproduced with oneDNN v3.7.1
### Environment

- CPU: AArch64 SVE (256-bit), Neoverse V1
- oneDNN runtime: OpenMP, nthr=32
- PyTorch (aarch64 build) using the oneDNN backend
- Python 3.10
### Steps to reproduce

#### 1) PyTorch unit test (fails)

```sh
# ONEDNN_VERBOSE=all to capture the chosen impl and commit
export ONEDNN_VERBOSE=all
python pytorch/test/nn/test_convolution.py TestConvolutionNN.test_Conv2d_OneDNN
```
Typical verbose snippet at failure:

```
onednn_verbose,v1,info,oneDNN v3.9.1 (commit 80a3a8e...)
onednn_verbose,v1,primitive,exec,cpu,convolution,jit_dw:sve_256,forward_training,...
onednn_verbose,v1,primitive,exec,cpu,convolution,jit_dw:sve_256,backward_weights,...,g24mb1_ic24oc24_ih6oh3kh3sh2ph1_iw6ow3kw3sw2pw1
```
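As a quick sanity check on these shapes, the descriptor's output sizes follow the standard convolution output-size relation (a sketch using a hypothetical helper `conv_out_size`, not oneDNN code):

```python
def conv_out_size(i, k, s, p, d=1):
    """Standard convolution output size: floor((i + 2p - d(k-1) - 1) / s) + 1."""
    return (i + 2 * p - d * (k - 1) - 1) // s + 1

# Failing PyTorch shape: ih=6, kh=3, sh=2, ph=1 -> oh=3 (matches ih6oh3 above)
print(conv_out_size(6, 3, 2, 1))  # -> 3
# gtest shape below: ih=8, kh=3, sh=2, ph=1 -> oh=4 (matches ih8...oh4)
print(conv_out_size(8, 3, 2, 1))  # -> 4
```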
#### 2) Detailed C++ gtest reproduction steps

Start from oneDNN v3.9.1 (commit 80a3a8e745d2f0186e674b0af9332fd6e074c94f) on AArch64 SVE-256 (Neoverse V1).

Prerequisites:

- Replace `tests/gtests/test_convolution_backward_weights_dw_compare.cpp` and `src/cpu/aarch64/jit_uni_dw_convolution.cpp` with the supplied versions (attachments).
- `tests/gtests/test_convolution_backward_weights_dw_compare.cpp` (attachment) compares the legacy vs new AArch64 DW BWD_W paths (env-switchable):
  - `ONEDNN_AARCH64_DW_BWDW_USE_OLD=1` → legacy path
  - unset → new per-row path
- Descriptor used: `g24mb1_ic24ih8iw8_oc24oh4ow4_kh3kw3_sh2sw2_ph1pw1`
Build configuration:

```sh
# Configure with tests enabled
cmake -S . -B build -DDNNL_BUILD_TESTS=ON
# Rebuild so both the kernel and the gtest pick up the changes
cmake --build build --target all -- -j$(nproc)
```
Run the regression test:

```sh
cd build && ctest -V -R test_convolution_backward_weights_dw_compare
```
Optional benchdnn verification:

```sh
ONEDNN_VERBOSE=all ./build/tests/benchdnn/benchdnn --conv --dir=BWD_W --fast-ref=false g24mb1_ic24ih8iw8_oc24oh4ow4_kh3kw3_sh2sw2_ph1pw1
```
### Logs & diff evidence

- Each run writes `depthwise_bwdw_compare.log` next to the binary (`build/tests/gtests/depthwise_bwdw_compare.log`).
- The log header shows both impl IDs, the benchdnn descriptor, and a replay command (see `tests/gtests/test_convolution_backward_weights_dw_compare.cpp:186-201`).
### Observed behavior

- PyTorch test failure:
  ```
  AssertionError: Tensor-likes are not close!
  Mismatched elements: 72 / 216 (33.3%)
  Greatest absolute difference: 3.0
  ```
- oneDNN chooses `jit_dw:sve_256` for both FWD and BWD_W on the above config.
- The gtest A/B comparison shows the legacy path accumulates extra bottom-row contributions on strided, padded cases (duplicate accumulation at tile boundaries). The new per-row path matches a naïve reference.
- benchdnn did not reproduce the mismatch (even with `--fast-ref=false` and buffer replay).
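A naïve depthwise backward-weights reference of the kind the gtest compares against can be sketched as follows (my own sketch with hypothetical names, not the attached reference code). With all-ones src and diff_dst, as in the failing test, each weight tap's gradient is simply the count of in-bounds (output position, tap) pairs, so duplicated bottom-row accumulation shows up directly in the last kernel row:

```python
# Naive per-channel depthwise BWD_W for all-ones src/diff_dst
# (hypothetical helper, not the attached reference code).
def dw_bwd_weights_ones(ih, iw, kh, kw, sh, sw, ph, pw):
    oh = (ih + 2 * ph - kh) // sh + 1
    ow = (iw + 2 * pw - kw) // sw + 1
    dw = [[0.0] * kw for _ in range(kh)]
    for oy in range(oh):
        for ox in range(ow):
            for ky in range(kh):
                for kx in range(kw):
                    iy = oy * sh - ph + ky
                    ix = ox * sw - pw + kx
                    if 0 <= iy < ih and 0 <= ix < iw:
                        dw[ky][kx] += 1.0  # src * diff_dst == 1 * 1
    return dw

# Failing shape: ih=iw=6, kh=kw=3, stride 2, padding 1 (oh=ow=3)
dw = dw_bwd_weights_ones(6, 6, 3, 3, 2, 2, 1, 1)
for row in dw:
    print(row)
# Any extra bottom-row accumulation would inflate dw[2][*] above these counts.
```

For this shape the per-channel reference gradient is `[[4, 6, 6], [6, 9, 9], [6, 9, 9]]`. Note also that 216 = 24 channels × 3×3 taps and 72 = 24 × 3, i.e. exactly one kernel row per channel is mismatched in the PyTorch failure, consistent with the bottom-row duplication.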
Workaround validated: removing the AArch64 JIT BWD_W (SVE-256) path from the CPU convolution implementation list avoids the failure (the fallback path passes, as it already does on Neoverse N1 and Neoverse V2).
### Expected behavior

Backward-weights results should match the naïve reference (and the mkldnn-disabled PyTorch path) with zero elementwise diffs for these configs.
### Additional notes

- After applying the fix from PR #4081, the PyTorch unit tests and nightly suite pass; the attached gtest shows the new path matches the reference.
- Toggle: `ONEDNN_AARCH64_DW_BWDW_USE_OLD=1` (legacy) vs unset (new).
### Attachments

- `src/cpu/aarch64/jit_uni_dw_convolution.cpp`: kernel version with the legacy/new path toggle
- `tests/gtests/test_convolution_backward_weights_dw_compare.cpp`: gtest repro (includes the env flag to toggle old/new)
### Related PR

- https://github.com/uxlfoundation/oneDNN/pull/4081
Hi Gassan, if you could find the smallest reproducible shape, and the actual values of the input tensors that fail in PyTorch, that would be very helpful.
Hi @Ryo-not-rio, the smallest possible PyTorch failure case is the one above. However, it can also be replicated with C=8, H=W=6, S=2.
Attributes for kernel selection (jit 256 bwdw):
- C a multiple of 8
- H = W >= 6 (and a multiple of 2)
- Dilation = 1
- Stride > 1
The input and weight values are ones in the test.
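Restating the reported selection attributes as a quick predicate (a heuristic sketch of the conditions above with a hypothetical name, not oneDNN's actual dispatch logic):

```python
def hits_jit256_bwdw(c, h, w, stride, dilation=1):
    """Heuristic restatement of the reported selection attributes."""
    return (c % 8 == 0 and h == w and h >= 6 and h % 2 == 0
            and dilation == 1 and stride > 1)

print(hits_jit256_bwdw(8, 6, 6, 2))   # smallest reported failing case -> True
print(hits_jit256_bwdw(24, 6, 6, 2))  # original PyTorch failing case -> True
print(hits_jit256_bwdw(8, 6, 6, 1))   # unit stride -> False
```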
`-DDNNL_ENABLE_CONCURRENT_EXEC=ON`
Is this needed to trigger your bug?
@Sqvid no. I've just removed it.
With C=8, H=W=6, S=2, could you post the output of printing the input and weight tensors in PyTorch? It would be good to get an idea of the dtype we're dealing with: floats, negative numbers, Infs, etc.