
BWDW JIT 256 error reproducible through gtests, but not through benchdnn

gassan-arm opened this issue 2 months ago • 5 comments

Summary

Depthwise backward-weights on AArch64 SVE-256 produces incorrect results for strided, padded cases (e.g., C=24, Kh=3, Sh=2, Ph=1). PyTorch test TestConvolutionNN.test_Conv2d_OneDNN fails, while benchdnn does not flag the issue. A regression (gtest) comparing a legacy blocked-oh path vs a new per-row path exposes the defect.
Fix merged in PR #4081.

cc @Sqvid

Version

  • oneDNN v3.9.1 (commit 80a3a8e745d2f0186e674b0af9332fd6e074c94f)
  • Also reproduced with oneDNN v3.7.1

Environment

  • CPU: AArch64 SVE (256-bit) (Neoverse V1)
  • oneDNN runtime: OpenMP, nthr=32
  • PyTorch (arm/aarch64 build) using oneDNN backend
  • Python 3.10

Steps to reproduce

1) PyTorch unit test (fails)

# ONEDNN_VERBOSE=all to capture impl & commit
export ONEDNN_VERBOSE=all
python pytorch/test/nn/test_convolution.py TestConvolutionNN.test_Conv2d_OneDNN

Typical verbose snippet at failure:

onednn_verbose,v1,info,oneDNN v3.9.1 (commit 80a3a8e...)
onednn_verbose,v1,primitive,exec,cpu,convolution,jit_dw:sve_256,forward_training,...
onednn_verbose,v1,primitive,exec,cpu,convolution,jit_dw:sve_256,backward_weights,...
g24mb1_ic24oc24_ih6oh3kh3sh2ph1_iw6ow3kw3sw2pw1
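For reference, the backward-weights computation this case exercises can be sketched in plain Python. The snippet below is a naïve per-element reference (not oneDNN's kernel) for a single depthwise channel with the failing geometry from the summary: ih=iw=6, kh=kw=3, stride 2, padding 1, and all-ones src and diff_dst as in the PyTorch test.

```python
def dw_bwd_weights_ref(src, diff_dst, kh, kw, sh, sw, ph, pw):
    """Naive depthwise backward-weights for one channel:
    diff_w[r][c] = sum over output positions of src * diff_dst,
    skipping src indices that fall into the zero padding."""
    ih, iw = len(src), len(src[0])
    oh, ow = len(diff_dst), len(diff_dst[0])
    dw = [[0.0] * kw for _ in range(kh)]
    for r in range(kh):
        for c in range(kw):
            for y in range(oh):
                for x in range(ow):
                    sy = y * sh - ph + r
                    sx = x * sw - pw + c
                    if 0 <= sy < ih and 0 <= sx < iw:
                        dw[r][c] += src[sy][sx] * diff_dst[y][x]
    return dw

src = [[1.0] * 6 for _ in range(6)]   # ih = iw = 6, all ones
ddst = [[1.0] * 3 for _ in range(3)]  # oh = ow = 3 for kh=3, sh=2, ph=1
print(dw_bwd_weights_ref(src, ddst, 3, 3, 2, 2, 1, 1))
# → [[4.0, 6.0, 6.0], [6.0, 9.0, 9.0], [6.0, 9.0, 9.0]]
```

With all-ones tensors each diff_weights entry is just the count of output positions whose sampled src index is in bounds, which makes any duplicate accumulation stand out as an integer excess.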

2) Detailed C++ gtest reproduction steps

Start from oneDNN v3.9.1 (commit 80a3a8e745d2f0186e674b0af9332fd6e074c94f) on AArch64 SVE-256 (Neoverse V1).

Prerequisites

  • Replace tests/gtests/test_convolution_backward_weights_dw_compare.cpp and src/cpu/aarch64/jit_uni_dw_convolution.cpp with the supplied versions (attachments)
  • The gtest compares the legacy vs new AArch64 DW BWD_W paths (env-switchable):
    • ONEDNN_AARCH64_DW_BWDW_USE_OLD=1 → legacy path
    • unset → new per-row path
  • Descriptor used: g24mb1_ic24ih8iw8_oc24oh4ow4_kh3kw3_sh2sw2_ph1pw1
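The descriptor's output spatial dims follow the standard convolution arithmetic; a quick dependency-free check (the helper name is ours, not oneDNN's):

```python
def conv_out_dim(in_dim, kernel, stride, pad, dilation=1):
    """Standard convolution output-size formula."""
    effective_k = dilation * (kernel - 1) + 1
    return (in_dim + 2 * pad - effective_k) // stride + 1

# Descriptor g24mb1_ic24ih8iw8_oc24oh4ow4_kh3kw3_sh2sw2_ph1pw1:
print(conv_out_dim(8, 3, 2, 1))  # oh = (8 + 2 - 3) // 2 + 1 = 4
# PyTorch case from the summary (ih=6, kh=3, sh=2, ph=1):
print(conv_out_dim(6, 3, 2, 1))  # oh = 3
```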

Build configuration

# Configure with tests enabled
cmake -S . -B build -DDNNL_BUILD_TESTS=ON

# Rebuild so both the kernel and gtest pick up changes
cmake --build build --target all -- -j$(nproc)

Run regression test

cd build && ctest -V -R test_convolution_backward_weights_dw_compare

Optional: benchdnn verification

ONEDNN_VERBOSE=all ./build/tests/benchdnn/benchdnn --conv --dir=BWD_W --fast-ref=false g24mb1_ic24ih8iw8_oc24oh4ow4_kh3kw3_sh2sw2_ph1pw1
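The benchdnn problem descriptor packs all dims into one string. A small parser sketch, assuming the simple `<name><int>` token shape used by this particular descriptor (real benchdnn descriptors can carry more syntax):

```python
import re

def parse_benchdnn_desc(desc):
    """Split a benchdnn conv descriptor like 'g24mb1_ic24ih8iw8_...'
    into {name: int} pairs; underscores are ignored."""
    return {name: int(val) for name, val in re.findall(r"([a-z]+)(\d+)", desc)}

dims = parse_benchdnn_desc("g24mb1_ic24ih8iw8_oc24oh4ow4_kh3kw3_sh2sw2_ph1pw1")
print(dims["g"], dims["ih"], dims["kh"], dims["sh"], dims["ph"])  # 24 8 3 2 1
```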

Logs & diff evidence

  • Each run writes depthwise_bwdw_compare.log next to the binary (build/tests/gtests/depthwise_bwdw_compare.log)
  • Header shows both impl IDs, benchdnn descriptor, and replay command (see tests/gtests/test_convolution_backward_weights_dw_compare.cpp:186-201)

Observed behavior

  • PyTorch test failure:
    AssertionError: Tensor-likes are not close!
    Mismatched elements: 72 / 216 (33.3%)
    Greatest absolute difference: 3.0
    
  • oneDNN chooses jit_dw:sve_256 for both FWD and BWD_W on the above config.
  • The gtest A/B comparison shows the legacy path accumulating extra bottom-row contributions in strided, padded cases (duplicate accumulation at tile boundaries); the new per-row path matches a naïve reference.
  • benchdnn did not reproduce the mismatch (even with --fast-ref=false and buffer replay).
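To illustrate the kind of defect described (a toy model, not the actual kernel code): a blocked loop over output rows whose tile bound is off by one processes the row at each tile boundary twice, while a per-row loop does not. Shown in 1-D with the ih=6, kh=3, sh=2, ph=1, all-ones case:

```python
def bwdw_1d(src, ddst, kh, sh, ph, rows):
    """Accumulate 1-D depthwise diff_weights over the given output rows."""
    dw = [0.0] * kh
    for r in range(kh):
        for y in rows:
            sy = y * sh - ph + r
            if 0 <= sy < len(src):
                dw[r] += src[sy] * ddst[y]
    return dw

src, ddst = [1.0] * 6, [1.0] * 3            # ih=6, oh=3, kh=3, sh=2, ph=1
ref = bwdw_1d(src, ddst, 3, 2, 1, range(3))  # per-row pass over all rows

# Hypothetical blocked-oh loop with an inclusive tile bound: tiles
# [0..2] and [2..2] both cover output row 2, so its contribution
# is accumulated twice.
tiles = [range(0, 3), range(2, 3)]  # buggy tiling: row 2 in both tiles
buggy = [0.0] * 3
for tile in tiles:
    part = bwdw_1d(src, ddst, 3, 2, 1, tile)
    buggy = [a + b for a, b in zip(buggy, part)]

print(ref)    # [2.0, 3.0, 3.0]
print(buggy)  # [3.0, 4.0, 4.0]  -- extra bottom-row contributions
```

The excess lands exactly on the contributions of the duplicated bottom row, matching the "extra bottom-row contributions" signature reported by the gtest.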

Validated workaround: removing the AArch64 JIT BWD_W (SVE-256) implementation from the CPU convolution list avoids the failure; the fallback path passes, as it already does on Neoverse N1 and Neoverse V2.

Expected behavior

Backward-weights results should match the naïve reference (and mkldnn-disabled PyTorch path) with zero elementwise diffs for these configs.

Additional notes

  • After applying the fix from PR #4081, PyTorch unit tests and nightly suite pass; the attached gtest shows new path == reference.
  • Toggle: ONEDNN_AARCH64_DW_BWDW_USE_OLD=1 (legacy) vs unset (new).

Attachments

Related PR

  • https://github.com/uxlfoundation/oneDNN/pull/4081

gassan-arm avatar Oct 13 '25 12:10 gassan-arm

Hi Gassan, if you could find the smallest reproducible shape and the actual values of the input tensors that fail in PyTorch, that would be very helpful.

Ryo-not-rio avatar Oct 13 '25 13:10 Ryo-not-rio

Hi @Ryo-not-rio, the smallest PyTorch failure case is the one above. However, it can also be replicated with C=8, H=W=6, S=2.

Attributes for kernel selection (JIT SVE-256 BWD_W):

  • C multiples of 8
  • H = W >= 6 (and multiples of 2)
  • Dilation=1
  • Stride>1

The input and weight values are ones in the test.
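The trigger conditions listed above can be written as a quick predicate. This is our paraphrase of the reported conditions, not oneDNN's actual dispatch logic:

```python
def hits_failing_case(c, h, w, stride, dilation):
    """True if a depthwise shape matches the reported jit_dw:sve_256
    BWD_W failure conditions: C a multiple of 8, square even spatial
    dims >= 6, no dilation, stride > 1."""
    return (c % 8 == 0 and h == w and h >= 6 and h % 2 == 0
            and dilation == 1 and stride > 1)

print(hits_failing_case(8, 6, 6, 2, 1))   # True  -- minimal repro
print(hits_failing_case(24, 6, 6, 2, 1))  # True  -- PyTorch test case
print(hits_failing_case(24, 6, 6, 1, 1))  # False -- stride 1 unaffected
```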

gassan-arm avatar Oct 13 '25 14:10 gassan-arm

-DDNNL_ENABLE_CONCURRENT_EXEC=ON

Is this needed to trigger your bug?

Sqvid avatar Oct 13 '25 14:10 Sqvid

@Sqvid no. I've just removed it.

gassan-arm avatar Oct 13 '25 15:10 gassan-arm

With C=8, H=W=6, S=2, could you post the output of printing the input and weight tensors in PyTorch? It would be good to get an idea of the dtype we're dealing with: floats, negative numbers, Infs, etc.

Ryo-not-rio avatar Oct 14 '25 10:10 Ryo-not-rio