Assertion failure in brgemm in debug build on G3 aarch64 machine

Open Ryo-not-rio opened this issue 1 year ago • 6 comments

Summary

When running ctest -R cpu-tutorials-matmul-matmul-quantization-cpp with a debug build on a G3 aarch64 machine, an assertion failure can be seen in brgemm_matmul_utils.cpp.

Version

v3.6.0 (commit fbb277db5a9c67fc646264f8ff68eb6f24f411fb)

Environment

oneDNN includes hardware-specific optimizations and may behave differently depending on the compiler and build environment. Include the following information to help reproduce the issue:

  • CPU make and model - aarch64
  • OS version - 22.04.1-Ubuntu
  • Compiler version - 11.4.0
  • CMake version 3.22.1
  • CMake output log
  • git hash - fbb277db5a9c67fc646264f8ff68eb6f24f411fb

Steps to reproduce

  1. Build in debug mode
  2. Run ctest -R cpu-tutorials-matmul-matmul-quantization-cpp

Observed behavior

The test fails with the following message:

cpu-tutorials-matmul-matmul-quantization-cpp: /home/ubuntu/oneDNN/src/cpu/aarch64/matmul/brgemm_matmul_utils.cpp:133: dnnl::impl::status_t dnnl::impl::cpu::aarch64::matmul::check_isa_with_datatype(dnnl::impl::cpu::aarch64::cpu_isa_t, const dnnl::impl::cpu::aarch64::matmul::brgemm_matmul_conf_utils_t&): Assertion `bm_conf_utils.is_f32()' failed.

Expected behavior

Test passes

Ryo-not-rio avatar May 16 '24 16:05 Ryo-not-rio

@jondea, @cfRod This is aarch64 platform specific issue. Can you please have a look into this? Thanks.

rupakroyintel avatar May 19 '24 01:05 rupakroyintel

Hi @rupakroyintel, this appears to be an issue from the JIT'ed path on AArch64. Linking @vineelabhinav as the author of the following PR https://github.com/oneapi-src/oneDNN/pull/1815/files that added brgemm.

cfRod avatar May 21 '24 12:05 cfRod

@vineelabhinav Can you please look into this issue? Thanks.

rupakroyintel avatar Jun 10 '24 05:06 rupakroyintel

Hi @Ryo-not-rio @cfRod @rupakroyintel, this is expected behaviour from the JIT side.

  • We have implemented brgemm only for the f32 data type.
  • When oneDNN is built in Release mode, it skips our implementation and executes the reference implementation instead. Therefore the test does not fail.
  • In Debug mode, however, oneDNN tries our brgemm implementation and, if that fails, it stops there and does not go on to the reference implementation. Therefore the test does not pass in Debug mode.

We intentionally added this assertion so that it falls back to the reference implementation when the f32 data type is not used.

Shreyas-fuj avatar Jun 19 '24 05:06 Shreyas-fuj
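The Debug/Release difference described in the comment above is consistent with how the standard `assert` macro works: it is active unless `NDEBUG` is defined, and CMake Release builds typically define `NDEBUG`. The sketch below only illustrates that mechanism. The names `data_type`, `status` and `check_f32_only` are hypothetical stand-ins, not the oneDNN definitions, and the actual code in brgemm_matmul_utils.cpp may be structured differently.

```cpp
#include <cassert>

// Hypothetical stand-ins for illustration only; not the oneDNN types.
enum class data_type { f32, s8 };
enum class status { success, unimplemented };

// An f32-only support check written with an assertion.
// Debug build (NDEBUG not defined): a non-f32 call aborts the process
// at the assert, so no other implementation is ever tried.
// Release build (NDEBUG defined): the assert expands to nothing, the
// function returns "unimplemented", and the caller is free to move on.
status check_f32_only(data_type dt) {
    assert(dt == data_type::f32);
    return dt == data_type::f32 ? status::success : status::unimplemented;
}
```

Under these assumptions, calling `check_f32_only(data_type::s8)` terminates a Debug build but quietly reports `status::unimplemented` in a Release build, which would match the behaviour reported in this issue.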

As far as possible, behavior in debug mode should match release mode, and assertions should not be expected in the test suite. Sorry if I've misunderstood, but is there a way to use normal flow control to fall back to reference even in debug mode rather than assert?

jondea avatar Jun 19 '24 07:06 jondea
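A minimal sketch of the kind of flow control asked about above, assuming a dispatcher that tries candidate implementations in order. All names here (`impl_entry`, `init_brgemm_like`, `init_reference`, `select_impl`) are hypothetical and chosen for illustration; this is not oneDNN's dispatch code and not necessarily how the issue was eventually fixed. The point is only that an unsupported data type is reported through a return value, so Debug and Release builds take the same path.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-ins for illustration only; not the oneDNN types.
enum class data_type { f32, s8 };
enum class status { success, unimplemented };

// One candidate implementation: a name plus an init function that
// reports whether it can handle the requested data type.
struct impl_entry {
    const char *name;
    status (*init)(data_type);
};

// f32-only candidate: an unsupported data type is an expected situation,
// so it is signalled with a status code rather than an assertion.
status init_brgemm_like(data_type dt) {
    if (dt != data_type::f32) return status::unimplemented;
    return status::success;
}

// Reference candidate: accepts every data type in this sketch.
status init_reference(data_type) {
    return status::success;
}

// Try candidates in priority order and return the first that initializes.
std::string select_impl(data_type dt) {
    static const impl_entry impls[] = {
            {"brgemm_like", init_brgemm_like},
            {"reference", init_reference},
    };
    for (const auto &entry : impls) {
        // A missing init pointer would be a programming error, which is
        // the kind of condition an assertion is suited for.
        assert(entry.init != nullptr);
        if (entry.init(dt) == status::success) return entry.name;
    }
    return "none";
}
```

With this structure, `select_impl(data_type::s8)` falls through to the reference entry in both build types, while `select_impl(data_type::f32)` picks the specialized one.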

Bumping this issue as it was very frustrating for me to discover this bug. It stops development in debug mode. For a while I thought it was an issue in the work I was doing, which wasted quite a bit of time until I found this issue. I agree with @jondea that logical behaviour in debug mode should match release mode.

theComputeKid avatar Jun 24 '24 10:06 theComputeKid

Hi, I have fixed this issue in the PR: https://github.com/oneapi-src/oneDNN/pull/1985. Please have a look.

Shreyas-fuj avatar Jul 09 '24 08:07 Shreyas-fuj