Large kernels can generate xbyak_aarch64 exceptions

Open Sqvid opened this issue 2 months ago • 6 comments

Summary

This is a follow-on from #4055. There exist several cases in the code where it is possible for very large kernels to be generated; this in turn can cause Xbyak_aarch64 to throw an exception when one tries to place a Label with a large jump address.

For example consider the following:

$ ./build/tests/benchdnn/benchdnn --conv --impl='brgconv:sve_128' --canonical=true --dt=bf16 --attr-post-ops=gelu_tanh g1ic64ih1000oc64oh1000kh3ph128dh127         

bad err=15 in Xbyak::Error
terminate called after throwing an instance of 'Xbyak_aarch64::Error'
  what():  illegal immediate parameter (range error)
zsh: abort (core dumped)

This gets thrown when the underlying CodeArray grows large and you try to place a label at the end. The Xbyak_aarch64::LabelManager tries to calculate a program-counter-relative address, and if the resulting immediate is too large to encode, the instruction is malformed.
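For reference, here is a minimal sketch of the range check behind this kind of error. The immediate widths are the ones from the A64 encodings for PC-relative branches; the helper name `fits_branch_imm` and the constants are made up for illustration and are not part of Xbyak_aarch64. Which instruction actually trips the check in the reproducer above has not been established in this issue.

```cpp
#include <cstdint>

// A64 branch immediates are signed, counted in 4-byte instruction units.
// Widths of the immediate field for the common PC-relative branches:
const int imm_bits_b = 26;      // B, BL:             +/-128 MiB
const int imm_bits_b_cond = 19; // B.cond, CBZ, CBNZ: +/-1 MiB
const int imm_bits_tbz = 14;    // TBZ, TBNZ:         +/-32 KiB

// Hypothetical helper: true if a byte offset from the branch to its target
// can be encoded in a signed immediate of `imm_bits` bits.
inline bool fits_branch_imm(std::int64_t offset_bytes, int imm_bits) {
    const std::int64_t limit = (std::int64_t {1} << (imm_bits - 1)) * 4;
    return offset_bytes % 4 == 0 && offset_bytes >= -limit
            && offset_bytes < limit;
}
```

Under these limits, a conditional branch to a label 1 MiB or more ahead cannot be encoded, whereas an unconditional `B` still has room up to 128 MiB.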

Since this error can pop up in a variety of scenarios, throws an uncaught exception, and is unrelated to any particular kernel, I think we really need to find a good way to address it.

cc: @vpirogov, @dzarukin, @jondea, @Shreyas-fuj

Environment

oneDNN includes hardware-specific optimizations and may behave differently depending on the compiler and build environment. Include the following information to help reproduce the issue:

  • CPU make and model: Neoverse V1 (flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm paca pacg dcpodp svei8mm svebf16 i8mm bf16 dgh rng)
  • git hash: bcc0ca00842739f54006d93917d807ea68616bc8

Steps to reproduce

$ ./build/tests/benchdnn/benchdnn --conv --impl='brgconv:sve_128' --canonical=true --dt=bf16 --attr-post-ops=gelu_tanh g1ic64ih1000oc64oh1000kh3ph128dh127         

Observed behavior

Xbyak_aarch64 exception thrown.

Expected behavior

Graceful error handling.

Sqvid avatar Oct 08 '25 13:10 Sqvid

@Shreyas-fuj @kasturedeeksha

This can happen with the 256-bit kernels too, just on larger shapes.

$ ./build/tests/benchdnn/benchdnn -v5 --conv --dt=bf16 --attr-post-ops=gelu_tanh+gelu_erf g1mb1ic1000ih1000iw1000oc1000oh1000ow1000kh2kw2sh1sw1ph0pw0dh0dw0
create: --conv --dt=bf16:bf16:bf16 --attr-post-ops=gelu_tanh+gelu_erf g1mb1ic1000ih1000oc1000oh1000kh2ph0
oneDNN implementation: brgconv:sve_256
bad err=15 in Xbyak::Error
terminate called after throwing an instance of 'Xbyak_aarch64::Error'
  what():  illegal immediate parameter (range error)
zsh: abort (core dumped)  ./build/tests/benchdnn/benchdnn -v5 --conv --dt=bf16  

Sqvid avatar Oct 10 '25 10:10 Sqvid

This looks like Xbyak is hitting AArch64 ISA limitations. @kawakami-k, what's your take here?

You probably want to adjust the kernel generator to avoid generating large kernels anyway. Large kernels waste memory and, at least on x64, put pressure on the instruction cache.

vpirogov avatar Oct 21 '25 17:10 vpirogov

For instance, https://github.com/uxlfoundation/oneDNN/issues/2007 looks like a case where a huge kernel causes issues.

vpirogov avatar Oct 21 '25 18:10 vpirogov

I myself would like to resolve this issue, but it is quite difficult for me to find the time. I have shared the issue with members of other departments at Fujitsu.

kawakami-k avatar Oct 21 '25 22:10 kawakami-k

I see a couple of possible avenues for fixes, but none of them feel great.

  • Create a label function which checks the code size, and if it's too big, uses a temporary register to calculate the offset.
  • Just do the work to make the kernels smaller, or return unimplemented in pathological cases. This could be brittle, but in practice it would probably work and be the most performant.
  • Catch the error at primitive init time and run ref instead of crashing. This would also be a lot of scaffolding.
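To make the first option a bit more concrete, here is a rough sketch of the decision such a label helper would have to make for a conditional branch. Everything here is hypothetical (`branch_kind`, `pick_cond_branch`), and it only picks a strategy from the distance; it does not emit code or touch Xbyak_aarch64. The thresholds are the A64 ranges for `B.cond` (1 MiB) and `B` (128 MiB).

```cpp
#include <cstdint>

// Hypothetical strategies for reaching a label from a conditional branch.
enum class branch_kind {
    direct_cond,   // b.<cond> target
    inverted_skip, // b.<inverted cond> skip; b target; skip:
    via_register   // build the target address in a temp register, then br
};

// Pick the cheapest strategy that can encode a jump of `offset_bytes`.
// Simplification: ignores the extra 4 bytes that the inverted branch adds
// in front of the unconditional one.
inline branch_kind pick_cond_branch(std::int64_t offset_bytes) {
    const std::int64_t cond_range = std::int64_t {1} << 20;   // 1 MiB
    const std::int64_t uncond_range = std::int64_t {1} << 27; // 128 MiB
    if (offset_bytes >= -cond_range && offset_bytes < cond_range)
        return branch_kind::direct_cond;
    if (offset_bytes >= -uncond_range && offset_bytes < uncond_range)
        return branch_kind::inverted_skip;
    return branch_kind::via_register;
}
```

The catch for forward labels is that the distance is not known when the branch is emitted, so a real helper would either have to assume the worst case or patch the sequence afterwards.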

jondea avatar Oct 22 '25 08:10 jondea

  • Create a label function which checks the code size, and if it's too big, uses a temporary register to calculate the offset.

By the way, how big does the kernel have to be to exceed the offset range?

  • Catch the error at primitive init time and run ref instead of crashing. This would also be a lot of scaffolding.

This one would be tricky. The specific implementation is picked by the primitive descriptor before any code is generated, so when code generation fails during primitive creation it's too late to fall back.
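Even without a fallback, the failure could at least be reported as a status instead of terminating the process. A minimal sketch of that shape, using stand-in names only (`status`, `generate_kernel`, `create_primitive` and the 1 MiB limit are all made up here and are not oneDNN or Xbyak_aarch64 code):

```cpp
#include <cstddef>
#include <stdexcept>

// Stand-in status codes for this sketch.
enum class status { success, runtime_error };

// Stand-in for a JIT generator that throws when the code grows past what
// a label offset can encode. The limit is an assumption for illustration.
inline void generate_kernel(std::size_t code_bytes) {
    const std::size_t limit = std::size_t {1} << 20;
    if (code_bytes > limit)
        throw std::range_error("illegal immediate parameter (range error)");
}

// Creation step: the implementation is already chosen at this point, so
// the only graceful option left is to turn the exception into a status.
inline status create_primitive(std::size_t code_bytes) noexcept {
    try {
        generate_kernel(code_bytes);
        return status::success;
    } catch (const std::exception &) {
        return status::runtime_error;
    }
}
```

The caller then sees a failed creation rather than an abort, which matches the "graceful error handling" asked for in the issue, but it still does not run the shape on another implementation.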

vpirogov avatar Oct 22 '25 15:10 vpirogov