[SYCL] Add support for soft_max ALiBi
This PR updates the SYCL implementation of softmax to support ALiBi. The implementation is ported from the existing CUDA implementation.
This has been tested on the A100 GPU and all softmax tests are passing with this change.
@NeoZhangJianyu, @abhilash1910, @Alcpz, please review
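For reviewers who want a plain reference to compare against, here is a minimal CPU sketch of softmax with an ALiBi bias. It is not the code in this PR. The helper names `alibi_slope` and `soft_max_row`, the `pos` array, and the exact form of the bias term are assumptions for illustration, based on my reading of the CUDA kernel and the slope schedule from the ALiBi paper.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Per-head ALiBi slope (hypothetical helper, not an existing ggml function).
// Assumption: slopes form a geometric sequence controlled by max_bias, with a
// second sequence used for heads beyond the largest power of two <= n_head.
static float alibi_slope(uint32_t head, uint32_t n_head, float max_bias) {
    if (max_bias <= 0.0f) {
        return 0.0f;  // ALiBi disabled: no positional bias is added
    }
    const uint32_t n_head_log2 = 1u << (uint32_t) std::floor(std::log2((float) n_head));
    const float m0 = std::pow(2.0f, -max_bias / n_head_log2);
    const float m1 = std::pow(2.0f, -(max_bias / 2.0f) / n_head_log2);
    const float base = head < n_head_log2 ? m0 : m1;
    const int exponent = head < n_head_log2 ? (int) head + 1
                                            : 2 * (int) (head - n_head_log2) + 1;
    return std::pow(base, (float) exponent);
}

// Softmax over a single row (hypothetical helper).
// mask and pos are optional (may be nullptr) and must hold x.size() values.
static std::vector<float> soft_max_row(const std::vector<float> & x,
                                       const float * mask, const float * pos,
                                       float scale, float slope) {
    std::vector<float> y(x.size());
    float max_val = -INFINITY;
    for (size_t i = 0; i < x.size(); ++i) {
        // scaled input + optional mask + optional ALiBi positional bias
        y[i] = x[i] * scale + (mask ? mask[i] : 0.0f) + (pos ? slope * pos[i] : 0.0f);
        max_val = std::max(max_val, y[i]);
    }
    float sum = 0.0f;
    for (size_t i = 0; i < x.size(); ++i) {
        y[i] = std::exp(y[i] - max_val);  // subtract the row max for numerical stability
        sum += y[i];
    }
    for (float & v : y) {
        v /= sum;
    }
    return y;
}
```

With `max_bias = 0` the slope is zero, so the function reduces to a plain scaled and masked softmax, which corresponds to the `max_bias=0.000000` test cases.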
Thanks @AidanBeltonS, I was interested in this feature. I will review this. Tagging: https://github.com/ggerganov/llama.cpp/pull/5488
Not sure whether this is expected: all tests passed on the GPU MAX1100, but some fail on the MTL iGPU (f16 off, Win11, Intel(R) oneAPI DPC++/C++ Compiler 2024.0.2 (2024.0.2.20231213)):
C:\Users\gta\Documents\llama.cpp\build>bin\test-backend-ops.exe test -b SYCL0 -o SOFT_MAX
Testing 5 backends
Backend 1/5 (CPU)
Skipping
Backend 2/5 (SYCL0)
GGML_SYCL_DEBUG=0
ggml_init_sycl: GGML_SYCL_F16: no
ggml_init_sycl: SYCL_USE_XMX: yes
found 4 SYCL devices:
Device 0: Intel(R) Graphics i gfx-driver-ci-master-15876 DCH-I RI, compute capability 1.3,
max compute_units 128, max work group size 1024, max sub group size 32, global mem size 2635530240
Device 1: Intel(R) Graphics i gfx-driver-ci-master-15876 DCH-I RI, compute capability 3.0,
max compute_units 128, max work group size 1024, max sub group size 32, global mem size 2635530240
Device 2: Intel(R) Core(TM) Ultra 7 1003H, compute capability 3.0,
max compute_units 22, max work group size 8192, max sub group size 64, global mem size 3961380864
Device 3: Intel(R) FPGA Emulation Device, compute capability 1.2,
max compute_units 22, max work group size 67108864, max sub group size 64, global mem size 3961380864
Using device 0 (Intel(R) Graphics i gfx-driver-ci-master-15876 DCH-I RI) as main device
Backend name: SYCL
SOFT_MAX(type=f32,ne=[16,16,1,1],mask=0,scale=1.000000,max_bias=0.000000): ←[1;32mOK←[0m
SOFT_MAX(type=f32,ne=[15,15,1,1],mask=0,scale=1.000000,max_bias=0.000000): ←[1;32mOK←[0m
SOFT_MAX(type=f32,ne=[16,1024,1,1],mask=0,scale=1.000000,max_bias=0.000000): ←[1;32mOK←[0m
SOFT_MAX(type=f32,ne=[15,1023,1,1],mask=0,scale=1.000000,max_bias=0.000000): ←[1;32mOK←[0m
SOFT_MAX(type=f32,ne=[1024,16,1,1],mask=0,scale=1.000000,max_bias=0.000000): ←[1;32mOK←[0m
SOFT_MAX(type=f32,ne=[1023,15,1,1],mask=0,scale=1.000000,max_bias=0.000000): ←[1;32mOK←[0m
SOFT_MAX(type=f32,ne=[1024,1024,1,1],mask=0,scale=1.000000,max_bias=0.000000): [SOFT_MAX] NMSE = 0.000411710 > 0.000000100 ←[1;31mFAIL←[0m
SOFT_MAX(type=f32,ne=[1023,1023,1,1],mask=0,scale=1.000000,max_bias=0.000000): [SOFT_MAX] NMSE = 0.004343327 > 0.000000100 ←[1;31mFAIL←[0m
SOFT_MAX(type=f32,ne=[16,16,1,1],mask=0,scale=0.100000,max_bias=0.000000): ←[1;32mOK←[0m
SOFT_MAX(type=f32,ne=[15,15,1,1],mask=0,scale=0.100000,max_bias=0.000000): ←[1;32mOK←[0m
SOFT_MAX(type=f32,ne=[16,1024,1,1],mask=0,scale=0.100000,max_bias=0.000000): ←[1;32mOK←[0m
SOFT_MAX(type=f32,ne=[15,1023,1,1],mask=0,scale=0.100000,max_bias=0.000000): ←[1;32mOK←[0m
SOFT_MAX(type=f32,ne=[1024,16,1,1],mask=0,scale=0.100000,max_bias=0.000000): ←[1;32mOK←[0m
SOFT_MAX(type=f32,ne=[1023,15,1,1],mask=0,scale=0.100000,max_bias=0.000000): ←[1;32mOK←[0m
SOFT_MAX(type=f32,ne=[1024,1024,1,1],mask=0,scale=0.100000,max_bias=0.000000): [SOFT_MAX] NMSE = 0.000009382 > 0.000000100 ←[1;31mFAIL←[0m
SOFT_MAX(type=f32,ne=[1023,1023,1,1],mask=0,scale=0.100000,max_bias=0.000000): [SOFT_MAX] NMSE = 0.000014476 > 0.000000100 ←[1;31mFAIL←[0m
SOFT_MAX(type=f32,ne=[16,16,1,1],mask=0,scale=1.000000,max_bias=8.000000): ←[1;32mOK←[0m
SOFT_MAX(type=f32,ne=[15,15,1,1],mask=0,scale=1.000000,max_bias=8.000000): ←[1;32mOK←[0m
SOFT_MAX(type=f32,ne=[16,1024,1,1],mask=0,scale=1.000000,max_bias=8.000000): ←[1;32mOK←[0m
SOFT_MAX(type=f32,ne=[15,1023,1,1],mask=0,scale=1.000000,max_bias=8.000000): ←[1;32mOK←[0m
SOFT_MAX(type=f32,ne=[1024,16,1,1],mask=0,scale=1.000000,max_bias=8.000000): ←[1;32mOK←[0m
SOFT_MAX(type=f32,ne=[1023,15,1,1],mask=0,scale=1.000000,max_bias=8.000000): ←[1;32mOK←[0m
SOFT_MAX(type=f32,ne=[1024,1024,1,1],mask=0,scale=1.000000,max_bias=8.000000): [SOFT_MAX] NMSE = 0.000759554 > 0.000000100 ←[1;31mFAIL←[0m
SOFT_MAX(type=f32,ne=[1023,1023,1,1],mask=0,scale=1.000000,max_bias=8.000000): [SOFT_MAX] NMSE = 0.001201117 > 0.000000100 ←[1;31mFAIL←[0m
SOFT_MAX(type=f32,ne=[16,16,1,1],mask=0,scale=0.100000,max_bias=8.000000): ←[1;32mOK←[0m
SOFT_MAX(type=f32,ne=[15,15,1,1],mask=0,scale=0.100000,max_bias=8.000000): ←[1;32mOK←[0m
SOFT_MAX(type=f32,ne=[16,1024,1,1],mask=0,scale=0.100000,max_bias=8.000000): ←[1;32mOK←[0m
SOFT_MAX(type=f32,ne=[15,1023,1,1],mask=0,scale=0.100000,max_bias=8.000000): ←[1;32mOK←[0m
SOFT_MAX(type=f32,ne=[1024,16,1,1],mask=0,scale=0.100000,max_bias=8.000000): ←[1;32mOK←[0m
SOFT_MAX(type=f32,ne=[1023,15,1,1],mask=0,scale=0.100000,max_bias=8.000000): ←[1;32mOK←[0m
SOFT_MAX(type=f32,ne=[1024,1024,1,1],mask=0,scale=0.100000,max_bias=8.000000): [SOFT_MAX] NMSE = 0.000001676 > 0.000000100 ←[1;31mFAIL←[0m
SOFT_MAX(type=f32,ne=[1023,1023,1,1],mask=0,scale=0.100000,max_bias=8.000000): [SOFT_MAX] NMSE = 0.000009981 > 0.000000100 ←[1;31mFAIL←[0m
SOFT_MAX(type=f32,ne=[16,16,1,1],mask=1,scale=1.000000,max_bias=0.000000): ←[1;32mOK←[0m
SOFT_MAX(type=f32,ne=[15,15,1,1],mask=1,scale=1.000000,max_bias=0.000000): ←[1;32mOK←[0m
SOFT_MAX(type=f32,ne=[16,1024,1,1],mask=1,scale=1.000000,max_bias=0.000000): ←[1;32mOK←[0m
SOFT_MAX(type=f32,ne=[15,1023,1,1],mask=1,scale=1.000000,max_bias=0.000000): ←[1;32mOK←[0m
SOFT_MAX(type=f32,ne=[1024,16,1,1],mask=1,scale=1.000000,max_bias=0.000000): ←[1;32mOK←[0m
SOFT_MAX(type=f32,ne=[1023,15,1,1],mask=1,scale=1.000000,max_bias=0.000000): ←[1;32mOK←[0m
SOFT_MAX(type=f32,ne=[1024,1024,1,1],mask=1,scale=1.000000,max_bias=0.000000): [SOFT_MAX] NMSE = 0.012946110 > 0.000000100 ←[1;31mFAIL←[0m
SOFT_MAX(type=f32,ne=[1023,1023,1,1],mask=1,scale=1.000000,max_bias=0.000000): [SOFT_MAX] NMSE = 0.017667745 > 0.000000100 ←[1;31mFAIL←[0m
SOFT_MAX(type=f32,ne=[16,16,1,1],mask=1,scale=0.100000,max_bias=0.000000): ←[1;32mOK←[0m
SOFT_MAX(type=f32,ne=[15,15,1,1],mask=1,scale=0.100000,max_bias=0.000000): ←[1;32mOK←[0m
SOFT_MAX(type=f32,ne=[16,1024,1,1],mask=1,scale=0.100000,max_bias=0.000000): ←[1;32mOK←[0m
SOFT_MAX(type=f32,ne=[15,1023,1,1],mask=1,scale=0.100000,max_bias=0.000000): ←[1;32mOK←[0m
SOFT_MAX(type=f32,ne=[1024,16,1,1],mask=1,scale=0.100000,max_bias=0.000000): ←[1;32mOK←[0m
SOFT_MAX(type=f32,ne=[1023,15,1,1],mask=1,scale=0.100000,max_bias=0.000000): ←[1;32mOK←[0m
SOFT_MAX(type=f32,ne=[1024,1024,1,1],mask=1,scale=0.100000,max_bias=0.000000): [SOFT_MAX] NMSE = 0.001769479 > 0.000000100 ←[1;31mFAIL←[0m
SOFT_MAX(type=f32,ne=[1023,1023,1,1],mask=1,scale=0.100000,max_bias=0.000000): [SOFT_MAX] NMSE = 0.004120808 > 0.000000100 ←[1;31mFAIL←[0m
SOFT_MAX(type=f32,ne=[16,16,1,1],mask=1,scale=1.000000,max_bias=8.000000): ←[1;32mOK←[0m
SOFT_MAX(type=f32,ne=[15,15,1,1],mask=1,scale=1.000000,max_bias=8.000000): ←[1;32mOK←[0m
SOFT_MAX(type=f32,ne=[16,1024,1,1],mask=1,scale=1.000000,max_bias=8.000000): ←[1;32mOK←[0m
SOFT_MAX(type=f32,ne=[15,1023,1,1],mask=1,scale=1.000000,max_bias=8.000000): ←[1;32mOK←[0m
SOFT_MAX(type=f32,ne=[1024,16,1,1],mask=1,scale=1.000000,max_bias=8.000000): ←[1;32mOK←[0m
SOFT_MAX(type=f32,ne=[1023,15,1,1],mask=1,scale=1.000000,max_bias=8.000000): ←[1;32mOK←[0m
SOFT_MAX(type=f32,ne=[1024,1024,1,1],mask=1,scale=1.000000,max_bias=8.000000): [SOFT_MAX] NMSE = 0.009613720 > 0.000000100 ←[1;31mFAIL←[0m
SOFT_MAX(type=f32,ne=[1023,1023,1,1],mask=1,scale=1.000000,max_bias=8.000000): [SOFT_MAX] NMSE = 0.006988272 > 0.000000100 ←[1;31mFAIL←[0m
SOFT_MAX(type=f32,ne=[16,16,1,1],mask=1,scale=0.100000,max_bias=8.000000): ←[1;32mOK←[0m
SOFT_MAX(type=f32,ne=[15,15,1,1],mask=1,scale=0.100000,max_bias=8.000000): ←[1;32mOK←[0m
SOFT_MAX(type=f32,ne=[16,1024,1,1],mask=1,scale=0.100000,max_bias=8.000000): ←[1;32mOK←[0m
SOFT_MAX(type=f32,ne=[15,1023,1,1],mask=1,scale=0.100000,max_bias=8.000000): ←[1;32mOK←[0m
SOFT_MAX(type=f32,ne=[1024,16,1,1],mask=1,scale=0.100000,max_bias=8.000000): ←[1;32mOK←[0m
SOFT_MAX(type=f32,ne=[1023,15,1,1],mask=1,scale=0.100000,max_bias=8.000000): ←[1;32mOK←[0m
SOFT_MAX(type=f32,ne=[1024,1024,1,1],mask=1,scale=0.100000,max_bias=8.000000): [SOFT_MAX] NMSE = 0.000894425 > 0.000000100 ←[1;31mFAIL←[0m
SOFT_MAX(type=f32,ne=[1023,1023,1,1],mask=1,scale=0.100000,max_bias=8.000000): [SOFT_MAX] NMSE = 0.002199995 > 0.000000100 ←[1;31mFAIL←[0m
SOFT_MAX(type=f32,ne=[16,2,32,1],mask=0,scale=0.100000,max_bias=0.000000): ←[1;32mOK←[0m
SOFT_MAX(type=f32,ne=[32,2,32,1],mask=1,scale=0.100000,max_bias=0.000000): ←[1;32mOK←[0m
SOFT_MAX(type=f32,ne=[16,2,32,1],mask=0,scale=0.100000,max_bias=8.000000): ←[1;32mOK←[0m
SOFT_MAX(type=f32,ne=[32,2,32,1],mask=1,scale=0.100000,max_bias=8.000000): ←[1;32mOK←[0m
1391/1407 tests passed
Backend SYCL: ←[1;31mFAIL←[0m
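For context on the `NMSE = ... > 0.000000100` lines in the log: the check compares the backend output against a reference output using a normalized mean squared error. A minimal sketch of that metric, assuming it is the squared error divided by the squared magnitude of the reference (the function below is illustrative and may differ in detail from the test harness):

```cpp
#include <cstddef>

// Normalized mean squared error of `b` relative to the reference `a`:
//   sum((a[i] - b[i])^2) / sum(a[i]^2)
// Accumulates in double to limit rounding error in the metric itself.
static double nmse(const float * a, const float * b, size_t n) {
    double err = 0.0;  // squared error between a and b
    double ref = 0.0;  // squared magnitude of the reference
    for (size_t i = 0; i < n; ++i) {
        const double d = (double) a[i] - (double) b[i];
        err += d * d;
        ref += (double) a[i] * (double) a[i];
    }
    return err / ref;
}
```

A case is reported as FAIL when this value exceeds the threshold shown in the log (1e-7 for SOFT_MAX here).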
Thanks for testing. Do these tests pass on TIP without my changes?
Oh, there are even more failures on TIP without your changes. Do you have enough bandwidth to take a look?
C:\Users\gta\source\repos\llama.cpp\build>bin\test-backend-ops.exe test -b SYCL0 -o SOFT_MAX
Testing 5 backends
Backend 1/5 (CPU)
Skipping
Backend 2/5 (SYCL0)
GGML_SYCL_DEBUG=0
ggml_init_sycl: GGML_SYCL_F16: no
ggml_init_sycl: SYCL_USE_XMX: yes
found 4 SYCL devices:
Device 0: Intel(R) Graphics i gfx-driver-ci-master-15876 DCH-I RI, compute capability 1.3,
max compute_units 128, max work group size 1024, max sub group size 32, global mem size 2635530240
Device 1: Intel(R) Graphics i gfx-driver-ci-master-15876 DCH-I RI, compute capability 3.0,
max compute_units 128, max work group size 1024, max sub group size 32, global mem size 2635530240
Device 2: Intel(R) Core(TM) Ultra 7 1003H, compute capability 3.0,
max compute_units 22, max work group size 8192, max sub group size 64, global mem size 3961380864
Device 3: Intel(R) FPGA Emulation Device, compute capability 1.2,
max compute_units 22, max work group size 67108864, max sub group size 64, global mem size 3961380864
Using device 0 (Intel(R) Graphics i gfx-driver-ci-master-15876 DCH-I RI) as main device
Backend name: SYCL
SOFT_MAX(type=f32,ne=[16,16,1,1],mask=0,scale=1.000000,max_bias=0.000000): ←[1;32mOK←[0m
SOFT_MAX(type=f32,ne=[15,15,1,1],mask=0,scale=1.000000,max_bias=0.000000): ←[1;32mOK←[0m
SOFT_MAX(type=f32,ne=[16,1024,1,1],mask=0,scale=1.000000,max_bias=0.000000): ←[1;32mOK←[0m
SOFT_MAX(type=f32,ne=[15,1023,1,1],mask=0,scale=1.000000,max_bias=0.000000): ←[1;32mOK←[0m
SOFT_MAX(type=f32,ne=[1024,16,1,1],mask=0,scale=1.000000,max_bias=0.000000): ←[1;32mOK←[0m
SOFT_MAX(type=f32,ne=[1023,15,1,1],mask=0,scale=1.000000,max_bias=0.000000): ←[1;32mOK←[0m
SOFT_MAX(type=f32,ne=[1024,1024,1,1],mask=0,scale=1.000000,max_bias=0.000000): [SOFT_MAX] NMSE = 0.001620635 > 0.000000100 ←[1;31mFAIL←[0m
SOFT_MAX(type=f32,ne=[1023,1023,1,1],mask=0,scale=1.000000,max_bias=0.000000): [SOFT_MAX] NMSE = 0.001864620 > 0.000000100 ←[1;31mFAIL←[0m
SOFT_MAX(type=f32,ne=[16,16,1,1],mask=0,scale=0.100000,max_bias=0.000000): ←[1;32mOK←[0m
SOFT_MAX(type=f32,ne=[15,15,1,1],mask=0,scale=0.100000,max_bias=0.000000): ←[1;32mOK←[0m
SOFT_MAX(type=f32,ne=[16,1024,1,1],mask=0,scale=0.100000,max_bias=0.000000): ←[1;32mOK←[0m
SOFT_MAX(type=f32,ne=[15,1023,1,1],mask=0,scale=0.100000,max_bias=0.000000): ←[1;32mOK←[0m
SOFT_MAX(type=f32,ne=[1024,16,1,1],mask=0,scale=0.100000,max_bias=0.000000): ←[1;32mOK←[0m
SOFT_MAX(type=f32,ne=[1023,15,1,1],mask=0,scale=0.100000,max_bias=0.000000): ←[1;32mOK←[0m
SOFT_MAX(type=f32,ne=[1024,1024,1,1],mask=0,scale=0.100000,max_bias=0.000000): [SOFT_MAX] NMSE = 0.000008599 > 0.000000100 ←[1;31mFAIL←[0m
SOFT_MAX(type=f32,ne=[1023,1023,1,1],mask=0,scale=0.100000,max_bias=0.000000): [SOFT_MAX] NMSE = 0.000005020 > 0.000000100 ←[1;31mFAIL←[0m
SOFT_MAX(type=f32,ne=[16,16,1,1],mask=0,scale=1.000000,max_bias=8.000000): [SOFT_MAX] NMSE = 0.000003769 > 0.000000100 ←[1;31mFAIL←[0m
SOFT_MAX(type=f32,ne=[15,15,1,1],mask=0,scale=1.000000,max_bias=8.000000): [SOFT_MAX] NMSE = 0.000004918 > 0.000000100 ←[1;31mFAIL←[0m
SOFT_MAX(type=f32,ne=[16,1024,1,1],mask=0,scale=1.000000,max_bias=8.000000): [SOFT_MAX] NMSE = 0.000004890 > 0.000000100 ←[1;31mFAIL←[0m
SOFT_MAX(type=f32,ne=[15,1023,1,1],mask=0,scale=1.000000,max_bias=8.000000): [SOFT_MAX] NMSE = 0.000004527 > 0.000000100 ←[1;31mFAIL←[0m
SOFT_MAX(type=f32,ne=[1024,16,1,1],mask=0,scale=1.000000,max_bias=8.000000): [SOFT_MAX] NMSE = 0.000005075 > 0.000000100 ←[1;31mFAIL←[0m
SOFT_MAX(type=f32,ne=[1023,15,1,1],mask=0,scale=1.000000,max_bias=8.000000): [SOFT_MAX] NMSE = 0.000005348 > 0.000000100 ←[1;31mFAIL←[0m
SOFT_MAX(type=f32,ne=[1024,1024,1,1],mask=0,scale=1.000000,max_bias=8.000000): [SOFT_MAX] NMSE = 0.002153705 > 0.000000100 ←[1;31mFAIL←[0m
SOFT_MAX(type=f32,ne=[1023,1023,1,1],mask=0,scale=1.000000,max_bias=8.000000): [SOFT_MAX] NMSE = 0.000791032 > 0.000000100 ←[1;31mFAIL←[0m
SOFT_MAX(type=f32,ne=[16,16,1,1],mask=0,scale=0.100000,max_bias=8.000000): [SOFT_MAX] NMSE = 0.000005779 > 0.000000100 ←[1;31mFAIL←[0m
SOFT_MAX(type=f32,ne=[15,15,1,1],mask=0,scale=0.100000,max_bias=8.000000): [SOFT_MAX] NMSE = 0.000004213 > 0.000000100 ←[1;31mFAIL←[0m
SOFT_MAX(type=f32,ne=[16,1024,1,1],mask=0,scale=0.100000,max_bias=8.000000): [SOFT_MAX] NMSE = 0.000003770 > 0.000000100 ←[1;31mFAIL←[0m
SOFT_MAX(type=f32,ne=[15,1023,1,1],mask=0,scale=0.100000,max_bias=8.000000): [SOFT_MAX] NMSE = 0.000006002 > 0.000000100 ←[1;31mFAIL←[0m
SOFT_MAX(type=f32,ne=[1024,16,1,1],mask=0,scale=0.100000,max_bias=8.000000): [SOFT_MAX] NMSE = 0.000004923 > 0.000000100 ←[1;31mFAIL←[0m
SOFT_MAX(type=f32,ne=[1023,15,1,1],mask=0,scale=0.100000,max_bias=8.000000): [SOFT_MAX] NMSE = 0.000005270 > 0.000000100 ←[1;31mFAIL←[0m
SOFT_MAX(type=f32,ne=[1024,1024,1,1],mask=0,scale=0.100000,max_bias=8.000000): [SOFT_MAX] NMSE = 0.000017064 > 0.000000100 ←[1;31mFAIL←[0m
SOFT_MAX(type=f32,ne=[1023,1023,1,1],mask=0,scale=0.100000,max_bias=8.000000): [SOFT_MAX] NMSE = 0.000010618 > 0.000000100 ←[1;31mFAIL←[0m
SOFT_MAX(type=f32,ne=[16,16,1,1],mask=1,scale=1.000000,max_bias=0.000000): ←[1;32mOK←[0m
SOFT_MAX(type=f32,ne=[15,15,1,1],mask=1,scale=1.000000,max_bias=0.000000): ←[1;32mOK←[0m
SOFT_MAX(type=f32,ne=[16,1024,1,1],mask=1,scale=1.000000,max_bias=0.000000): ←[1;32mOK←[0m
SOFT_MAX(type=f32,ne=[15,1023,1,1],mask=1,scale=1.000000,max_bias=0.000000): ←[1;32mOK←[0m
SOFT_MAX(type=f32,ne=[1024,16,1,1],mask=1,scale=1.000000,max_bias=0.000000): ←[1;32mOK←[0m
SOFT_MAX(type=f32,ne=[1023,15,1,1],mask=1,scale=1.000000,max_bias=0.000000): ←[1;32mOK←[0m
SOFT_MAX(type=f32,ne=[1024,1024,1,1],mask=1,scale=1.000000,max_bias=0.000000): [SOFT_MAX] NMSE = 0.028220173 > 0.000000100 ←[1;31mFAIL←[0m
SOFT_MAX(type=f32,ne=[1023,1023,1,1],mask=1,scale=1.000000,max_bias=0.000000): [SOFT_MAX] NMSE = 0.010782086 > 0.000000100 ←[1;31mFAIL←[0m
SOFT_MAX(type=f32,ne=[16,16,1,1],mask=1,scale=0.100000,max_bias=0.000000): ←[1;32mOK←[0m
SOFT_MAX(type=f32,ne=[15,15,1,1],mask=1,scale=0.100000,max_bias=0.000000): ←[1;32mOK←[0m
SOFT_MAX(type=f32,ne=[16,1024,1,1],mask=1,scale=0.100000,max_bias=0.000000): ←[1;32mOK←[0m
SOFT_MAX(type=f32,ne=[15,1023,1,1],mask=1,scale=0.100000,max_bias=0.000000): ←[1;32mOK←[0m
SOFT_MAX(type=f32,ne=[1024,16,1,1],mask=1,scale=0.100000,max_bias=0.000000): ←[1;32mOK←[0m
SOFT_MAX(type=f32,ne=[1023,15,1,1],mask=1,scale=0.100000,max_bias=0.000000): ←[1;32mOK←[0m
SOFT_MAX(type=f32,ne=[1024,1024,1,1],mask=1,scale=0.100000,max_bias=0.000000): [SOFT_MAX] NMSE = 0.005262445 > 0.000000100 ←[1;31mFAIL←[0m
SOFT_MAX(type=f32,ne=[1023,1023,1,1],mask=1,scale=0.100000,max_bias=0.000000): [SOFT_MAX] NMSE = 0.004439145 > 0.000000100 ←[1;31mFAIL←[0m
SOFT_MAX(type=f32,ne=[16,16,1,1],mask=1,scale=1.000000,max_bias=8.000000): [SOFT_MAX] NMSE = 0.000004765 > 0.000000100 ←[1;31mFAIL←[0m
SOFT_MAX(type=f32,ne=[15,15,1,1],mask=1,scale=1.000000,max_bias=8.000000): [SOFT_MAX] NMSE = 0.000003865 > 0.000000100 ←[1;31mFAIL←[0m
SOFT_MAX(type=f32,ne=[16,1024,1,1],mask=1,scale=1.000000,max_bias=8.000000): [SOFT_MAX] NMSE = 0.000003030 > 0.000000100 ←[1;31mFAIL←[0m
SOFT_MAX(type=f32,ne=[15,1023,1,1],mask=1,scale=1.000000,max_bias=8.000000): [SOFT_MAX] NMSE = 0.000004553 > 0.000000100 ←[1;31mFAIL←[0m
SOFT_MAX(type=f32,ne=[1024,16,1,1],mask=1,scale=1.000000,max_bias=8.000000): [SOFT_MAX] NMSE = 0.000005057 > 0.000000100 ←[1;31mFAIL←[0m
SOFT_MAX(type=f32,ne=[1023,15,1,1],mask=1,scale=1.000000,max_bias=8.000000): [SOFT_MAX] NMSE = 0.000004978 > 0.000000100 ←[1;31mFAIL←[0m
SOFT_MAX(type=f32,ne=[1024,1024,1,1],mask=1,scale=1.000000,max_bias=8.000000): [SOFT_MAX] NMSE = 0.031724865 > 0.000000100 ←[1;31mFAIL←[0m
SOFT_MAX(type=f32,ne=[1023,1023,1,1],mask=1,scale=1.000000,max_bias=8.000000): [SOFT_MAX] NMSE = 0.016813136 > 0.000000100 ←[1;31mFAIL←[0m
SOFT_MAX(type=f32,ne=[16,16,1,1],mask=1,scale=0.100000,max_bias=8.000000): [SOFT_MAX] NMSE = 0.000004357 > 0.000000100 ←[1;31mFAIL←[0m
SOFT_MAX(type=f32,ne=[15,15,1,1],mask=1,scale=0.100000,max_bias=8.000000): [SOFT_MAX] NMSE = 0.000004029 > 0.000000100 ←[1;31mFAIL←[0m
SOFT_MAX(type=f32,ne=[16,1024,1,1],mask=1,scale=0.100000,max_bias=8.000000): [SOFT_MAX] NMSE = 0.000003563 > 0.000000100 ←[1;31mFAIL←[0m
SOFT_MAX(type=f32,ne=[15,1023,1,1],mask=1,scale=0.100000,max_bias=8.000000): [SOFT_MAX] NMSE = 0.000005769 > 0.000000100 ←[1;31mFAIL←[0m
SOFT_MAX(type=f32,ne=[1024,16,1,1],mask=1,scale=0.100000,max_bias=8.000000): [SOFT_MAX] NMSE = 0.000005256 > 0.000000100 ←[1;31mFAIL←[0m
SOFT_MAX(type=f32,ne=[1023,15,1,1],mask=1,scale=0.100000,max_bias=8.000000): [SOFT_MAX] NMSE = 0.000005167 > 0.000000100 ←[1;31mFAIL←[0m
SOFT_MAX(type=f32,ne=[1024,1024,1,1],mask=1,scale=0.100000,max_bias=8.000000): [SOFT_MAX] NMSE = 0.005744414 > 0.000000100 ←[1;31mFAIL←[0m
SOFT_MAX(type=f32,ne=[1023,1023,1,1],mask=1,scale=0.100000,max_bias=8.000000): [SOFT_MAX] NMSE = 0.002133673 > 0.000000100 ←[1;31mFAIL←[0m
SOFT_MAX(type=f32,ne=[16,2,32,1],mask=0,scale=0.100000,max_bias=0.000000): ←[1;32mOK←[0m
SOFT_MAX(type=f32,ne=[32,2,32,1],mask=1,scale=0.100000,max_bias=0.000000): ←[1;32mOK←[0m
SOFT_MAX(type=f32,ne=[16,2,32,1],mask=0,scale=0.100000,max_bias=8.000000): [SOFT_MAX] NMSE = 0.040746960 > 0.000000100 ←[1;31mFAIL←[0m
SOFT_MAX(type=f32,ne=[32,2,32,1],mask=1,scale=0.100000,max_bias=8.000000): [SOFT_MAX] NMSE = 0.016692125 > 0.000000100 ←[1;31mFAIL←[0m
1365/1407 tests passed
Backend SYCL: ←[1;31mFAIL←[0m
I do not have enough time right now, as I have some higher-priority tasks to look into. Since this does not seem to be an issue introduced by this PR, could you create an issue for it and ping me there? Then I can come back and take a look when I have more bandwidth.
@AidanBeltonS could you please rebase onto the latest master to resolve the format CI failure? Thanks.