[SYCL] Add support for soft_max ALiBi

Open AidanBeltonS opened this issue 1 year ago • 4 comments

This PR updates the SYCL implementation of softmax to support ALiBi. The implementation is based on the existing CUDA implementation.

This has been tested on an A100 GPU, and all softmax tests pass with this change.

AidanBeltonS, Feb 21 '24 16:02
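For readers unfamiliar with what "softmax with ALiBi" computes, the following is a minimal reference sketch in Python, not the SYCL kernel from this PR. It assumes the per-head slope formula from the ALiBi paper as it is commonly implemented (a base derived from `max_bias` and the largest power of two not exceeding the head count), and it assumes the bias is added as `slope * pos[i]` before the softmax. The function names `alibi_slope` and `soft_max_row` and the `pos` argument are hypothetical and chosen for illustration only.

```python
import math


def alibi_slope(head, n_head, max_bias):
    """Per-head ALiBi slope (assumed formula); 1.0 when ALiBi is disabled."""
    if max_bias <= 0.0:
        return 1.0
    # Largest power of two that does not exceed the number of heads.
    n_head_log2 = 1 << int(math.floor(math.log2(n_head)))
    m0 = 2.0 ** (-max_bias / n_head_log2)
    m1 = 2.0 ** (-(max_bias / 2.0) / n_head_log2)
    if head < n_head_log2:
        return m0 ** (head + 1)
    return m1 ** (2 * (head - n_head_log2) + 1)


def soft_max_row(x, scale=1.0, mask=None, pos=None, slope=1.0):
    """Softmax over one row of scale * x, plus optional mask and slope * pos."""
    logits = []
    for i, value in enumerate(x):
        v = value * scale
        if mask is not None:
            v += mask[i]          # additive mask, -inf hides a position
        if pos is not None:
            v += slope * pos[i]   # ALiBi positional bias (assumed form)
        logits.append(v)
    # Subtract the row maximum before exponentiating for numerical stability.
    row_max = max(logits)
    exps = [math.exp(v - row_max) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

With `max_bias=8.0` and 8 heads, as in the `max_bias=8.000000` test cases, this sketch gives slopes of 1/2, 1/4, and so on down to 1/256 for heads 0 through 7.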

@NeoZhangJianyu, @abhilash1910, @Alcpz, please review

AidanBeltonS, Feb 21 '24 16:02

Thanks @AidanBeltonS, I was interested in this feature. I will review this. Tagging: https://github.com/ggerganov/llama.cpp/pull/5488

abhilash1910, Feb 21 '24 16:02

Not sure whether this is expected: all tests pass on the GPU MAX1100, but some fail on the MTL iGPU (f16 off, Win11, Intel(R) oneAPI DPC++/C++ Compiler 2024.0.2 (2024.0.2.20231213)):

C:\Users\gta\Documents\llama.cpp\build>bin\test-backend-ops.exe test -b SYCL0 -o SOFT_MAX
Testing 5 backends

Backend 1/5 (CPU)
  Skipping
Backend 2/5 (SYCL0)
GGML_SYCL_DEBUG=0
ggml_init_sycl: GGML_SYCL_F16:   no
ggml_init_sycl: SYCL_USE_XMX: yes
found 4 SYCL devices:
  Device 0: Intel(R) Graphics i gfx-driver-ci-master-15876 DCH-I RI,    compute capability 1.3,
        max compute_units 128,  max work group size 1024,       max sub group size 32,  global mem size 2635530240
  Device 1: Intel(R) Graphics i gfx-driver-ci-master-15876 DCH-I RI,    compute capability 3.0,
        max compute_units 128,  max work group size 1024,       max sub group size 32,  global mem size 2635530240
  Device 2: Intel(R) Core(TM) Ultra 7 1003H,    compute capability 3.0,
        max compute_units 22,   max work group size 8192,       max sub group size 64,  global mem size 3961380864
  Device 3: Intel(R) FPGA Emulation Device,     compute capability 1.2,
        max compute_units 22,   max work group size 67108864,   max sub group size 64,  global mem size 3961380864
Using device 0 (Intel(R) Graphics i gfx-driver-ci-master-15876 DCH-I RI) as main device
  Backend name: SYCL
  SOFT_MAX(type=f32,ne=[16,16,1,1],mask=0,scale=1.000000,max_bias=0.000000): ←[1;32mOK←[0m
  SOFT_MAX(type=f32,ne=[15,15,1,1],mask=0,scale=1.000000,max_bias=0.000000): ←[1;32mOK←[0m
  SOFT_MAX(type=f32,ne=[16,1024,1,1],mask=0,scale=1.000000,max_bias=0.000000): ←[1;32mOK←[0m
  SOFT_MAX(type=f32,ne=[15,1023,1,1],mask=0,scale=1.000000,max_bias=0.000000): ←[1;32mOK←[0m
  SOFT_MAX(type=f32,ne=[1024,16,1,1],mask=0,scale=1.000000,max_bias=0.000000): ←[1;32mOK←[0m
  SOFT_MAX(type=f32,ne=[1023,15,1,1],mask=0,scale=1.000000,max_bias=0.000000): ←[1;32mOK←[0m
  SOFT_MAX(type=f32,ne=[1024,1024,1,1],mask=0,scale=1.000000,max_bias=0.000000): [SOFT_MAX] NMSE = 0.000411710 > 0.000000100 ←[1;31mFAIL←[0m
  SOFT_MAX(type=f32,ne=[1023,1023,1,1],mask=0,scale=1.000000,max_bias=0.000000): [SOFT_MAX] NMSE = 0.004343327 > 0.000000100 ←[1;31mFAIL←[0m
  SOFT_MAX(type=f32,ne=[16,16,1,1],mask=0,scale=0.100000,max_bias=0.000000): ←[1;32mOK←[0m
  SOFT_MAX(type=f32,ne=[15,15,1,1],mask=0,scale=0.100000,max_bias=0.000000): ←[1;32mOK←[0m
  SOFT_MAX(type=f32,ne=[16,1024,1,1],mask=0,scale=0.100000,max_bias=0.000000): ←[1;32mOK←[0m
  SOFT_MAX(type=f32,ne=[15,1023,1,1],mask=0,scale=0.100000,max_bias=0.000000): ←[1;32mOK←[0m
  SOFT_MAX(type=f32,ne=[1024,16,1,1],mask=0,scale=0.100000,max_bias=0.000000): ←[1;32mOK←[0m
  SOFT_MAX(type=f32,ne=[1023,15,1,1],mask=0,scale=0.100000,max_bias=0.000000): ←[1;32mOK←[0m
  SOFT_MAX(type=f32,ne=[1024,1024,1,1],mask=0,scale=0.100000,max_bias=0.000000): [SOFT_MAX] NMSE = 0.000009382 > 0.000000100 ←[1;31mFAIL←[0m
  SOFT_MAX(type=f32,ne=[1023,1023,1,1],mask=0,scale=0.100000,max_bias=0.000000): [SOFT_MAX] NMSE = 0.000014476 > 0.000000100 ←[1;31mFAIL←[0m
  SOFT_MAX(type=f32,ne=[16,16,1,1],mask=0,scale=1.000000,max_bias=8.000000): ←[1;32mOK←[0m
  SOFT_MAX(type=f32,ne=[15,15,1,1],mask=0,scale=1.000000,max_bias=8.000000): ←[1;32mOK←[0m
  SOFT_MAX(type=f32,ne=[16,1024,1,1],mask=0,scale=1.000000,max_bias=8.000000): ←[1;32mOK←[0m
  SOFT_MAX(type=f32,ne=[15,1023,1,1],mask=0,scale=1.000000,max_bias=8.000000): ←[1;32mOK←[0m
  SOFT_MAX(type=f32,ne=[1024,16,1,1],mask=0,scale=1.000000,max_bias=8.000000): ←[1;32mOK←[0m
  SOFT_MAX(type=f32,ne=[1023,15,1,1],mask=0,scale=1.000000,max_bias=8.000000): ←[1;32mOK←[0m
  SOFT_MAX(type=f32,ne=[1024,1024,1,1],mask=0,scale=1.000000,max_bias=8.000000): [SOFT_MAX] NMSE = 0.000759554 > 0.000000100 ←[1;31mFAIL←[0m
  SOFT_MAX(type=f32,ne=[1023,1023,1,1],mask=0,scale=1.000000,max_bias=8.000000): [SOFT_MAX] NMSE = 0.001201117 > 0.000000100 ←[1;31mFAIL←[0m
  SOFT_MAX(type=f32,ne=[16,16,1,1],mask=0,scale=0.100000,max_bias=8.000000): ←[1;32mOK←[0m
  SOFT_MAX(type=f32,ne=[15,15,1,1],mask=0,scale=0.100000,max_bias=8.000000): ←[1;32mOK←[0m
  SOFT_MAX(type=f32,ne=[16,1024,1,1],mask=0,scale=0.100000,max_bias=8.000000): ←[1;32mOK←[0m
  SOFT_MAX(type=f32,ne=[15,1023,1,1],mask=0,scale=0.100000,max_bias=8.000000): ←[1;32mOK←[0m
  SOFT_MAX(type=f32,ne=[1024,16,1,1],mask=0,scale=0.100000,max_bias=8.000000): ←[1;32mOK←[0m
  SOFT_MAX(type=f32,ne=[1023,15,1,1],mask=0,scale=0.100000,max_bias=8.000000): ←[1;32mOK←[0m
  SOFT_MAX(type=f32,ne=[1024,1024,1,1],mask=0,scale=0.100000,max_bias=8.000000): [SOFT_MAX] NMSE = 0.000001676 > 0.000000100 ←[1;31mFAIL←[0m
  SOFT_MAX(type=f32,ne=[1023,1023,1,1],mask=0,scale=0.100000,max_bias=8.000000): [SOFT_MAX] NMSE = 0.000009981 > 0.000000100 ←[1;31mFAIL←[0m
  SOFT_MAX(type=f32,ne=[16,16,1,1],mask=1,scale=1.000000,max_bias=0.000000): ←[1;32mOK←[0m
  SOFT_MAX(type=f32,ne=[15,15,1,1],mask=1,scale=1.000000,max_bias=0.000000): ←[1;32mOK←[0m
  SOFT_MAX(type=f32,ne=[16,1024,1,1],mask=1,scale=1.000000,max_bias=0.000000): ←[1;32mOK←[0m
  SOFT_MAX(type=f32,ne=[15,1023,1,1],mask=1,scale=1.000000,max_bias=0.000000): ←[1;32mOK←[0m
  SOFT_MAX(type=f32,ne=[1024,16,1,1],mask=1,scale=1.000000,max_bias=0.000000): ←[1;32mOK←[0m
  SOFT_MAX(type=f32,ne=[1023,15,1,1],mask=1,scale=1.000000,max_bias=0.000000): ←[1;32mOK←[0m
  SOFT_MAX(type=f32,ne=[1024,1024,1,1],mask=1,scale=1.000000,max_bias=0.000000): [SOFT_MAX] NMSE = 0.012946110 > 0.000000100 ←[1;31mFAIL←[0m
  SOFT_MAX(type=f32,ne=[1023,1023,1,1],mask=1,scale=1.000000,max_bias=0.000000): [SOFT_MAX] NMSE = 0.017667745 > 0.000000100 ←[1;31mFAIL←[0m
  SOFT_MAX(type=f32,ne=[16,16,1,1],mask=1,scale=0.100000,max_bias=0.000000): ←[1;32mOK←[0m
  SOFT_MAX(type=f32,ne=[15,15,1,1],mask=1,scale=0.100000,max_bias=0.000000): ←[1;32mOK←[0m
  SOFT_MAX(type=f32,ne=[16,1024,1,1],mask=1,scale=0.100000,max_bias=0.000000): ←[1;32mOK←[0m
  SOFT_MAX(type=f32,ne=[15,1023,1,1],mask=1,scale=0.100000,max_bias=0.000000): ←[1;32mOK←[0m
  SOFT_MAX(type=f32,ne=[1024,16,1,1],mask=1,scale=0.100000,max_bias=0.000000): ←[1;32mOK←[0m
  SOFT_MAX(type=f32,ne=[1023,15,1,1],mask=1,scale=0.100000,max_bias=0.000000): ←[1;32mOK←[0m
  SOFT_MAX(type=f32,ne=[1024,1024,1,1],mask=1,scale=0.100000,max_bias=0.000000): [SOFT_MAX] NMSE = 0.001769479 > 0.000000100 ←[1;31mFAIL←[0m
  SOFT_MAX(type=f32,ne=[1023,1023,1,1],mask=1,scale=0.100000,max_bias=0.000000): [SOFT_MAX] NMSE = 0.004120808 > 0.000000100 ←[1;31mFAIL←[0m
  SOFT_MAX(type=f32,ne=[16,16,1,1],mask=1,scale=1.000000,max_bias=8.000000): ←[1;32mOK←[0m
  SOFT_MAX(type=f32,ne=[15,15,1,1],mask=1,scale=1.000000,max_bias=8.000000): ←[1;32mOK←[0m
  SOFT_MAX(type=f32,ne=[16,1024,1,1],mask=1,scale=1.000000,max_bias=8.000000): ←[1;32mOK←[0m
  SOFT_MAX(type=f32,ne=[15,1023,1,1],mask=1,scale=1.000000,max_bias=8.000000): ←[1;32mOK←[0m
  SOFT_MAX(type=f32,ne=[1024,16,1,1],mask=1,scale=1.000000,max_bias=8.000000): ←[1;32mOK←[0m
  SOFT_MAX(type=f32,ne=[1023,15,1,1],mask=1,scale=1.000000,max_bias=8.000000): ←[1;32mOK←[0m
  SOFT_MAX(type=f32,ne=[1024,1024,1,1],mask=1,scale=1.000000,max_bias=8.000000): [SOFT_MAX] NMSE = 0.009613720 > 0.000000100 ←[1;31mFAIL←[0m
  SOFT_MAX(type=f32,ne=[1023,1023,1,1],mask=1,scale=1.000000,max_bias=8.000000): [SOFT_MAX] NMSE = 0.006988272 > 0.000000100 ←[1;31mFAIL←[0m
  SOFT_MAX(type=f32,ne=[16,16,1,1],mask=1,scale=0.100000,max_bias=8.000000): ←[1;32mOK←[0m
  SOFT_MAX(type=f32,ne=[15,15,1,1],mask=1,scale=0.100000,max_bias=8.000000): ←[1;32mOK←[0m
  SOFT_MAX(type=f32,ne=[16,1024,1,1],mask=1,scale=0.100000,max_bias=8.000000): ←[1;32mOK←[0m
  SOFT_MAX(type=f32,ne=[15,1023,1,1],mask=1,scale=0.100000,max_bias=8.000000): ←[1;32mOK←[0m
  SOFT_MAX(type=f32,ne=[1024,16,1,1],mask=1,scale=0.100000,max_bias=8.000000): ←[1;32mOK←[0m
  SOFT_MAX(type=f32,ne=[1023,15,1,1],mask=1,scale=0.100000,max_bias=8.000000): ←[1;32mOK←[0m
  SOFT_MAX(type=f32,ne=[1024,1024,1,1],mask=1,scale=0.100000,max_bias=8.000000): [SOFT_MAX] NMSE = 0.000894425 > 0.000000100 ←[1;31mFAIL←[0m
  SOFT_MAX(type=f32,ne=[1023,1023,1,1],mask=1,scale=0.100000,max_bias=8.000000): [SOFT_MAX] NMSE = 0.002199995 > 0.000000100 ←[1;31mFAIL←[0m
  SOFT_MAX(type=f32,ne=[16,2,32,1],mask=0,scale=0.100000,max_bias=0.000000): ←[1;32mOK←[0m
  SOFT_MAX(type=f32,ne=[32,2,32,1],mask=1,scale=0.100000,max_bias=0.000000): ←[1;32mOK←[0m
  SOFT_MAX(type=f32,ne=[16,2,32,1],mask=0,scale=0.100000,max_bias=8.000000): ←[1;32mOK←[0m
  SOFT_MAX(type=f32,ne=[32,2,32,1],mask=1,scale=0.100000,max_bias=8.000000): ←[1;32mOK←[0m
  1391/1407 tests passed
  Backend SYCL: ←[1;31mFAIL←[0m

airMeng, Feb 22 '24 11:02
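The failing lines in the log above report `NMSE = ... > 0.000000100`, meaning the normalized mean squared error between the backend output and the reference exceeded 1e-7. As a rough illustration of that check, here is a small sketch that assumes the usual definition of NMSE (sum of squared differences divided by the sum of squared reference values). The function name `nmse`, the constant `MAX_NMSE`, and the choice of which array counts as the reference are assumptions, not taken from the test harness.

```python
# Threshold as printed in the log lines above (0.000000100).
MAX_NMSE = 1e-7


def nmse(reference, candidate):
    """Normalized mean squared error of candidate against reference."""
    squared_error = sum((r - c) ** 2 for r, c in zip(reference, candidate))
    reference_energy = sum(r * r for r in reference)
    return squared_error / reference_energy


def passes(reference, candidate, max_nmse=MAX_NMSE):
    """True when the candidate is within the NMSE tolerance."""
    return nmse(reference, candidate) <= max_nmse
```

Under this definition, an NMSE of around 1e-2 to 1e-3, as seen for the `[1024,1024,1,1]` cases, is several orders of magnitude above the tolerance, which suggests a real computation error rather than rounding noise.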

Thanks for testing. Are these tests passing on the tip of master without my changes?

AidanBeltonS, Feb 22 '24 12:02

Oh, there are even more failures on tip. Do you have enough bandwidth to take a look?

C:\Users\gta\source\repos\llama.cpp\build>bin\test-backend-ops.exe test -b SYCL0 -o SOFT_MAX
Testing 5 backends

Backend 1/5 (CPU)
  Skipping
Backend 2/5 (SYCL0)
GGML_SYCL_DEBUG=0
ggml_init_sycl: GGML_SYCL_F16:   no
ggml_init_sycl: SYCL_USE_XMX: yes
found 4 SYCL devices:
  Device 0: Intel(R) Graphics i gfx-driver-ci-master-15876 DCH-I RI,    compute capability 1.3,
        max compute_units 128,  max work group size 1024,       max sub group size 32,  global mem size 2635530240
  Device 1: Intel(R) Graphics i gfx-driver-ci-master-15876 DCH-I RI,    compute capability 3.0,
        max compute_units 128,  max work group size 1024,       max sub group size 32,  global mem size 2635530240
  Device 2: Intel(R) Core(TM) Ultra 7 1003H,    compute capability 3.0,
        max compute_units 22,   max work group size 8192,       max sub group size 64,  global mem size 3961380864
  Device 3: Intel(R) FPGA Emulation Device,     compute capability 1.2,
        max compute_units 22,   max work group size 67108864,   max sub group size 64,  global mem size 3961380864
Using device 0 (Intel(R) Graphics i gfx-driver-ci-master-15876 DCH-I RI) as main device
  Backend name: SYCL
  SOFT_MAX(type=f32,ne=[16,16,1,1],mask=0,scale=1.000000,max_bias=0.000000): ←[1;32mOK←[0m
  SOFT_MAX(type=f32,ne=[15,15,1,1],mask=0,scale=1.000000,max_bias=0.000000): ←[1;32mOK←[0m
  SOFT_MAX(type=f32,ne=[16,1024,1,1],mask=0,scale=1.000000,max_bias=0.000000): ←[1;32mOK←[0m
  SOFT_MAX(type=f32,ne=[15,1023,1,1],mask=0,scale=1.000000,max_bias=0.000000): ←[1;32mOK←[0m
  SOFT_MAX(type=f32,ne=[1024,16,1,1],mask=0,scale=1.000000,max_bias=0.000000): ←[1;32mOK←[0m
  SOFT_MAX(type=f32,ne=[1023,15,1,1],mask=0,scale=1.000000,max_bias=0.000000): ←[1;32mOK←[0m
  SOFT_MAX(type=f32,ne=[1024,1024,1,1],mask=0,scale=1.000000,max_bias=0.000000): [SOFT_MAX] NMSE = 0.001620635 > 0.000000100 ←[1;31mFAIL←[0m
  SOFT_MAX(type=f32,ne=[1023,1023,1,1],mask=0,scale=1.000000,max_bias=0.000000): [SOFT_MAX] NMSE = 0.001864620 > 0.000000100 ←[1;31mFAIL←[0m
  SOFT_MAX(type=f32,ne=[16,16,1,1],mask=0,scale=0.100000,max_bias=0.000000): ←[1;32mOK←[0m
  SOFT_MAX(type=f32,ne=[15,15,1,1],mask=0,scale=0.100000,max_bias=0.000000): ←[1;32mOK←[0m
  SOFT_MAX(type=f32,ne=[16,1024,1,1],mask=0,scale=0.100000,max_bias=0.000000): ←[1;32mOK←[0m
  SOFT_MAX(type=f32,ne=[15,1023,1,1],mask=0,scale=0.100000,max_bias=0.000000): ←[1;32mOK←[0m
  SOFT_MAX(type=f32,ne=[1024,16,1,1],mask=0,scale=0.100000,max_bias=0.000000): ←[1;32mOK←[0m
  SOFT_MAX(type=f32,ne=[1023,15,1,1],mask=0,scale=0.100000,max_bias=0.000000): ←[1;32mOK←[0m
  SOFT_MAX(type=f32,ne=[1024,1024,1,1],mask=0,scale=0.100000,max_bias=0.000000): [SOFT_MAX] NMSE = 0.000008599 > 0.000000100 ←[1;31mFAIL←[0m
  SOFT_MAX(type=f32,ne=[1023,1023,1,1],mask=0,scale=0.100000,max_bias=0.000000): [SOFT_MAX] NMSE = 0.000005020 > 0.000000100 ←[1;31mFAIL←[0m
  SOFT_MAX(type=f32,ne=[16,16,1,1],mask=0,scale=1.000000,max_bias=8.000000): [SOFT_MAX] NMSE = 0.000003769 > 0.000000100 ←[1;31mFAIL←[0m
  SOFT_MAX(type=f32,ne=[15,15,1,1],mask=0,scale=1.000000,max_bias=8.000000): [SOFT_MAX] NMSE = 0.000004918 > 0.000000100 ←[1;31mFAIL←[0m
  SOFT_MAX(type=f32,ne=[16,1024,1,1],mask=0,scale=1.000000,max_bias=8.000000): [SOFT_MAX] NMSE = 0.000004890 > 0.000000100 ←[1;31mFAIL←[0m
  SOFT_MAX(type=f32,ne=[15,1023,1,1],mask=0,scale=1.000000,max_bias=8.000000): [SOFT_MAX] NMSE = 0.000004527 > 0.000000100 ←[1;31mFAIL←[0m
  SOFT_MAX(type=f32,ne=[1024,16,1,1],mask=0,scale=1.000000,max_bias=8.000000): [SOFT_MAX] NMSE = 0.000005075 > 0.000000100 ←[1;31mFAIL←[0m
  SOFT_MAX(type=f32,ne=[1023,15,1,1],mask=0,scale=1.000000,max_bias=8.000000): [SOFT_MAX] NMSE = 0.000005348 > 0.000000100 ←[1;31mFAIL←[0m
  SOFT_MAX(type=f32,ne=[1024,1024,1,1],mask=0,scale=1.000000,max_bias=8.000000): [SOFT_MAX] NMSE = 0.002153705 > 0.000000100 ←[1;31mFAIL←[0m
  SOFT_MAX(type=f32,ne=[1023,1023,1,1],mask=0,scale=1.000000,max_bias=8.000000): [SOFT_MAX] NMSE = 0.000791032 > 0.000000100 ←[1;31mFAIL←[0m
  SOFT_MAX(type=f32,ne=[16,16,1,1],mask=0,scale=0.100000,max_bias=8.000000): [SOFT_MAX] NMSE = 0.000005779 > 0.000000100 ←[1;31mFAIL←[0m
  SOFT_MAX(type=f32,ne=[15,15,1,1],mask=0,scale=0.100000,max_bias=8.000000): [SOFT_MAX] NMSE = 0.000004213 > 0.000000100 ←[1;31mFAIL←[0m
  SOFT_MAX(type=f32,ne=[16,1024,1,1],mask=0,scale=0.100000,max_bias=8.000000): [SOFT_MAX] NMSE = 0.000003770 > 0.000000100 ←[1;31mFAIL←[0m
  SOFT_MAX(type=f32,ne=[15,1023,1,1],mask=0,scale=0.100000,max_bias=8.000000): [SOFT_MAX] NMSE = 0.000006002 > 0.000000100 ←[1;31mFAIL←[0m
  SOFT_MAX(type=f32,ne=[1024,16,1,1],mask=0,scale=0.100000,max_bias=8.000000): [SOFT_MAX] NMSE = 0.000004923 > 0.000000100 ←[1;31mFAIL←[0m
  SOFT_MAX(type=f32,ne=[1023,15,1,1],mask=0,scale=0.100000,max_bias=8.000000): [SOFT_MAX] NMSE = 0.000005270 > 0.000000100 ←[1;31mFAIL←[0m
  SOFT_MAX(type=f32,ne=[1024,1024,1,1],mask=0,scale=0.100000,max_bias=8.000000): [SOFT_MAX] NMSE = 0.000017064 > 0.000000100 ←[1;31mFAIL←[0m
  SOFT_MAX(type=f32,ne=[1023,1023,1,1],mask=0,scale=0.100000,max_bias=8.000000): [SOFT_MAX] NMSE = 0.000010618 > 0.000000100 ←[1;31mFAIL←[0m
  SOFT_MAX(type=f32,ne=[16,16,1,1],mask=1,scale=1.000000,max_bias=0.000000): ←[1;32mOK←[0m
  SOFT_MAX(type=f32,ne=[15,15,1,1],mask=1,scale=1.000000,max_bias=0.000000): ←[1;32mOK←[0m
  SOFT_MAX(type=f32,ne=[16,1024,1,1],mask=1,scale=1.000000,max_bias=0.000000): ←[1;32mOK←[0m
  SOFT_MAX(type=f32,ne=[15,1023,1,1],mask=1,scale=1.000000,max_bias=0.000000): ←[1;32mOK←[0m
  SOFT_MAX(type=f32,ne=[1024,16,1,1],mask=1,scale=1.000000,max_bias=0.000000): ←[1;32mOK←[0m
  SOFT_MAX(type=f32,ne=[1023,15,1,1],mask=1,scale=1.000000,max_bias=0.000000): ←[1;32mOK←[0m
  SOFT_MAX(type=f32,ne=[1024,1024,1,1],mask=1,scale=1.000000,max_bias=0.000000): [SOFT_MAX] NMSE = 0.028220173 > 0.000000100 ←[1;31mFAIL←[0m
  SOFT_MAX(type=f32,ne=[1023,1023,1,1],mask=1,scale=1.000000,max_bias=0.000000): [SOFT_MAX] NMSE = 0.010782086 > 0.000000100 ←[1;31mFAIL←[0m
  SOFT_MAX(type=f32,ne=[16,16,1,1],mask=1,scale=0.100000,max_bias=0.000000): ←[1;32mOK←[0m
  SOFT_MAX(type=f32,ne=[15,15,1,1],mask=1,scale=0.100000,max_bias=0.000000): ←[1;32mOK←[0m
  SOFT_MAX(type=f32,ne=[16,1024,1,1],mask=1,scale=0.100000,max_bias=0.000000): ←[1;32mOK←[0m
  SOFT_MAX(type=f32,ne=[15,1023,1,1],mask=1,scale=0.100000,max_bias=0.000000): ←[1;32mOK←[0m
  SOFT_MAX(type=f32,ne=[1024,16,1,1],mask=1,scale=0.100000,max_bias=0.000000): ←[1;32mOK←[0m
  SOFT_MAX(type=f32,ne=[1023,15,1,1],mask=1,scale=0.100000,max_bias=0.000000): ←[1;32mOK←[0m
  SOFT_MAX(type=f32,ne=[1024,1024,1,1],mask=1,scale=0.100000,max_bias=0.000000): [SOFT_MAX] NMSE = 0.005262445 > 0.000000100 ←[1;31mFAIL←[0m
  SOFT_MAX(type=f32,ne=[1023,1023,1,1],mask=1,scale=0.100000,max_bias=0.000000): [SOFT_MAX] NMSE = 0.004439145 > 0.000000100 ←[1;31mFAIL←[0m
  SOFT_MAX(type=f32,ne=[16,16,1,1],mask=1,scale=1.000000,max_bias=8.000000): [SOFT_MAX] NMSE = 0.000004765 > 0.000000100 ←[1;31mFAIL←[0m
  SOFT_MAX(type=f32,ne=[15,15,1,1],mask=1,scale=1.000000,max_bias=8.000000): [SOFT_MAX] NMSE = 0.000003865 > 0.000000100 ←[1;31mFAIL←[0m
  SOFT_MAX(type=f32,ne=[16,1024,1,1],mask=1,scale=1.000000,max_bias=8.000000): [SOFT_MAX] NMSE = 0.000003030 > 0.000000100 ←[1;31mFAIL←[0m
  SOFT_MAX(type=f32,ne=[15,1023,1,1],mask=1,scale=1.000000,max_bias=8.000000): [SOFT_MAX] NMSE = 0.000004553 > 0.000000100 ←[1;31mFAIL←[0m
  SOFT_MAX(type=f32,ne=[1024,16,1,1],mask=1,scale=1.000000,max_bias=8.000000): [SOFT_MAX] NMSE = 0.000005057 > 0.000000100 ←[1;31mFAIL←[0m
  SOFT_MAX(type=f32,ne=[1023,15,1,1],mask=1,scale=1.000000,max_bias=8.000000): [SOFT_MAX] NMSE = 0.000004978 > 0.000000100 ←[1;31mFAIL←[0m
  SOFT_MAX(type=f32,ne=[1024,1024,1,1],mask=1,scale=1.000000,max_bias=8.000000): [SOFT_MAX] NMSE = 0.031724865 > 0.000000100 ←[1;31mFAIL←[0m
  SOFT_MAX(type=f32,ne=[1023,1023,1,1],mask=1,scale=1.000000,max_bias=8.000000): [SOFT_MAX] NMSE = 0.016813136 > 0.000000100 ←[1;31mFAIL←[0m
  SOFT_MAX(type=f32,ne=[16,16,1,1],mask=1,scale=0.100000,max_bias=8.000000): [SOFT_MAX] NMSE = 0.000004357 > 0.000000100 ←[1;31mFAIL←[0m
  SOFT_MAX(type=f32,ne=[15,15,1,1],mask=1,scale=0.100000,max_bias=8.000000): [SOFT_MAX] NMSE = 0.000004029 > 0.000000100 ←[1;31mFAIL←[0m
  SOFT_MAX(type=f32,ne=[16,1024,1,1],mask=1,scale=0.100000,max_bias=8.000000): [SOFT_MAX] NMSE = 0.000003563 > 0.000000100 ←[1;31mFAIL←[0m
  SOFT_MAX(type=f32,ne=[15,1023,1,1],mask=1,scale=0.100000,max_bias=8.000000): [SOFT_MAX] NMSE = 0.000005769 > 0.000000100 ←[1;31mFAIL←[0m
  SOFT_MAX(type=f32,ne=[1024,16,1,1],mask=1,scale=0.100000,max_bias=8.000000): [SOFT_MAX] NMSE = 0.000005256 > 0.000000100 ←[1;31mFAIL←[0m
  SOFT_MAX(type=f32,ne=[1023,15,1,1],mask=1,scale=0.100000,max_bias=8.000000): [SOFT_MAX] NMSE = 0.000005167 > 0.000000100 ←[1;31mFAIL←[0m
  SOFT_MAX(type=f32,ne=[1024,1024,1,1],mask=1,scale=0.100000,max_bias=8.000000): [SOFT_MAX] NMSE = 0.005744414 > 0.000000100 ←[1;31mFAIL←[0m
  SOFT_MAX(type=f32,ne=[1023,1023,1,1],mask=1,scale=0.100000,max_bias=8.000000): [SOFT_MAX] NMSE = 0.002133673 > 0.000000100 ←[1;31mFAIL←[0m
  SOFT_MAX(type=f32,ne=[16,2,32,1],mask=0,scale=0.100000,max_bias=0.000000): ←[1;32mOK←[0m
  SOFT_MAX(type=f32,ne=[32,2,32,1],mask=1,scale=0.100000,max_bias=0.000000): ←[1;32mOK←[0m
  SOFT_MAX(type=f32,ne=[16,2,32,1],mask=0,scale=0.100000,max_bias=8.000000): [SOFT_MAX] NMSE = 0.040746960 > 0.000000100 ←[1;31mFAIL←[0m
  SOFT_MAX(type=f32,ne=[32,2,32,1],mask=1,scale=0.100000,max_bias=8.000000): [SOFT_MAX] NMSE = 0.016692125 > 0.000000100 ←[1;31mFAIL←[0m
  1365/1407 tests passed
  Backend SYCL: ←[1;31mFAIL←[0m

airMeng, Feb 23 '24 01:02

I do not have enough time right now, as I have some higher-priority tasks to look into. Since this does not seem to be an issue introduced by this PR, could you create an issue for it and ping me there? Then I can come back and take a look when I have more bandwidth.

AidanBeltonS, Feb 23 '24 10:02

@AidanBeltonS, could you please rebase onto the latest master to resolve the format CI failure? Thanks.

abhilash1910, Feb 26 '24 11:02