Error in MUL_MAT with GGML_VULKAN_CHECK_RESULTS

Open. SRHMorris opened this issue 1 year ago • 1 comment

I'm using whisper.cpp at commit c4e1861d2c24b186cbbac6c07480aaa298b0e6d9, compiled with GGML_VULKAN=ON and GGML_VULKAN_CHECK_RESULTS=ON (I enabled result checking while debugging a very poor transcription on a specific GPU). The check fails in MUL_MAT with the following output:

...
421751 node_307 op=ADD avg_err=0
421752 node_310 op=MUL_MAT avg_err=0.00160538
421753 node_311 op=SOFT_MAX avg_err=0.00104975
ERROR: avg_err=0.106302 in MUL_MAT (check 421754)
tensor=00000151636AEB90 tensor->name=node_312 tensor->type: f32 ne0=64 nb0=4 ne1=1 nb1=256 ne2=8 nb2=256 ne3=1 nb3=2048 offset=0
src0=00000151636AE740 op=VIEW type=f16 ne0=1500 nb0=2 ne1=64 nb1=3000 ne2=8 nb2=192000 ne3=1 nb3=1536000 offset=7680000
src1=00000151636AEA20 op=SOFT_MAX type=f32 ne0=1500 nb0=4 ne1=1 nb1=6000 ne2=8 nb2=6000 ne3=1 nb3=48000 offset=0
First error: result=-0.58789 correct=-0.362305 i3=0 i2=0 i1=0 i0=0

Result:
               0       1       2       3       4       5       6       7       8       9
      0:   -0.59
      1:   -0.19
      2:    1.50
      3:    0.31
      4:    1.36
      5:    1.56
      6:   -1.42
      7:   -0.06
      8:   -0.38
      9:   -0.13

Correct:
               0       1       2       3       4       5       6       7       8       9
      0:   -0.36
      1:    0.17
      2:    0.26
      3:   -0.04
      4:    0.20
      5:   -0.30
      6:   -0.06
      7:   -0.88
      8:   -0.49
      9:   -0.83

MUL_MAT gpu=1
 VIEW gpu=1
  NONE gpu=1
 SOFT_MAX gpu=1
  MUL_MAT gpu=1
   VIEW gpu=1
    NONE gpu=1
   PERMUTE gpu=1
    RESHAPE gpu=1
     ADD gpu=1
      MUL_MAT gpu=1
       NONE gpu=1
       ADD gpu=1
        MUL gpu=1
         NORM gpu=1
          ADD gpu=1
         NONE gpu=1
        NONE gpu=1
      NONE gpu=1
C:\Users\...\whisper.cpp\ggml\src\ggml-vulkan.cpp:7367: fatal error

I'm unsure if this is related to the poor transcription or not, as I also get a similar issue on a GPU that gives a good transcription. The above message is from an AMD RX 7900 XT.
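For readers decoding the tensor lines in the log, here is a small sketch that re-enters the printed `ne`/`nb` values and checks two things: that each tensor's byte strides are densely packed, and that the shapes follow ggml's `mul_mat` rule (the result takes `src0.ne1` as its first dimension and `src1`'s remaining dimensions). The `TensorDesc` class and helper names are hypothetical and exist only for this illustration; the numbers are copied from the log above. With `src1.ne1 = 1`, each of the 8 batches is a matrix-vector product, which fits the focus on a mat-vec shader, although the log alone does not say which shader variant was dispatched.

```python
from dataclasses import dataclass


@dataclass
class TensorDesc:
    """Hypothetical container for the fields printed in the log."""
    name: str
    elem_size: int  # bytes per element: 2 for f16, 4 for f32
    ne: tuple       # number of elements per dimension (ne0..ne3)
    nb: tuple       # stride in bytes per dimension (nb0..nb3)


def is_densely_packed(t: TensorDesc) -> bool:
    """True if every stride equals the size of the dimensions below it."""
    expected = t.elem_size
    for n, b in zip(t.ne, t.nb):
        if b != expected:
            return False
        expected *= n
    return True


def mul_mat_result_shape(a: TensorDesc, b: TensorDesc) -> tuple:
    """Shape rule of ggml's mul_mat: a and b share ne0 (the reduced dimension)."""
    assert a.ne[0] == b.ne[0], "inner dimensions must match"
    return (a.ne[1], b.ne[1], b.ne[2], b.ne[3])


# Values copied from the failing check in the log.
src0 = TensorDesc("src0 (VIEW)", 2, (1500, 64, 8, 1), (2, 3000, 192000, 1536000))
src1 = TensorDesc("src1 (SOFT_MAX)", 4, (1500, 1, 8, 1), (4, 6000, 6000, 48000))
dst = TensorDesc("node_312", 4, (64, 1, 8, 1), (4, 256, 256, 2048))

for t in (src0, src1, dst):
    print(t.name, "densely packed:", is_densely_packed(t))

print("expected result shape:", mul_mat_result_shape(src0, src1))
print("logged result shape:  ", dst.ne)
```

Under these assumptions the logged shapes are self-consistent, so the log does not point to a malformed tensor description; the mismatch is in the computed values.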

I can see that mul_mat_vec.comp contains some barrier() calls, but it also has an early return before them. Since barrier() must be executed by all work items in a workgroup, this is undefined behaviour. Could this be the cause of the bug?
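The hazard described above can be illustrated with a CPU analogy. This is only a sketch: it uses Python's `threading.Barrier` to stand in for a workgroup barrier, and the workgroup size, row count, and function names are made up for the example rather than taken from mul_mat_vec.comp. On a CPU the missing participant shows up as a timeout; on a GPU the behaviour is undefined, so it may instead surface as wrong values like those in the log, or as nothing at all on some hardware.

```python
import threading

WORKGROUP_SIZE = 4  # hypothetical number of invocations in one workgroup
ROWS = 3            # hypothetical problem size: invocation 3 is out of range


def run_workgroup(early_return: bool) -> list:
    """Run one simulated workgroup and report what each invocation did."""
    barrier = threading.Barrier(WORKGROUP_SIZE)
    outcomes = [None] * WORKGROUP_SIZE

    def invocation(i: int) -> None:
        if early_return and i >= ROWS:
            # Out-of-range invocation leaves before the barrier.
            outcomes[i] = "returned early"
            return
        # Safe pattern: an out-of-range invocation would skip its work here
        # but still fall through to the barrier with everyone else.
        try:
            barrier.wait(timeout=0.5)
            outcomes[i] = "passed barrier"
        except threading.BrokenBarrierError:
            # The barrier never saw all participants arrive.
            outcomes[i] = "stuck at barrier"

    threads = [threading.Thread(target=invocation, args=(i,))
               for i in range(WORKGROUP_SIZE)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return outcomes


print("all reach barrier:", run_workgroup(early_return=False))
print("one returns early:", run_workgroup(early_return=True))
```

The usual remedy is the one the safe branch models: guard the per-invocation work with a bounds check instead of returning, so that every invocation still reaches each barrier.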

SRHMorris, Aug 29 '24 15:08

> I can see that mul_mat_vec.comp contains some barrier() calls, but it also has an early return before them. Since barrier() must be executed by all work items in a workgroup, this is undefined behaviour. Could this be the cause of the bug?

I agree it was undefined behavior and might cause such a bug. I've removed this early return in https://github.com/ggerganov/llama.cpp/commit/772703c8fffdd83d2e28f60119e83525f1189412. Can you retry after that commit?

jeffbolznv, Nov 18 '24 18:11