llama.cpp
Llava generates gibberish on the Vulkan backend
When using llava-cli or the multimodal mode of the server on the Vulkan backend, the language model generates gibberish output, regardless of the number of layers offloaded to the GPU.
~I guess there is something going wrong when processing the image embeddings, which confuses the language model.~ I'm pretty sure the image embeddings are not generated properly.
Examples
(Using llava 1.6 mistral q4_k_m with its f16 mmproj (running in 1.5 mode), but the same happens with other multimodal models)
llava-cli
./llava-cli.exe -m .\models\llava\llava-v1.6-mistral-7b.Q4_K_M.gguf --mmproj .\models\llava\v1.6-mmproj-model-f16.gguf --image "./image.png" -p "Describe the attached image in detail." -s 0 --temp 0.3 -ngl 99
Vulkan backend:
encode_image_with_clip: image embedding created: 576 tokens
encode_image_with_clip: image encoded in 1541.91 ms by CLIP ( 2.68 ms per image patch)
``ïüôÇ
llama_print_timings: load time = 4051.34 ms
llama_print_timings: sample time = 0.71 ms / 7 runs ( 0.10 ms per token, 9817.67 tokens per second)
llama_print_timings: prompt eval time = 4068.00 ms / 621 tokens ( 6.55 ms per token, 152.65 tokens per second)
llama_print_timings: eval time = 183.36 ms / 7 runs ( 26.19 ms per token, 38.18 tokens per second)
llama_print_timings: total time = 7718.43 ms / 628 tokens
CLBlast backend for comparison:
encode_image_with_clip: image embedding created: 576 tokens
encode_image_with_clip: image encoded in 2876.00 ms by CLIP ( 4.99 ms per image patch)
The image shows a robot with humanoid features, standing in front of a chalkboard. The robot has a metallic body and is equipped with what appears to be a pencil holder on its left arm. It has two arms, each ending in a hand that holds a pencil. The robot's head is round with a small screen at the top center. Its face includes two eyes and a mouth, giving it a friendly appearance.
The robot is positioned slightly to the right of the chalkboard, which is filled with mathematical equations and diagrams. On the chalkboard, there are various mathematical symbols such as plus signs, minus signs, and equal signs. The background suggests an educational setting, possibly a classroom or a study room, with a desk visible on the left side of the image.
The robot's pose is relaxed, with one hand resting on its hip while the other holds a pencil. It seems to be engaged in teaching or learning, given its position in front of the chalkboard and the presence of the pencil holder. The overall scene conveys a sense of curiosity and education.
llama_print_timings: load time = 6585.78 ms
llama_print_timings: sample time = 26.84 ms / 235 runs ( 0.11 ms per token, 8755.91 tokens per second)
llama_print_timings: prompt eval time = 10332.51 ms / 621 tokens ( 16.64 ms per token, 60.10 tokens per second)
llama_print_timings: eval time = 15362.48 ms / 235 runs ( 65.37 ms per token, 15.30 tokens per second)
llama_print_timings: total time = 30347.34 ms / 856 tokens
Server
(default system prompt and characters, temperature set to 0)
Vulkan backend:
CLBlast backend:
Edit:
Also fails with the new "real" llava 1.6 support.
OK, I did some debugging, and it seems that the generation of the embeddings with CLIP is what is broken here. Which confuses me, because as far as I can tell by glancing at the code, this should behave exactly the same way as the CPU backend, since Vulkan isn't even mentioned in clip.cpp.
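(For reference, a debug print along these lines is enough to produce the comparison below; this is a minimal sketch, where `dump_image_embedding` and its call site are hypothetical and `image_embed->embd` assumes the llava example's embedding struct:)

```cpp
// Hypothetical debug helper (sketch): print the first n values of the CLIP image
// embedding so the CPU, CLBlast and Vulkan builds can be compared side by side.
#include <cstdio>

static void dump_image_embedding(const float * embd, int n) {
    for (int i = 0; i < n; i++) {
        printf("embedding[%d]= %f\n", i, embd[i]);
    }
}

// e.g. in llava-cli, after the image embedding has been created:
//   dump_image_embedding(image_embed->embd, 10);
```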
Here are the first 10 embedding values of the same image when using different backends:
CPU:
embedding[0]= -0.618760
embedding[1]= -0.000194
embedding[2]= -0.642467
embedding[3]= -0.554067
embedding[4]= -0.936747
embedding[5]= 0.174165
embedding[6]= 0.209959
embedding[7]= -0.146537
embedding[8]= -0.000779
embedding[9]= -0.261529
CLBlast:
embedding[0]= -0.619080
embedding[1]= 0.000488
embedding[2]= -0.642737
embedding[3]= -0.553680
embedding[4]= -0.936375
embedding[5]= 0.173905
embedding[6]= 0.209494
embedding[7]= -0.146290
embedding[8]= -0.000921
embedding[9]= -0.261672
Vulkan:
embedding[0]= -1.072125
embedding[1]= -1.206946
embedding[2]= 0.319142
embedding[3]= -1.670123
embedding[4]= -2.271779
embedding[5]= 1.247169
embedding[6]= 0.792215
embedding[7]= -0.570173
embedding[8]= 0.390381
embedding[9]= -0.901546
(Results are unaffected by -ngl, which makes sense since CLIP isn't using the GPU on any of these backends.)
It's obvious that the results are very similar between the CLBlast and default CPU backends (and both are working as expected), while Vulkan's are completely different.
Thanks, I'm seeing the same! I thought there was something wrong in my ollama implementation, but it looks like there isn't and this is a problem with the Vulkan backend in llama.cpp.
I think the bug is confirmed by now. Tagging @ggerganov , maybe he knows the origin of this behaviour.
Make sure all ops used in llava are exercised in test-backend-ops and the tests with the Vulkan backend are passing.
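(For reference, op coverage lives in tests/test-backend-ops.cpp and is added roughly as in the sketch below; the specific test_* struct names and their default constructors are assumptions to be checked against that file.)

```cpp
// Sketch of a fragment for tests/test-backend-ops.cpp: make sure the ops CLIP/llava
// relies on are exercised on every backend. The test_* types below are assumptions.
test_cases.emplace_back(new test_acc());      // ACC      - reported as missing on Vulkan
test_cases.emplace_back(new test_im2col());   // IM2COL   - used by the CLIP patch embedding
test_cases.emplace_back(new test_soft_max()); // SOFT_MAX - shows large avg_err further down
```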
Thanks! I'll take a look. Can you give me the right pointers in code?
Since the embeddings look wrong, this seems to be a bug in CLIP, so I'm looking here: https://github.com/ggerganov/llama.cpp/blob/master/examples/llava/clip.cpp Is this the right place to look?
First observation: I see an init for cuBLAS and Metal, but no such init for Vulkan. For example:
#ifdef GGML_USE_CUBLAS
new_clip->backend = ggml_backend_cuda_init(0);
printf("%s: CLIP using CUDA backend\n", __func__);
#endif
#ifdef GGML_USE_METAL
new_clip->backend = ggml_backend_metal_init();
printf("%s: CLIP using Metal backend\n", __func__);
#endif
@ggerganov do we need similar code for Vulkan?
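For illustration, an analogous block for Vulkan could look like the sketch below, assuming the Vulkan backend's ggml_backend_vk_init(device) entry point; whether clip.cpp should actually do this (rather than staying on CPU) is exactly the open question:

```cpp
#ifdef GGML_USE_VULKAN
    // Sketch (assumption): initialize the Vulkan backend for CLIP on device 0,
    // mirroring the CUDA/Metal init blocks above.
    new_clip->backend = ggml_backend_vk_init(0);
    printf("%s: CLIP using Vulkan backend\n", __func__);
#endif
```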
I tried to run CLIP with a Vulkan backend, and it failed with:
ggml_vulkan: Error: Missing op: ACC
It looks like the Vulkan backend does not support this operation needed for CLIP, so it runs on a CPU backend anyway.
This makes the problem even more mysterious. If CLIP is running on CPU anyway, why are we getting weird behaviour with Vulkan?
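(For completeness, one way to check whether a backend claims support for a given op is the public ggml-backend API; the sketch below builds a dummy ACC op and asks the Vulkan backend about it. Header paths and the tensor shapes here are assumptions.)

```cpp
// Sketch: ask the Vulkan backend whether it supports the ACC op that CLIP needs.
#include "ggml.h"
#include "ggml-backend.h"
#include "ggml-vulkan.h"
#include <cstdio>

int main() {
    ggml_backend_t backend = ggml_backend_vk_init(0);
    if (!backend) { fprintf(stderr, "no Vulkan backend available\n"); return 1; }

    // no_alloc context: we only need tensor metadata to query op support
    struct ggml_init_params params = { 16 * 1024 * 1024, nullptr, true };
    struct ggml_context * ctx = ggml_init(params);

    struct ggml_tensor * a   = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 64, 64);
    struct ggml_tensor * b   = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 16, 16);
    struct ggml_tensor * acc = ggml_acc(ctx, a, b, a->nb[1], a->nb[2], a->nb[3], 0);

    printf("Vulkan supports ACC: %s\n",
           ggml_backend_supports_op(backend, acc) ? "yes" : "no");

    ggml_free(ctx);
    ggml_backend_free(backend);
    return 0;
}
```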
Likely due to this:
https://github.com/ggerganov/llama.cpp/blob/78aacf36344df724cdca9f1e1af849b2d2519cb8/ggml.c#L15393-L15399
It's the logic for offloading CPU ops to the Vulkan backend by copying the data back and forth. Probably there is a bug there.
Continuing to debug this. Enabling GGML_VULKAN_CHECK_RESULTS causes compilation to fail with:
llama.cpp/ggml-vulkan.cpp: In function ‘void ggml_vk_check_results_1(ggml_backend_vk_context*, ggml_compute_params*, ggml_tensor*)’:
llama.cpp/ggml-vulkan.cpp:5744:61: error: base operand of ‘->’ has non-pointer type ‘vk_buffer_ref’ {aka ‘std::weak_ptr<vk_buffer_struct>’}
5744 | if (extra->offset + tensor_size >= extra->buffer_gpu->size) {
| ^~
llama.cpp/ggml-vulkan.cpp:5745:44: error: base operand of ‘->’ has non-pointer type ‘vk_buffer_ref’ {aka ‘std::weak_ptr<vk_buffer_struct>’}
5745 | tensor_size = extra->buffer_gpu->size - (extra->offset);
| ^~
llama.cpp/ggml-vulkan.cpp:5748:41: error: invalid initialization of reference of type ‘vk_buffer&’ {aka ‘std::shared_ptr<vk_buffer_struct>&’} from expression of type ‘vk_buffer_ref’ {aka ‘std::weak_ptr<vk_buffer_struct>’}
5748 | ggml_vk_buffer_read(ctx, extra->buffer_gpu, extra->offset, tensor_data, tensor_size);
| ~~~~~~~^~~~~~~~~~
llama.cpp/ggml-vulkan.cpp:1894:75: note: in passing argument 2 of ‘void ggml_vk_buffer_read(ggml_backend_vk_context*, vk_buffer&, size_t, void*, size_t)’
1894 | static void ggml_vk_buffer_read(ggml_backend_vk_context * ctx, vk_buffer& src, size_t offset, void * dst, size_t size) {
| ~~~~~~~~~~~^~~
make[3]: *** [CMakeFiles/ggml.dir/build.make:132: CMakeFiles/ggml.dir/ggml-vulkan.cpp.o] Error 1
make[2]: *** [CMakeFiles/Makefile2:742: CMakeFiles/ggml.dir/all] Error 2
make[1]: *** [CMakeFiles/Makefile2:2907: examples/server/CMakeFiles/ext_server.dir/rule] Error 2
Looks unrelated
Fixed these errors here: https://github.com/ggerganov/llama.cpp/pull/5813
After fixing them, enabling GGML_VULKAN_CHECK_RESULTS and filtering for the largest errors gives the following log:
12 kq_soft_max_ext-18 op=SOFT_MAX backend=10 avg_err=0.0321356
35 kq_soft_max_ext-19 op=SOFT_MAX backend=10 avg_err=0.0282426
58 kq_soft_max_ext-20 op=SOFT_MAX backend=10 avg_err=0.0290884
81 kq_soft_max_ext-21 op=SOFT_MAX backend=10 avg_err=0.0291806
104 kq_soft_max_ext-22 op=SOFT_MAX backend=10 avg_err=0.0302923
127 kq_soft_max_ext-23 op=SOFT_MAX backend=10 avg_err=0.0200351
150 kq_soft_max_ext-24 op=SOFT_MAX backend=10 avg_err=0.0261448
173 kq_soft_max_ext-25 op=SOFT_MAX backend=10 avg_err=0.0229881
196 kq_soft_max_ext-26 op=SOFT_MAX backend=10 avg_err=0.0202856
219 kq_soft_max_ext-27 op=SOFT_MAX backend=10 avg_err=0.0198218
242 kq_soft_max_ext-28 op=SOFT_MAX backend=10 avg_err=0.0153048
265 kq_soft_max_ext-29 op=SOFT_MAX backend=10 avg_err=0.0140466
288 kq_soft_max_ext-30 op=SOFT_MAX backend=10 avg_err=0.0101896
311 kq_soft_max_ext-31 op=SOFT_MAX backend=10 avg_err=0.0320347
694 kq_soft_max_ext-18 op=SOFT_MAX backend=10 avg_err=0.0178342
717 kq_soft_max_ext-19 op=SOFT_MAX backend=10 avg_err=0.014998
740 kq_soft_max_ext-20 op=SOFT_MAX backend=10 avg_err=0.0170977
763 kq_soft_max_ext-21 op=SOFT_MAX backend=10 avg_err=0.017451
786 kq_soft_max_ext-22 op=SOFT_MAX backend=10 avg_err=0.0162033
809 kq_soft_max_ext-23 op=SOFT_MAX backend=10 avg_err=0.013214
832 kq_soft_max_ext-24 op=SOFT_MAX backend=10 avg_err=0.0143036
855 kq_soft_max_ext-25 op=SOFT_MAX backend=10 avg_err=0.0152926
878 kq_soft_max_ext-26 op=SOFT_MAX backend=10 avg_err=0.012964
901 kq_soft_max_ext-27 op=SOFT_MAX backend=10 avg_err=0.0139679
947 kq_soft_max_ext-29 op=SOFT_MAX backend=10 avg_err=0.0115121
993 kq_soft_max_ext-31 op=SOFT_MAX backend=10 avg_err=0.0195857
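(For context on the metric: GGML_VULKAN_CHECK_RESULTS re-runs each op on the CPU and compares the results; avg_err is presumably an average elementwise error along the lines of the self-contained sketch below, not the actual ggml-vulkan.cpp code.)

```cpp
// Sketch: average absolute difference between a backend's output and the CPU
// reference for the same tensor. For a softmax output, whose values lie in [0, 1],
// averages around 0.03 are large compared to ordinary float32 rounding error.
#include <cmath>
#include <cstddef>

static double avg_abs_err(const float * ref, const float * out, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        sum += std::fabs((double) ref[i] - (double) out[i]);
    }
    return n > 0 ? sum / (double) n : 0.0;
}
```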
@ggerganov do you think the numbers above indicate a problem with the softmax implementation in Vulkan?
Pinging @0cc4m for thoughts on this
@ddpasa Thank you for looking into this and apologies for not replying to this issue earlier.
I haven't personally tested llava at all, I think there are ops missing. That softmax error is also too high, there's something wrong there. Which quant were you using? I'm preparing a larger update to the Vulkan backend in 0cc4m/vulkan-improvements. You could check if that branch already fixed this soft-max issue.
It might also be related to the alibi changes that I've yet to incorporate into Vulkan. I'll see if I can get im2col, acc and whatever other op is missing into the update too this weekend.
Thanks @0cc4m!
I switched to your branch, and can indeed confirm that the softmax errors are lower. The next group of large errors is on MUL_MAT, which seems to behave similarly in both branches. Some examples are below:
4 Qcur-18 op=MUL_MAT backend=10 avg_err=0.00314652
6 Kcur-18 op=MUL_MAT backend=10 avg_err=0.00361636
8 Vcur-18 op=MUL_MAT backend=10 avg_err=0.00255978
19 ffn_gate-18 op=MUL_MAT backend=10 avg_err=0.00156138
21 ffn_up-18 op=MUL_MAT backend=10 avg_err=0.00142493
27 Qcur-19 op=MUL_MAT backend=10 avg_err=0.00310669
29 Kcur-19 op=MUL_MAT backend=10 avg_err=0.00365042
31 Vcur-19 op=MUL_MAT backend=10 avg_err=0.00266333
(there are more, but I truncated them)
More importantly, the behaviour of llava is significantly changed! On mainline, I'm seeing empty or garbled output. On vulkan-improvements I get meaningful text, except that it's completely irrelevant to the image I provided. It's definitely a move in the right direction, but there is still something wrong. Maybe it's the MUL_MATs above.
@0cc4m On your branch, the answers are no longer complete gibberish, but they are still very incoherent, completely unrelated to the actual image, and/or in the wrong language. I can confirm @ddpasa's findings.
Example output:
An image of a building de la a sala de una parece a unite, esta es una: La imagen de una parece a unite, "La imagen est ilude que mostra la salva como oeste, "La imagen estou ser alta a uma, 'La imagen estou ser um, "Un peice de la sala, 'Este parece a un peice, "La imagen estou ser alta, 'La imagen estou ser alta, 'La imagen estou ser alta, 'La imagen estou ser alta, 'Este parece a un peice, "La imagen estou ser alta, 'La imagen estou ser alta, 'La imagen estou ser alta, 'La imagen estou ser alta, 'La imagen estou ser alta, 'La imagen estou ser alta, 'Este parece a un peice, "La imagen estou ser alta, 'La imagen estou ser alta, 'La imagen estou ser alta, 'La imagen estou ser alta, 'La imagen estou
CPU:
This is a movie still featuring a group of characters in what appears to be a rustic kitchen or dining room setting. There are six individuals visible, with four women and two men. The women are dressed in period costumes that suggest a historical or fantasy context. They seem to be engaged in conversation or some sort of gathering around a large wooden table.
The men are seated at the table, which is set with various dishes and utensils, indicating a mealtime scene. One man is holding a cup, possibly enjoying a drink. The room has a warm, inviting atmosphere with natural light coming in through windows on the left side of the image.
In the background, there are shelves filled with various items, including bottles and bowls, which contribute to the homely and lived-in feel of the scene. The overall mood of the image is one of camaraderie and shared experience among the characters.
@0cc4m is there anything else we can do to help debug this? It looks like the MUL_MATs have some error in them, but I'm not sure if that is the root cause. Please let us know if we can test or try out something.
No, not at the moment. I can reproduce it, but I haven't had the time to find and fix the issue. But it's on my todo list.
@0cc4m, I just tried again with the latest mainline (1b67731) and the behaviour is much better. What was the issue?
I'll run a more detailed test in the following days, but Vulkan behaviour looks much better now.
I didn't intentionally fix this yet, but maybe I found and fixed some related issue in the meantime. Let me know when you have more details.
Just ran more tests; it produces reasonable outputs. @stduhpf how about you?
A related question: for integrated GPUs, what memory limit should I use for Vulkan? Is this something vulkaninfo can tell me?
For example, running with 3GB of memory works, but 6GB crashes llama.cpp. My system has 16GB of main memory, so it should be fine, unless Vulkan can't access it for some reason.
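(On the vulkaninfo question: yes, vulkaninfo lists each device's memory heaps, and the same data can be queried with vkGetPhysicalDeviceMemoryProperties. A minimal standalone sketch, unrelated to llama.cpp itself:)

```cpp
// Minimal sketch: print each Vulkan device's memory heaps, which is what limits
// how much can be allocated on the GPU (vulkaninfo reports the same information).
#include <vulkan/vulkan.h>
#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
    VkApplicationInfo app_info = {};
    app_info.sType = VK_STRUCTURE_TYPE_APPLICATION_INFO;
    app_info.apiVersion = VK_API_VERSION_1_1;

    VkInstanceCreateInfo create_info = {};
    create_info.sType = VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO;
    create_info.pApplicationInfo = &app_info;

    VkInstance instance;
    if (vkCreateInstance(&create_info, nullptr, &instance) != VK_SUCCESS) {
        fprintf(stderr, "failed to create Vulkan instance\n");
        return 1;
    }

    uint32_t n_devices = 0;
    vkEnumeratePhysicalDevices(instance, &n_devices, nullptr);
    std::vector<VkPhysicalDevice> devices(n_devices);
    vkEnumeratePhysicalDevices(instance, &n_devices, devices.data());

    for (uint32_t d = 0; d < n_devices; d++) {
        VkPhysicalDeviceMemoryProperties mem = {};
        vkGetPhysicalDeviceMemoryProperties(devices[d], &mem);
        for (uint32_t h = 0; h < mem.memoryHeapCount; h++) {
            const bool device_local = mem.memoryHeaps[h].flags & VK_MEMORY_HEAP_DEVICE_LOCAL_BIT;
            printf("device %u heap %u: %.2f GiB%s\n", d, h,
                   mem.memoryHeaps[h].size / (1024.0 * 1024.0 * 1024.0),
                   device_local ? " (device local)" : "");
        }
    }

    vkDestroyInstance(instance, nullptr);
    return 0;
}
```

Note that on integrated GPUs the reported device-local heap is often a carve-out of system RAM, which may explain a crash well below the 16GB of main memory.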
I didn't intentionally fix this yet, but maybe I found and fixed some related issue in the meantime. Let me know when you have more details.
Just ran more tests; it produces reasonable outputs. @stduhpf how about you?
I'm still only getting gibberish output on the latest commit. (https://github.com/ggerganov/llama.cpp/commit/8228b66d)
Seems to be fixed for me since befddd0f15de6efb15d7e7f5b527dfb671f4196f