
With the newest builds I only get gibberish output

Open maddes8cht opened this issue 1 year ago • 44 comments

After the CUDA refactor PR #1703 by @JohannesGaessler was merged, I wanted to try it out this morning and measure the performance difference on my hardware. I use my standard prompts with different models of different sizes.

I use the prebuilt win-cublas-cu12.1.0-x64 versions.

With the new builds I only get gibberish as a response for all prompts used and all models. It looks like a random mix of words in different languages.

On my current PC I can only use the win-avx-x64 version; there I still get normal output.

I will be back at the CUDA PC in a few hours; then I can provide sample output or more details. Am I the only one with this problem?

maddes8cht avatar Jun 07 '23 08:06 maddes8cht

Same here. It gives gibberish output only when layers are offloaded to the GPU via -ngl. Without offloading it works as it should. I had to roll back to the pre-CUDA-refactor commit.

RahulVivekNair avatar Jun 07 '23 08:06 RahulVivekNair

Same here; this only happens when offloading layers to the GPU, while running on the CPU works fine. Also, I noticed that the more GPU layers you offload, the more gibberish you get.

dranger003 avatar Jun 07 '23 11:06 dranger003

Thank you for reporting this issue. Just to make sure: are you getting garbage outputs with all model sizes, even with 7b?

JohannesGaessler avatar Jun 07 '23 14:06 JohannesGaessler

Also to make sure: is there anyone with this issue that is compiling llama.cpp themselves or is everyone using the precompiled Windows binary?

JohannesGaessler avatar Jun 07 '23 14:06 JohannesGaessler

I tried different models and model sizes, and they all produce gibberish using GPU layers but work fine using CPU. Also, I am compiling from the latest commit on the master branch, using Windows and cmake.

dranger003 avatar Jun 07 '23 14:06 dranger003

In llama.cpp line 1158 there should be:

        vram_scratch = n_batch * MB;

Someone that is experiencing the issue please try to replace that line with this:

        vram_scratch = 4 * n_batch * MB;

Ideally the problem is just that Windows for whatever reason needs more VRAM scratch than Linux and in that case the fix would be to just use a bigger scratch buffer. Otherwise it may be that I'm accidentally relying on undefined behavior somewhere and the fix will be more difficult.
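As a quick sanity check on the numbers: assuming MB here is 1024 * 1024 bytes and the default n_batch of 512, the original line allocates a 512 MB scratch buffer and the modified one allocates 2048 MB.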

JohannesGaessler avatar Jun 07 '23 15:06 JohannesGaessler

Yes, I agree with @dranger003 above: a local compile does not fix the issue. I also tried the cublas and clblas options; both produce gibberish. I only have one GPU. Do I need any new command-line options?

Will try the change that @JohannesGaessler suggests above.

FlareP1 avatar Jun 07 '23 15:06 FlareP1

Thank you for reporting this issue. Just to make sure: are you getting garbage outputs with all model sizes, even with 7b?

I've tried it with all the quantization types and model sizes. It still produces weird gibberish output.

RahulVivekNair avatar Jun 07 '23 15:06 RahulVivekNair

In llama.cpp line 1158 there should be:

        vram_scratch = n_batch * MB;

Someone that is experiencing the issue please try to replace that line with this:

        vram_scratch = 4 * n_batch * MB;

Ideally the problem is just that Windows for whatever reason needs more VRAM scratch than Linux and in that case the fix would be to just use a bigger scratch buffer. Otherwise it may be that I'm accidentally relying on undefined behavior somewhere and the fix will be more difficult.

Same issue on my end with this change.

diff --git a/llama.cpp b/llama.cpp
index 16d6f6e..e06d503 100644
--- a/llama.cpp
+++ b/llama.cpp
@@ -1155,7 +1155,7 @@ static void llama_model_load_internal(

         (void) vram_scratch;
 #ifdef GGML_USE_CUBLAS
-        vram_scratch = n_batch * MB;
+        vram_scratch = 4 * n_batch * MB;
         ggml_cuda_set_scratch_size(vram_scratch);
         if (n_gpu_layers > 0) {
             fprintf(stderr, "%s: allocating batch_size x 1 MB = %ld MB VRAM for the scratch buffer\n",

dranger003 avatar Jun 07 '23 15:06 dranger003

In llama.cpp line 1158 there should be:

        vram_scratch = n_batch * MB;

Someone that is experiencing the issue please try to replace that line with this:

        vram_scratch = 4 * n_batch * MB;

Ideally the problem is just that Windows for whatever reason needs more VRAM scratch than Linux and in that case the fix would be to just use a bigger scratch buffer. Otherwise it may be that I'm accidentally relying on undefined behavior somewhere and the fix will be more difficult.

This does not fix the issue for me.

main: build = 635 (5c64a09)
main: seed  = 1686151114
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3080
llama.cpp: loading model from ../models/GGML/selfee-13b.ggmlv3.q5_1.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32001
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 9 (mostly Q5_1)
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size =    0.09 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required  = 6820.77 MB (+ 1608.00 MB per state)
llama_model_load_internal: allocating batch_size x 1 MB = 2048 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 20 layers to GPU
llama_model_load_internal: total VRAM used: 6587 MB
..................................................
llama_init_from_file: kv self size  = 1600.00 MB

system_info: n_threads = 5 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 0, tfs_z = 1.000000, top_p = 0.730000, typical_p = 1.000000, temp = 0.730000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 2048, n_batch = 512, n_predict = 2048, n_keep = 0


[33m ### Instruction:\
Write a detailed account on the benefits and challenges of using automated assistants based on LLMs.  Suggest how LLMs are likely to be used in the near future. What effect this will have on employment and skills needs in the workplace.  How will businesses need to adapt and evolve to maximise the benefits from this technology.\
### Response:
[0m benefits &amp Vertigowebsitesearch engines Search engines Search google search Google searchGoogle search engineGoogle search engine Google search engine Google search engineiallyikuwaμClientele clientele Clientele Clientele Clientele Clientele Clientele Clientele Clientele Clientele Clientele Clientele Clientele Clientele Clientele Clientele Clientele Clientele Clientele Clientele Clientele Clientele Clientele Clientele Clientele Clientele Clientele Clientele Clientele Clientele Clientele ClienteleClientele Clientele ClienteleClienteleClienteleClientele Clientele ClienteleClientele clientele Clientele ClienteleClienteleClientele Clientele Clientele ClienteleClienteleClientele ClienteleClientele Clientele ClienteleClientele Clientele ClienteleClientele Clientele ClienteleClientilehiyawaifikMT machine learning Sebastianurity securitysecurity Security Security Securityintegration integration integration integration integration integration integration integration integration integration integration integration integration integration integrationintegration integration integration integration integration integration Linkexchangeabletonikonidéortodoxyfit wittersburgidé修 connaissanceable magnpackageeuropewo meshnetworkedayoutWEvikipediawikiidéangrhythmembergelesupportente Witmaternalismsavedrabblementқreb

FlareP1 avatar Jun 07 '23 15:06 FlareP1

Alright. I currently don't have CUDA installed on my Windows partition but I'll go ahead and install it to see if I can reproduce the issue.

JohannesGaessler avatar Jun 07 '23 15:06 JohannesGaessler

Alright. I currently don't have CUDA installed on my Windows partition but I'll go ahead and install it to see if I can reproduce the issue.

Thanks, this is the command I use to compile on my end.

cmake -DLLAMA_CUBLAS=ON . && cmake --build . --config Release
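A typical reproduction run with that build would then look something like this (a sketch; it assumes the binaries land under bin\Release, and the model path and layer count are placeholders):

.\bin\Release\main.exe -m ..\models\model.ggmlv3.q4_0.bin -ngl 20 -p "Hello"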

dranger003 avatar Jun 07 '23 15:06 dranger003

Alright. I currently don't have CUDA installed on my Windows partition but I'll go ahead and install it to see if I can reproduce the issue.

Is it working as intended on Linux?

RahulVivekNair avatar Jun 07 '23 15:06 RahulVivekNair

It is working as intended on my machines which all run Linux. The first step for me to make a fix is to be able to reproduce the issue and the difference in operating system is the first thing that seems worthwhile to check. Of course, if someone that is experiencing the issue could confirm whether or not they have the same problem on Linux that would be very useful to me.

JohannesGaessler avatar Jun 07 '23 15:06 JohannesGaessler

It is working as intended on my machines which all run Linux. The first step for me to make a fix is to be able to reproduce the issue and the difference in operating system is the first thing that seems worthwhile to check. Of course, if someone that is experiencing the issue could confirm whether or not they have the same problem on Linux that would be very useful to me.

When I compile this under WSL2 and run with -ngl 0 it works OK. When I run with -ngl 10 I get CUDA error 2 at /home/ubuntu/llama.cpp/ggml-cuda.cu:1241: out of memory, even though I should have plenty free (10 GB) when only requesting 10 layers on a 7B model.

However, I have never run under WSL before, so it might be another issue with my setup; accelerated prompt processing is OK though.

Which commit do I need to pull to try a rebuild from before the issue occurred?

FlareP1 avatar Jun 07 '23 15:06 FlareP1

Did you revert the change that increases the size of the VRAM scratch buffer? In any case, since the GPU changes that I did are most likely the problem the last good commit should be 44f906e8537fcec965e312d621c80556d6aa9bec.
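For anyone who wants to test that revision, a minimal rollback sketch (assuming local edits to llama.cpp are discarded first and using the same cmake invocation as above):

git checkout -- llama.cpp
git checkout 44f906e8537fcec965e312d621c80556d6aa9bec
cmake -DLLAMA_CUBLAS=ON . && cmake --build . --config Release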

JohannesGaessler avatar Jun 07 '23 15:06 JohannesGaessler

Did you revert the change that increases the size of the VRAM scratch buffer? In any case, since the GPU changes that I did are most likely the problem the last good commit should be 44f906e8537fcec965e312d621c80556d6aa9bec.

I don't have a Linux install to test at the moment, but on Windows I confirm commit 44f906e8537fcec965e312d621c80556d6aa9bec works fine with all GPU layers offloaded.

dranger003 avatar Jun 07 '23 15:06 dranger003

It is working as intended on my machines which all run Linux. The first step for me to make a fix is to be able to reproduce the issue and the difference in operating system is the first thing that seems worthwhile to check. Of course, if someone that is experiencing the issue could confirm whether or not they have the same problem on Linux that would be very useful to me.

When I compile this under WSL2 and run with -ngl 0 it works OK. When I run with -ngl 10 I get CUDA error 2 at /home/ubuntu/llama.cpp/ggml-cuda.cu:1241: out of memory, even though I should have plenty free (10 GB) when only requesting 10 layers on a 7B model.

However, I have never run under WSL before, so it might be another issue with my setup; accelerated prompt processing is OK though.

Which commit do I need to pull to try a rebuild from before the issue occurred?

Could this be related?

WSL: CUDA error 2 at ggml-cuda.cu:359: out of memory (Fix found) https://github.com/ggerganov/llama.cpp/issues/1230

dranger003 avatar Jun 07 '23 16:06 dranger003

I have reverted the changes and checked out the 44f906e8537fcec965e312d621c80556d6aa9bec commit. On my version of WSL2 this still does not work and gives the same out-of-memory error, so I guess I probably have a WSL/CUDA setup issue.

On Windows I can compile and the code works fine from this commit.

FlareP1 avatar Jun 07 '23 16:06 FlareP1

It is working as intended on my machines which all run Linux. The first step for me to make a fix is to be able to reproduce the issue and the difference in operating system is the first thing that seems worthwhile to check. Of course, if someone that is experiencing the issue could confirm whether or not they have the same problem on Linux that would be very useful to me.

Working fine on WSL2 (Ubuntu) using CUDA on commit 5c64a09, so on my end this seems to be a Windows-only bug.

dranger003 avatar Jun 07 '23 16:06 dranger003

It is working as intended on my machines which all run Linux. The first step for me to make a fix is to be able to reproduce the issue and the difference in operating system is the first thing that seems worthwhile to check. Of course, if someone that is experiencing the issue could confirm whether or not they have the same problem on Linux that would be very useful to me.

When I compile this under WSL2 and run with -ngl 0 it works OK. When I run with -ngl 10 I get CUDA error 2 at /home/ubuntu/llama.cpp/ggml-cuda.cu:1241: out of memory, even though I should have plenty free (10 GB) when only requesting 10 layers on a 7B model. However, I have never run under WSL before, so it might be another issue with my setup; accelerated prompt processing is OK though. Which commit do I need to pull to try a rebuild from before the issue occurred?

Could this be related?

WSL: CUDA error 2 at ggml-cuda.cu:359: out of memory (Fix found) #1230

Awesome tip, thanks.

Due to a current CUDA bug you need to set the no-pinned environment variable. The command for it: export GGML_CUDA_NO_PINNED=1

Now the old commit works under WSL; I will try the latest again.

UPDATE: Yes, it works fine on the latest commit under WSL2, as long as pinned memory is disabled.
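For a single run the variable can also be set inline, roughly like this (a sketch; it assumes a make-style build where the binary is ./main, and the model path is a placeholder):

GGML_CUDA_NO_PINNED=1 ./main -m ./models/7B/ggml-model-q4_0.bin -ngl 10 -p "Hello"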

FlareP1 avatar Jun 07 '23 16:06 FlareP1

Can confirm: ran under WSL and the output is as expected. The gibberish output problem is only on the Windows side.

RahulVivekNair avatar Jun 07 '23 17:06 RahulVivekNair

I have bad news: on my main desktop I am not experiencing the bug when using Windows. I'll try setting up llama.cpp on the other machine that I have.

JohannesGaessler avatar Jun 07 '23 17:06 JohannesGaessler

I have bad news: on my main desktop I am not experiencing the bug when using Windows. I'll try setting up llama.cpp on the other machine that I have.

If it helps I am using the following config.

Microsoft Windows [Version 10.0.19044.2965]

>nvidia-smi
Wed Jun  7 19:48:35 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 531.61                 Driver Version: 531.61       CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                      TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3080       WDDM | 00000000:01:00.0  On |                  N/A |
| 40%   27C    P8               16W / 320W|    936MiB / 10240MiB |      1%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

FlareP1 avatar Jun 07 '23 18:06 FlareP1

And here's mine.

Microsoft Windows [Version 10.0.22621.1778]
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.98                 Driver Version: 535.98       CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                     TCC/WDDM  | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3090      WDDM  | 00000000:2D:00.0  On |                  N/A |
|  0%   45C    P8              32W / 420W |  12282MiB / 24576MiB |     44%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

And also this one.

Microsoft Windows [Version 10.0.22621.1702]
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.98                 Driver Version: 535.98       CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                     TCC/WDDM  | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 2070 ...  WDDM  | 00000000:01:00.0  On |                  N/A |
| N/A   52C    P8               4W /  80W |    265MiB /  8192MiB |     10%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+

dranger003 avatar Jun 07 '23 19:06 dranger003

Here's mine:

Microsoft Windows [Version 10.0.19045.3031]
(c) Microsoft Corporation. All rights reserved.

C:\Users\Mathias>nvidia-smi
Wed Jun  7 21:20:43 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 531.61                 Driver Version: 531.61       CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                      TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3060       WDDM | 00000000:0A:00.0  On |                  N/A |
|  0%   41C    P8               16W / 170W|    704MiB / 12288MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

Maybe it is a Windows 10 thing? @JohannesGaessler, are you using Windows 10 or Windows 11? Is anyone having problems on Windows 11? Just a guess.

maddes8cht avatar Jun 07 '23 19:06 maddes8cht

Maybe it is a Windows 10 thing? @JohannesGaessler, are you using Windows 10 or Windows 11? Is anyone having problems on Windows 11? Just a guess.

I'm on Windows 11 and I have the issue; build 10.0.22621.1778 is Windows 11, by the way.

dranger003 avatar Jun 07 '23 19:06 dranger003

I'm now able to reproduce the issue. On my system it only occurs when I use the --config Release option. If I make a debug build by omitting the option the program produces correct results.
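With the Visual Studio generator the two configurations can be compared roughly like this (a sketch; with a multi-config generator, omitting --config Release builds the Debug configuration by default):

cmake --build .
cmake --build . --config Release

The first command produces the Debug build that works here; the second produces the Release build that reproduces the gibberish.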

JohannesGaessler avatar Jun 07 '23 19:06 JohannesGaessler

I have bad news: on my main desktop I am not experiencing the bug when using Windows. I'll try setting up llama.cpp on the other machine that I have.

@JohannesGaessler does the pre-built Windows release work for you, or only your local compile? I am wondering if it is a compiler configuration issue on Windows that is also affecting the pre-built binaries.

Is there anyone else who can run the latest CUDA builds on Windows without this error?

EDIT: Oh, I see you found part of the issue. So is it an uninitialised variable or something that differs in a Release build?

FlareP1 avatar Jun 07 '23 19:06 FlareP1

Same problem here:

Windows 10 Version 10.0.19045.2846

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.98                 Driver Version: 535.98       CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                     TCC/WDDM  | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3070 Ti   WDDM  | 00000000:01:00.0  On |                  N/A |
|  0%   35C    P8              13W / 290W |    635MiB /  8192MiB |      4%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

Folko-Ven avatar Jun 07 '23 19:06 Folko-Ven