llama.cpp
Windows nvcc workaround
With my GPU changes in https://github.com/ggerganov/llama.cpp/pull/1703, Windows users were reporting garbage outputs: https://github.com/ggerganov/llama.cpp/issues/1735 . I think the problem is a bug in the Windows CUDA compiler related to instruction reordering or the optimizing-out of local variables. The problem can be fixed either by making a debug build or by adding assert statements. This alone would not be indicative of a compiler bug, since it's possible that I'm simply relying on undefined behavior somewhere that happens to work correctly unless the compiler does certain optimizations. However, I found that adding an assert statement for one of the variables I use for determining indices affects another assert statement for another variable.
Specifically: my code loops over ne02 and ne03 of the ggml tensors via i02 and i03 and then does a computation for each (i02, i03) bin. For multiple GPUs the edge between GPU data regions for split tensors falls into one of those bins. I'm using i01_low and i01_high to represent the lower and upper rows of a bin that a given GPU should work on. For a single GPU, i01_low == 0 and i01_high == ne01 should always be true. However, for some reason i01_high is becoming 0 on Windows, which in turn means that the GPU isn't actually doing any work. I triple-checked my code for determining the indices and I'm confident that it's correct. The only possible problem could be nrows0 not being a multiple of GGML_CUDA_DMMV_Y, but I've asserted that this is not the case, and the default value for GGML_CUDA_DMMV_Y is 1 anyways. I therefore think that the bug is caused by something outside llama.cpp, and the most likely suspect is the compiler, especially since that is something that differs between Windows and Linux.
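To make the invariant concrete, here is a minimal standalone sketch of the row-range computation described above. The variable names follow the description, but the helper `get_row_range` and the exact split formula are illustrative assumptions, not the actual code from ggml-cuda.cu; the assert mirrors the single-GPU invariant that the PR checks with GGML_ASSERT.

```cpp
// Standalone sketch, not the actual ggml-cuda.cu code: compute the row range
// [i01_low, i01_high) that one device should work on and check the invariant
// that is violated on the affected Windows release builds.
#include <cassert>
#include <cstdint>
#include <cstdio>

static const int64_t GGML_CUDA_DMMV_Y = 1; // default value mentioned above

static void get_row_range(int64_t ne01, int device, int device_count,
                          int64_t & i01_low, int64_t & i01_high) {
    // hypothetical split: each device gets a contiguous, roughly equal slice
    // of rows, rounded down to a multiple of GGML_CUDA_DMMV_Y
    i01_low  =  device      * ne01 / device_count;
    i01_high = (device + 1) * ne01 / device_count;
    i01_low  -= i01_low  % GGML_CUDA_DMMV_Y;
    i01_high -= i01_high % GGML_CUDA_DMMV_Y;
}

int main() {
    const int64_t ne01 = 4096;
    int64_t i01_low, i01_high;
    get_row_range(ne01, /*device=*/0, /*device_count=*/1, i01_low, i01_high);
    // single-GPU invariant from the description above; on the buggy Windows
    // builds i01_high was observed to come out as 0 instead of ne01
    assert(i01_low == 0 && i01_high == ne01);
    printf("device 0 works on rows [%lld, %lld)\n",
           (long long) i01_low, (long long) i01_high);
    return 0;
}
```

Compiled and run, this prints the full range [0, 4096) for a single device; on the affected Windows builds the analogous computation inside llama.cpp instead yields an upper bound of 0, so no rows get processed.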
Feedback would be highly appreciated. In particular, it would be useful for Windows users who experience the bug to check out this PR and confirm whether the fix works. Also, please try disabling the first assert on line 1520 of ggml-cuda.cu and confirm that this causes an assertion error (if there is only a single GPU in the system).
The patch is working on my end, and confirmed: commenting out line 1520 causes an assertion:
GGML_ASSERT: ggml-cuda.cu:1521: i01_high == ne01 || g_device_count > 1
Can I ask when this will be done and merged into main?
You can ask, but I can't tell you. Fundamentally, PRs should be approved by someone other than the PR developer. I suspect that there is a compiler bug, but I am not 100% certain, so I would like input from other devs. Surely you will survive for another day or two without multi-GPU support.
@JohannesGaessler I'm not using multiple GPUs though, I'm poor 😏
Then just use the last version before I did the GPU changes?
@JohannesGaessler I'm using the oobabooga UI with GPU offload, how can I revert the changes?
No idea how oobabooga works.
@JohannesGaessler well done with this. It's a lot of hard work comparing different OSes and versions to find that one little difference that finally reveals what the issue could be. It's easy to take open code for granted when it works, but people like you are really doing us a lot of good. Keep it up!
I've been reading the previous conversations, and all I can bring to the table is that I'm compiling llama.cpp on Windows and, as you identified, it works in Debug but not in Release.
Then just use the last version before I did the GPU changes?
If you can tell me how to tell llama-cpp-python to build for a specific commit of llama.cpp, I'd love to.
On Linux:
- git clone llama-cpp-python.
- Install llama-cpp-python as editable via `pip3 install -e .` in the llama-cpp-python git repository directory.
- Check out llama.cpp revision `master-44f906e` or this PR.
- Build a llama.cpp library via `make LLAMA_CUBLAS=1 libllama.so`.
- Put the `libllama.so` file in the `llama_cpp` subdirectory of the llama-cpp-python git repository.
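Putting those steps together as a single shell session may make the order clearer. The side-by-side directory layout below is an assumption for illustration; adjust paths and repository URLs to your own setup.

```bash
# Sketch of the steps above as one shell session (directory layout assumed).
git clone https://github.com/abetlen/llama-cpp-python
cd llama-cpp-python
pip3 install -e .                       # editable install of llama-cpp-python
cd ..

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
git checkout master-44f906e             # or check out this PR's branch instead
make LLAMA_CUBLAS=1 libllama.so         # build the CUDA-enabled shared library
cp libllama.so ../llama-cpp-python/llama_cpp/
```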
@ggerganov Can you please look over this PR? Even if you don't agree with my conclusions the changes are very simple and seem to fix the issue with garbled text that Windows users have. I would therefore prefer merging it sooner rather than later.
I don't understand the broader implications of the change, but it did fix the issue with previous models. I can confirm, though, that newer models like vicuna-13b-cot.ggmlv3.q4_K_S.bin do not work for me, assuming they are supposed to.
Edit: this is incorrect.
I found:
- Running q4_k_s on Linux using the master branch works as it should.
- Running q4_k_s on Windows using the master branch results in gibberish.
- Running q4_k_s on Windows using this PR results in a crash.
- Running q4_k_s on Windows using `5c64a0952ee58b2d742ee84e8e3d43cce5d366db` (the commit with which k-quants got merged) also results in a crash.
I said before that the problem with the master branch on Windows is that the matrix multiplications never actually get executed, so the output is just gibberish. It therefore looks to me like q4_k_s is for some reason crashing on Windows during the matrix multiplication, and that this is a separate bug that has nothing to do with my GPU changes. @ikawrakow are you aware of this issue?
@JohannesGaessler No, I'm not aware. I have tested the k-quants to work on x86_64 CPUs under Linux (scalar, AVX2), on the M2 under macOS (scalar, ARM_NEON), and on CUDA (RTX 4080 in a Linux box). Now there is a functioning implementation for M2 GPU on Metal.
I don't have access to a Windows box, so no way to try to figure out what the issue might be.
I have noticed in your PR description that you assume GGML_CUDA_DMMV_Y to be 1. The k-quants CUDA implementation does not use this, and instead always sets block_dims(32, 2, 1) with nrows/2 "rows".
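For illustration, here is a minimal standalone CUDA sketch of that launch geometry: 32x2x1 thread blocks with nrows/2 blocks in the grid. The kernel `row_index_demo` below is a dummy used only to show how such a layout maps blocks and threads to rows; it is not the actual k-quants kernel.

```cuda
// Dummy sketch of a 32x2x1 block layout with nrows/2 blocks; not k-quants code.
#include <cstdio>

__global__ void row_index_demo(int nrows) {
    // each block covers two rows: threadIdx.y selects the row within the block
    const int row = 2 * blockIdx.x + threadIdx.y;
    if (row < nrows && threadIdx.x == 0) {
        printf("block %d, thread.y %d -> row %d\n", blockIdx.x, threadIdx.y, row);
    }
}

int main() {
    const int nrows = 8;
    const dim3 block_dims(32, 2, 1); // as described: always 32 x 2 x 1
    row_index_demo<<<nrows / 2, block_dims>>>(nrows);
    cudaDeviceSynchronize();
    return 0;
}
```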
Also, when you say that q4_K_S results in gibberish or a crash on Windows, is this running on the CPU or the GPU?
It seems I was mistaken. I had made a small modification to llama.cpp for debugging and I forgot to revert that modification when testing q4_k_s. With the unmodified master branch the format works correctly. Sorry for the false alarm.