Johannes Gäßler


If I remember correctly, the output

```
 #▅ $#" "! $ !!!" " $# ""
```

is effectively what you get when `NO_DEVICE_CODE` isn't triggered correctly. My intuition...

Also: with SXM your V100s are effectively NVLinked, right? Can you check results when compiling with `LLAMA_CUDA_PEER_MAX_BATCH_SIZE=0` and `LLAMA_CUDA_PEER_MAX_BATCH_SIZE=999999`?

When you check the NVCC version, your shell prefix is `(llama_cpp_py39)`; when you actually run the model, the prefix is `(pytorch_py39_cu11.8)`. Are you sure that in both cases CUDA 12 is...
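
A quick way to confirm which CUDA each environment actually resolves to (a sketch; the environment names are taken from your shell prompts above):

```bash
conda activate llama_cpp_py39
which nvcc && nvcc --version                          # toolkit used to compile llama.cpp
conda activate pytorch_py39_cu11.8
python -c "import torch; print(torch.version.cuda)"   # CUDA version PyTorch was built against
nvidia-smi                                            # driver-side CUDA version
```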

Also, I didn't mean to compile with both `LLAMA_CUDA_PEER_MAX_BATCH_SIZE=0` and `LLAMA_CUDA_PEER_MAX_BATCH_SIZE=999999` at the same time. I meant to test either option individually. But if you still get incorrect results with...
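
For example, as two separate builds (a sketch assuming the Makefile build, with `LLAMA_CUBLAS=1` as the CUDA switch of that era):

```bash
# Build 1: peer access effectively disabled
make clean && LLAMA_CUBLAS=1 LLAMA_CUDA_PEER_MAX_BATCH_SIZE=0 make
# run the test, then:
# Build 2: peer access enabled for (practically) all batch sizes
make clean && LLAMA_CUBLAS=1 LLAMA_CUDA_PEER_MAX_BATCH_SIZE=999999 make
```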

According to the HuggingFace repository, the model was made with llama.cpp revision `629f917`. Do you get correct results with that revision?
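
Something along these lines (a sketch; the model path and prompt are placeholders, adjust the build flags to your setup):

```bash
git checkout 629f917
make clean && LLAMA_CUBLAS=1 make
./main -m models/your-model-q4_0.gguf -p "test prompt"   # compare against master
```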

Since this PR was merged, the server has been producing nondeterministic results when using >1 slots. Minimal example for reproduction:

```bash
make clean && make server
./server -m models/opt/llama_2-7b-q4_0.gguf...
```
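
One way to make the nondeterminism visible is to fire identical greedy requests at parallel slots and diff the outputs (a sketch; the model path and prompt are placeholders, `-np 2` enables two slots):

```bash
./server -m models/opt/llama_2-7b-q4_0.gguf -np 2 &
sleep 10   # wait for the model to load
for i in 1 2; do
  curl -s http://localhost:8080/completion \
    --header "Content-Type: application/json" \
    --data '{"prompt": "I believe the meaning of life is", "n_predict": 32, "temperature": 0}' \
    > out_$i.json &
done
wait
diff out_1.json out_2.json   # identical settings, yet with >1 slots these can differ
```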

https://github.com/ggerganov/llama.cpp/tree/master/examples/perplexity#perplexity

>Once you have the file, supply perplexity with the quantized model, the logits file via --kl-divergence-base, and finally the --kl-divergence argument to indicate that the program should calculate the...
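
In practice this is a two-pass workflow (a sketch; the model and dataset paths are placeholders, check the linked README for the exact flags in your revision):

```bash
# Pass 1: save the logits of the base (e.g. FP16) model
./perplexity -m models/llama-2-7b-f16.gguf -f wikitext-2-raw/wiki.test.raw \
    --kl-divergence-base logits_f16.bin
# Pass 2: evaluate the quantized model against those logits
./perplexity -m models/llama-2-7b-q4_0.gguf \
    --kl-divergence-base logits_f16.bin --kl-divergence
```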

If you want to dig into this more, look at the [GCC compiler flags](https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html#index-Ofast) enabled by `-Ofast` and try to isolate which one is causing issues.
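
One way to bisect (a sketch; `repro.c` stands in for whatever minimal reproducer you have): start from plain `-O3` and add the `-Ofast` extras one at a time.

```bash
# -Ofast is roughly -O3 plus -ffast-math (and -fallow-store-data-races on newer GCC);
# these are some of the sub-flags that -ffast-math turns on
for flag in -ffinite-math-only -funsafe-math-optimizations -fno-math-errno -fno-trapping-math; do
  gcc -O3 $flag repro.c -o repro -lm && ./repro && echo "still OK with $flag"
done
```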

What could be happening is that the exponential function in SiLU, instead of flushing small values to 0, returns NaN or some other garbage. I've essentially had this same issue...
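
For reference, the failure mode I mean (my sketch, not from the original discussion):

$$\mathrm{SiLU}(x) = \frac{x}{1 + e^{-x}}$$

For large positive $x$, $e^{-x}$ should underflow to $0$ so that $\mathrm{SiLU}(x) \approx x$. If a fast-math `exp` instead returns NaN or garbage for such inputs, the NaN propagates through the division and poisons the entire tensor.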

>If you want us to stop using infinity and start employing ugly workarounds instead, it'd help if you could communicate exactly what we stand to gain.

In my particular case...