Johannes Gäßler


If I remember correctly, the output

```
 #▅ $#" "! $ !!!" " $# ""
```

is effectively what you get when `NO_DEVICE_CODE` isn't triggered correctly. My intuition...

Also: with SXM your V100s are effectively NVLinked, right? Can you check results when compiling with `LLAMA_CUDA_PEER_MAX_BATCH_SIZE=0` and `LLAMA_CUDA_PEER_MAX_BATCH_SIZE=999999`?

When you check the NVCC version, your shell prefix is `(llama_cpp_py39)`; when you actually run the model, the prefix is `(pytorch_py39_cu11.8)`. Are you sure that in both cases CUDA 12 is...
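
A quick way to confirm which CUDA each environment actually resolves to (a sketch; the environment names are taken from your shell prompts above):

```bash
conda activate llama_cpp_py39
which nvcc && nvcc --version                          # toolkit used to compile llama.cpp
conda activate pytorch_py39_cu11.8
python -c "import torch; print(torch.version.cuda)"   # CUDA version PyTorch was built against
nvidia-smi                                            # driver-side CUDA version
```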

Also, I didn't mean to compile with both `LLAMA_CUDA_PEER_MAX_BATCH_SIZE=0` and `LLAMA_CUDA_PEER_MAX_BATCH_SIZE=999999` at the same time. I meant to test either option individually. But if you still get incorrect results with...
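
For example, as two separate builds (a sketch assuming the Makefile build, with `LLAMA_CUBLAS=1` as the CUDA switch of that era):

```bash
# Build 1: peer access effectively disabled
make clean && LLAMA_CUBLAS=1 LLAMA_CUDA_PEER_MAX_BATCH_SIZE=0 make
# run the test, then:
# Build 2: peer access enabled for (practically) all batch sizes
make clean && LLAMA_CUBLAS=1 LLAMA_CUDA_PEER_MAX_BATCH_SIZE=999999 make
```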

According to the HuggingFace repository, the model was made with llama.cpp revision `629f917`. Do you get correct results with that revision?
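
Something along these lines (a sketch; the model path and prompt are placeholders, adjust the build flags to your setup):

```bash
git checkout 629f917
make clean && LLAMA_CUBLAS=1 make
./main -m models/your-model-q4_0.gguf -p "test prompt"   # compare against master
```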

Since this PR was merged, the server has been producing nondeterministic results when using >1 slots. Minimal example for reproduction:

```bash
make clean && make server
./server -m models/opt/llama_2-7b-q4_0.gguf...
```
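
One way to make the nondeterminism visible is to fire identical greedy requests at parallel slots and diff the outputs (a sketch; the model path and prompt are placeholders, `-np 2` enables two slots):

```bash
./server -m models/opt/llama_2-7b-q4_0.gguf -np 2 &
sleep 10   # wait for the model to load
for i in 1 2; do
  curl -s http://localhost:8080/completion \
    --header "Content-Type: application/json" \
    --data '{"prompt": "I believe the meaning of life is", "n_predict": 32, "temperature": 0}' \
    > out_$i.json &
done
wait
diff out_1.json out_2.json   # identical settings, yet with >1 slots these can differ
```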

https://github.com/ggerganov/llama.cpp/tree/master/examples/perplexity#perplexity

>Once you have the file, supply perplexity with the quantized model, the logits file via --kl-divergence-base, and finally the --kl-divergence argument to indicate that the program should calculate the...
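
In practice this is a two-pass workflow (a sketch; the model and dataset paths are placeholders, check the linked README for the exact flags in your revision):

```bash
# Pass 1: save the logits of the base (e.g. FP16) model
./perplexity -m models/llama-2-7b-f16.gguf -f wikitext-2-raw/wiki.test.raw \
    --kl-divergence-base logits_f16.bin
# Pass 2: evaluate the quantized model against those logits
./perplexity -m models/llama-2-7b-q4_0.gguf \
    --kl-divergence-base logits_f16.bin --kl-divergence
```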

If you want to dig into this more, look at the [GCC compiler flags](https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html#index-Ofast) enabled by `-Ofast` and try to isolate which one is causing issues.
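
One way to bisect (a sketch; `repro.c` stands in for whatever minimal reproducer you have): start from plain `-O3` and add the `-Ofast` extras one at a time.

```bash
# -Ofast is roughly -O3 plus -ffast-math (and -fallow-store-data-races on newer GCC);
# these are some of the sub-flags that -ffast-math turns on
for flag in -ffinite-math-only -funsafe-math-optimizations -fno-math-errno -fno-trapping-math; do
  gcc -O3 $flag repro.c -o repro -lm && ./repro && echo "still OK with $flag"
done
```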

What could be happening is that the exponential function in SiLU, instead of flushing small values to 0, returns NaN or some other garbage. I've essentially had this same issue...
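
For reference, the failure mode I mean (my sketch, not from the original discussion):

$$\mathrm{SiLU}(x) = \frac{x}{1 + e^{-x}}$$

For large positive $x$, $e^{-x}$ should underflow to $0$ so that $\mathrm{SiLU}(x) \approx x$. If a fast-math `exp` instead returns NaN or garbage for such inputs, the NaN propagates through the division and poisons the entire tensor.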

>If you want us to stop using infinity and start employing ugly workarounds instead, it'd help if you could communicate exactly what we stand to gain.

In my particular case...