llama.cpp
Bug: Qwen2-72B-Instruct (and finetunes) Q4_K_M, Q5_K_M generates random output with cuBLAS prompt processing
What happened?
Qwen2-72B-Instruct Q4_K_M generates output made of random tokens (numbers, special symbols, random chunks of words from different languages, etc.).
Has been tested on:
- Tesla P40 24 GB with partial offload (half of the layers on the GPU, the rest on the CPU)
- Inference fully in RAM (on a different PC from the first)
Other people report that Q6 works, so the problem may be specific to Q4_K_M (I can't test Q6 myself).
I've tried FlashAttention on and off and MMQ on and off; none of the combinations help.
I tested with the llama.cpp binaries, koboldcpp, and text-generation-webui; the bug reproduces in all of them. A reproduction sketch is given below.
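For reference, a command along these lines should reproduce the problem (the model filename, prompt, and -ngl value are illustrative, not the exact command I ran; -ngl sets how many layers are offloaded to the GPU and -fa toggles FlashAttention; in builds older than the main-to-llama-cli rename the binary is called main):

  llama-cli -m Qwen2-72B-Instruct-Q4_K_M.gguf -ngl 40 -fa -p "Write a short poem about the sea."

Running the same command with -ngl 0 (or a CPU-only build) corresponds to the second test case above and also produces garbage output.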
Related: https://github.com/LostRuins/koboldcpp/issues/909
Name and Version
version: 3181 (37bef894) built with MSVC 19.29.30154.0 for x64
What operating system are you seeing the problem on?
Windows
Relevant log output
No response