
Bug: Qwen2-72B-Instruct (and finetunes) Q4_K_M, Q5_K_M generates random output with CuBLAS prompt processing

Open anunknowperson opened this issue 8 months ago • 25 comments

What happened?

Qwen2-72B-Instruct Q4_K_M generates output made of random tokens (numbers, special symbols, random chunks of words from different languages, etc.).

Has been tested on:

  1. Tesla P40 24 GB + CPU split, offloading half of the layers to the GPU
  2. Inference fully in RAM (on a different PC from configuration 1)
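For reference, configuration 1 above roughly corresponds to an invocation like the following. This is a sketch, not the exact command used: the model filename and the `-ngl 40` layer count are assumptions (Qwen2-72B has 80 transformer layers, so `-ngl 40` offloads about half), and the binary name and flags reflect llama.cpp builds of that era.

```shell
# Hypothetical repro command for llama.cpp (build 3181 era).
# -m   : path to the quantized model file (assumed filename)
# -ngl : number of layers to offload to the GPU (~half of 80)
# -fa  : enable FlashAttention (the bug reproduces with and without it)
./llama-cli -m Qwen2-72B-Instruct-Q4_K_M.gguf -ngl 40 -fa -p "Hello"
```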

Other people report that it works with Q6; the problem may be specific to Q4_K_M (I can't test Q6 myself).

I've tried FlashAttention both on and off, and MMQ both on and off; none of the combinations work.

I tested with the llama.cpp binaries, koboldcpp, and text-generation-webui; it fails in all of them.

Related: https://github.com/LostRuins/koboldcpp/issues/909

[Screenshot of the garbled model output]

Name and Version

version: 3181 (37bef894) built with MSVC 19.29.30154.0 for x64

What operating system are you seeing the problem on?

Windows

Relevant log output

No response

anunknowperson (Jun 20 '24 03:06)