aphrodite-engine
Bad generation with GGUF and OpenAI API
Hi
I tried to generate some text using a Mixtral Instruct GGUF model, but the model only predicts nonsense. Something is wrong with either the tokenizer or the chat template. I tried to convert the model manually using this script, but I get the same behavior.
python -m aphrodite.endpoints.openai.api_server \
--model "mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf" \
--tokenizer "mistralai/Mixtral-8x7B-Instruct-v0.1" \
--quantization "gguf" \
--port 8001 \
--host 0.0.0.0 \
--dtype "half" \
--served-model-name mixtral \
--gpu-memory-utilization 0.9 \
--max-model-len 32768 \
--kv-cache-dtype auto \
--seed 123 \
--max-num-seqs 1 \
--enforce-eager
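For reference, the nonsense output shows up with an ordinary chat completion request against the server above. A minimal client sketch, assuming the server is reachable on localhost:8001 and uses the served model name "mixtral" (any OpenAI-compatible client behaves the same):

import requests

# Plain chat completion against the OpenAI-compatible endpoint started above.
resp = requests.post(
    "http://localhost:8001/v1/chat/completions",
    json={
        "model": "mixtral",  # matches --served-model-name
        "messages": [{"role": "user", "content": "Write one sentence about the sea."}],
        "max_tokens": 64,
        "temperature": 0.7,
    },
)
print(resp.json()["choices"][0]["message"]["content"])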
Edit: using the pip package (v0.5.0).
Edit 2: building from source leads to this error:
File "/home/user/.conda/envs/generation/lib/python3.10/site-packages/aphrodite/modeling/layers/vocab_parallel_embedding.py", line 123, in forward
output_parallel = self.linear_method.apply_embedding(
File "/home/user/.conda/envs/generation/lib/python3.10/site-packages/aphrodite/modeling/layers/quantization/gguf.py", line 152, in apply_embedding
dequant = ops.ggml_dequantize(quant, weight_type, hidden_size,
RuntimeError: Unknown layout
Can confirm this happens with Mixtral. Investigating.