aphrodite-engine
Bad generation with GGUF and OpenAI API
Hi
I tried to generate some text using a Mixtral Instruct GGUF model, but the model only predicts nonsense. Something is wrong with either the tokenizer or the chat template. I tried to convert the model manually using this script, but I get the same behavior.
python -m aphrodite.endpoints.openai.api_server \
--model "mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf" \
--tokenizer "mistralai/Mixtral-8x7B-Instruct-v0.1" \
--quantization "gguf" \
--port 8001 \
--host 0.0.0.0 \
--dtype "half" \
--served-model-name mixtral \
--gpu-memory-utilization 0.9 \
--max-model-len 32768 \
--kv-cache-dtype auto \
--seed 123 \
--max-num-seqs 1 \
--enforce-eager
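For reference, the nonsense output shows up with an ordinary chat completion request against the server above. A minimal client sketch, assuming the server is reachable on localhost:8001 and uses the served model name "mixtral" (any OpenAI-compatible client behaves the same):

import requests

# Plain chat completion against the OpenAI-compatible endpoint started above.
resp = requests.post(
    "http://localhost:8001/v1/chat/completions",
    json={
        "model": "mixtral",  # matches --served-model-name
        "messages": [{"role": "user", "content": "Write one sentence about the sea."}],
        "max_tokens": 64,
        "temperature": 0.7,
    },
)
print(resp.json()["choices"][0]["message"]["content"])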
Edit: using the pip package (v0.5.0).
Edit 2: building from source leads to this error:
File "/home/user/.conda/envs/generation/lib/python3.10/site-packages/aphrodite/modeling/layers/vocab_parallel_embedding.py", line 123, in forward
output_parallel = self.linear_method.apply_embedding(
File "/home/user/.conda/envs/generation/lib/python3.10/site-packages/aphrodite/modeling/layers/quantization/gguf.py", line 152, in apply_embedding
dequant = ops.ggml_dequantize(quant, weight_type, hidden_size,
RuntimeError: Unknown layout
Can confirm this happens with Mixtral. Investigating.