
Load Mixtral 8x7b AWQ model failed

thiner opened this issue on Dec 24, 2023 · 26 comments

I am using the latest vLLM Docker image, trying to run the Mixtral 8x7B model quantized in AWQ format. I get the error message below:

INFO 12-24 09:22:55 llm_engine.py:73] Initializing an LLM engine with config: model='/models/openbuddy-mixtral-8x7b-v15.2-AWQ', tokenizer='/models/openbuddy-mixtral-8x7b-v15.2-AWQ', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=2, quantization=awq, enforce_eager=False, seed=0)
(RayWorkerVllm pid=2491) /usr/local/lib/python3.10/dist-packages/torch/nn/init.py:412: UserWarning: Initializing zero-element tensors is a no-op
(RayWorkerVllm pid=2491)   warnings.warn("Initializing zero-element tensors is a no-op")
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/workspace/vllm/entrypoints/openai/api_server.py", line 729, in <module>
    engine = AsyncLLMEngine.from_engine_args(engine_args)
  File "/workspace/vllm/engine/async_llm_engine.py", line 496, in from_engine_args
    engine = cls(parallel_config.worker_use_ray,
  File "/workspace/vllm/engine/async_llm_engine.py", line 269, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/workspace/vllm/engine/async_llm_engine.py", line 314, in _init_engine
    return engine_class(*args, **kwargs)
  File "/workspace/vllm/engine/llm_engine.py", line 108, in __init__
    self._init_workers_ray(placement_group)
  File "/workspace/vllm/engine/llm_engine.py", line 195, in _init_workers_ray
    self._run_workers(
  File "/workspace/vllm/engine/llm_engine.py", line 755, in _run_workers
    self._run_workers_in_batch(workers, method, *args, **kwargs))
  File "/workspace/vllm/engine/llm_engine.py", line 732, in _run_workers_in_batch
    all_outputs = ray.get(all_outputs)
  File "/usr/local/lib/python3.10/dist-packages/ray/_private/auto_init_hook.py", line 24, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 2563, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(KeyError): ray::RayWorkerVllm.execute_method() (pid=2492, ip=172.17.0.2, actor_id=ccdc00b5ccaf06b948a44c5301000000, repr=<vllm.engine.ray_utils.RayWorkerVllm object at 0x7f3cba935990>)
  File "/workspace/vllm/engine/ray_utils.py", line 31, in execute_method
    return executor(*args, **kwargs)
  File "/workspace/vllm/worker/worker.py", line 79, in load_model
    self.model_runner.load_model()
  File "/workspace/vllm/worker/model_runner.py", line 57, in load_model
    self.model = get_model(self.model_config)
  File "/workspace/vllm/model_executor/model_loader.py", line 72, in get_model
    model.load_weights(model_config.model, model_config.download_dir,
  File "/workspace/vllm/model_executor/models/mixtral.py", line 430, in load_weights
    param = params_dict[name]
KeyError: 'model.layers.26.block_sparse_moe.experts.0.w2.qweight'
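
For anyone debugging this, a minimal sketch for listing which expert tensors the AWQ checkpoint actually contains, so they can be compared against the parameter name in the KeyError. It assumes the checkpoint is sharded as safetensors files; the path is the one from the log above.

from pathlib import Path
from safetensors import safe_open

checkpoint_dir = Path("/models/openbuddy-mixtral-8x7b-v15.2-AWQ")  # path from the log above
for shard in sorted(checkpoint_dir.glob("*.safetensors")):
    with safe_open(str(shard), framework="pt") as f:
        # Print the expert weight names stored in each shard; compare these
        # against 'model.layers.26.block_sparse_moe.experts.0.w2.qweight'.
        for name in f.keys():
            if "block_sparse_moe.experts" in name:
                print(shard.name, name)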

thiner avatar Dec 24 '23 09:12 thiner

I have an example script that works with Mixtral:

https://github.com/casper-hansen/AutoAWQ/blob/main/examples/basic_vllm.py

casper-hansen avatar Dec 24 '23 22:12 casper-hansen

I have an example script that works with Mixtral:

https://github.com/casper-hansen/AutoAWQ/blob/main/examples/basic_vllm.py

Checking it right now with https://github.com/casper-hansen/AutoAWQ/blob/main/examples/mixtral_quant.py. I hope this is the configuration you used for your model on Hugging Face: https://huggingface.co/casperhansen/mixtral-instruct-awq

Not working, same error:

[WARN ] PyProcess - W-181-model-stderr: KeyError: 'model.layers.0.block_sparse_moe.experts.0.w1.qweight'

Maybe share your requirements.txt?

orellavie1212 avatar Dec 25 '23 07:12 orellavie1212

I thought the solution for the general Mixtral model (not quantized with GPTQ or AWQ, just the regular one) was to load it via .pt: https://huggingface.co/IbuNai/Mixtral-8x7B-v0.1-gptq-4bit-pth/tree/main. Even with that, which is .bin (I did not find a .pt on HF), I get the same problem.

orellavie1212 avatar Dec 25 '23 10:12 orellavie1212

I just used the following Docker image and ran pip install vllm

runpod/pytorch:2.1.1-py3.10-cuda12.1.1-devel-ubuntu22.04

casper-hansen avatar Dec 25 '23 10:12 casper-hansen

I just used the following Docker image and ran pip install vllm

runpod/pytorch:2.1.1-py3.10-cuda12.1.1-devel-ubuntu22.04

I am using the DJL container v25 with the same setup (py3.10, torch 2.1.1, cuda 12.1)

orellavie1212 avatar Dec 25 '23 10:12 orellavie1212

Could you try the Docker image I referenced to see if it's an environment issue?

casper-hansen avatar Dec 25 '23 11:12 casper-hansen

Could you try the Docker image I referenced to see if it's an environment issue?

tp=1 works, but tp=2 errors; I found different named_parameters in the two RayWorkers.
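
One way to verify this, as a sketch only: patch vllm/model_executor/models/mixtral.py locally and dump the keys of params_dict inside load_weights(), right before the failing lookup, then diff the files written by the two ranks. The dump path is arbitrary and the snippet assumes params_dict is in scope at that point (as the traceback above suggests).

# Debug snippet to place inside Mixtral's load_weights(), before `param = params_dict[name]`.
import torch.distributed as dist

rank = dist.get_rank() if dist.is_initialized() else 0
expert_keys = sorted(k for k in params_dict if "block_sparse_moe.experts" in k)
with open(f"/tmp/mixtral_params_rank{rank}.txt", "w") as f:
    # Write one parameter name per line so the rank files can be diffed.
    f.write("\n".join(expert_keys))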

zsplus avatar Dec 27 '23 10:12 zsplus

Could you try the Docker image I referenced to see if it's an environment issue?

tp=1 works, but tp=2 errors; I found different named_parameters in the two RayWorkers.

I am using tp=4 too, and it is failing as well.

orellavie1212 avatar Dec 27 '23 12:12 orellavie1212

Not sure if this relates to #2203. Does it work in FP16 with TP > 1?

casper-hansen avatar Dec 27 '23 12:12 casper-hansen

Not sure if this relates to #2203. Does it work in FP16 with TP > 1?

I also tried fp16 in addition to auto.

orellavie1212 avatar Dec 27 '23 12:12 orellavie1212

I have the same problem when TP = 2.

kk3dmax avatar Dec 27 '23 17:12 kk3dmax

Tagging @WoosukKwon @zhuohan123 for visibility. Seems Mixtral has issues with TP > 1 when using AWQ.

casper-hansen avatar Dec 28 '23 17:12 casper-hansen

I'm also having this issue after a fresh quantization of Mixtral 8x7b instruct. There is no issue when running directly with AutoAWQ across multiple GPUs. Only when using vLLM across multiple GPUs does the error occur.

Example failing vLLM code

from vllm import LLM

llm = LLM("mistralai_Mixtral-8x7B-Instruct-v0.1-awq", quantization="AWQ", tensor_parallel_size=4)
outputs = llm.generate("Hello my name is")

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

Example working AutoAWQ code

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer, TextStreamer

quant_path = "mistralai_Mixtral-8x7B-Instruct-v0.1-awq"
model = AutoAWQForCausalLM.from_quantized(quant_path, fuse_layers=True)

tokenizer = AutoTokenizer.from_pretrained(quant_path, trust_remote_code=True)
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

text = "Hello my name is"
tokens = tokenizer(text, return_tensors="pt").input_ids.cuda()
generation_output = model.generate(
    tokens,
    streamer=streamer,
    max_new_tokens=512
)

Quantization code

import torch
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = 'mistralai_Mixtral-8x7B-Instruct-v0.1'
quant_path = 'mistralai_Mixtral-8x7B-Instruct-v0.1-awq'
modules_to_not_convert = ["gate"]
quant_config = {
    "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM",
    "modules_to_not_convert": modules_to_not_convert
}

# Load model
# NOTE: pass safetensors=True to load safetensors
model = AutoAWQForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.float16, safetensors=True, device_map="cpu", **{"low_cpu_mem_usage": True, "use_cache": False}
)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Quantize
model.quantize(
    tokenizer,
    quant_config=quant_config,
    modules_to_not_convert=modules_to_not_convert
)

# Save quantized model
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

print(f'Model is quantized and saved at "{quant_path}"')

iibw avatar Jan 03 '24 04:01 iibw

I'm also having this issue with the AWQ model

MileSquareDevelopers avatar Jan 03 '24 15:01 MileSquareDevelopers

+1 with TP=2

floleuerer avatar Jan 03 '24 23:01 floleuerer

Woosuk said it should be fixed in the new 0.2.7 by PR #2208. Could someone verify with an AWQ model?

Reference: https://github.com/vllm-project/vllm/issues/2332#issuecomment-18761736055

casper-hansen avatar Jan 04 '24 11:01 casper-hansen

I don't have AWQ model currently, but tested with GPTQ model, and it's working fine now!

thiner avatar Jan 04 '24 16:01 thiner

@casper-hansen and @thiner I can confirm the Mixtral models load in both AWQ and GPTQ

MileSquareDevelopers avatar Jan 04 '24 18:01 MileSquareDevelopers

Actually, now the model loads, but I can't get any tokens processed.

When I do llm.generate(prompts), it just hangs.

MileSquareDevelopers avatar Jan 04 '24 19:01 MileSquareDevelopers

I was able to get both GPTQ and AWQ working with tp=4. It took a long time to load the model in my case, but eventually it loaded, and then generation happened instantly.

@MileSquareDevelopers maybe you need to wait a bit longer for it to load? If you put a print statement between the code that loads the LLM and the llm.generate code, you'll probably see it is never printed and the code never reaches llm.generate.
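
Something like this sketch makes it obvious whether the hang is in loading or in generation (the model path and tensor_parallel_size are taken from my earlier example; adjust for your setup):

from vllm import LLM, SamplingParams

print("loading model ...")
llm = LLM(
    "mistralai_Mixtral-8x7B-Instruct-v0.1-awq",  # local AWQ checkpoint from the example above
    quantization="AWQ",
    tensor_parallel_size=4,
)
print("model loaded, generating ...")  # if this never prints, loading is the bottleneck
outputs = llm.generate("Hello my name is", SamplingParams(max_tokens=64))
print("generation finished")
for output in outputs:
    print(output.outputs[0].text)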

iibw avatar Jan 04 '24 20:01 iibw

For me, both AWQ and GPTQ load, but AWQ just produces zero tokens as output.

Commands used:

# run on 2x 4090GTX
CUDA_VISIBLE_DEVICES=6,7 python -m vllm.entrypoints.openai.api_server --model="TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ"  --quantization gptq  --max-model-len 16384 --gpu-memory-utilization 0.98 --enforce-eager --dtype half -tp 2

# works on 2x 4090GTX
CUDA_VISIBLE_DEVICES=6,7 python -m vllm.entrypoints.openai.api_server --model="TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ"  --quantization awq  --max-model-len 8192 --gpu-memory-utilization 0.98 --enforce-eager --dtype auto -tp 2

Output from AWQ:

outputs=[CompletionOutput(index=0, text='', token_ids=[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], cumulative_logprob=nan, logprobs=None, finish_reason=length)], finished=True)

Whereas GPTQ gives me useful output.

joennlae avatar Jan 04 '24 21:01 joennlae

I used my own AWQ quantization. Try quantizing it yourself and maybe that will fix the problem.

iibw avatar Jan 04 '24 21:01 iibw

For me, both AWQ and GPTQ load, but AWQ just produces zero tokens as output.

Commands used:

# run on 2x 4090GTX
CUDA_VISIBLE_DEVICES=6,7 python -m vllm.entrypoints.openai.api_server --model="TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ"  --quantization gptq  --max-model-len 16384 --gpu-memory-utilization 0.98 --enforce-eager --dtype half -tp 2

# works on 2x 4090GTX
CUDA_VISIBLE_DEVICES=6,7 python -m vllm.entrypoints.openai.api_server --model="TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ"  --quantization awq  --max-model-len 8192 --gpu-memory-utilization 0.98 --enforce-eager --dtype auto -tp 2

Output from AWQ:

outputs=[CompletionOutput(index=0, text='', token_ids=[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], cumulative_logprob=nan, logprobs=None, finish_reason=length)], finished=True)

Whereas GPTQ gives me useful output.

Can you try with my vLLM offline example?

https://github.com/casper-hansen/AutoAWQ/blob/main/examples/basic_vllm.py

casper-hansen avatar Jan 04 '24 21:01 casper-hansen

@casper-hansen I can confirm that works, for me at least.

kniteli avatar Jan 05 '24 02:01 kniteli

For me, both AWQ and GPTQ load, but AWQ just produces zero tokens as output. Commands used:

# run on 2x 4090GTX
CUDA_VISIBLE_DEVICES=6,7 python -m vllm.entrypoints.openai.api_server --model="TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ"  --quantization gptq  --max-model-len 16384 --gpu-memory-utilization 0.98 --enforce-eager --dtype half -tp 2

# works on 2x 4090GTX
CUDA_VISIBLE_DEVICES=6,7 python -m vllm.entrypoints.openai.api_server --model="TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ"  --quantization awq  --max-model-len 8192 --gpu-memory-utilization 0.98 --enforce-eager --dtype auto -tp 2

Output from AWQ:

outputs=[CompletionOutput(index=0, text='', token_ids=[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], cumulative_logprob=nan, logprobs=None, finish_reason=length)], finished=True)

Whereas GPTQ gives me useful output.

Can you try with my vLLM offline example?

https://github.com/casper-hansen/AutoAWQ/blob/main/examples/basic_vllm.py

I tested your model. It works 😄 Thank you very much.

I do believe you have an issue with the tokenizer. If the model generates a number and it includes the digit 2 somewhere (the same id as the eos_token), it finishes the generation right then and there. The same issue occurs when I use the default Mixtral tokenizer...

python -m vllm.entrypoints.openai.api_server --model="casperhansen/mixtral-instruct-awq"  --quantization awq --max-model-len 8192 --gpu-memory-utilization 0.98 --enforce-eager --dtype auto -tp 2 --tokenizer "mistralai/Mixtral-8x7B-Instruct-v0.1"

Regarding the tokenizer, I did some digging:

Here: https://github.com/vllm-project/vllm/blob/937e7b7d7c460c00805ac358a4873ec0653ab2f5/vllm/engine/llm_engine.py#L764

        # Check if the sequence has generated the EOS token.
        if ((not sampling_params.ignore_eos)
                and seq.get_last_token_id() == self.tokenizer.eos_token_id):
            seq.status = SequenceStatus.FINISHED_STOPPED
            return
With your tokenizer, seq.get_last_token_id() returns 2 for the digit 2, which equals the eos_token_id, whereas the mistralai/Mixtral-8x7B-Instruct-v0.1 tokenizer encodes it as 28750.
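
A quick way to reproduce the comparison outside vLLM, just as a sketch; the token ids in the comments are the ones reported above, not re-verified here:

from transformers import AutoTokenizer

awq_tok = AutoTokenizer.from_pretrained("casperhansen/mixtral-instruct-awq")
ref_tok = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-Instruct-v0.1")

print(awq_tok.eos_token_id, ref_tok.eos_token_id)      # Mixtral's EOS id is 2
print(awq_tok.encode("2", add_special_tokens=False))   # reportedly [2], colliding with EOS
print(ref_tok.encode("2", add_special_tokens=False))   # reportedly 28750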

joennlae avatar Jan 05 '24 12:01 joennlae

I'm running into a similar issue with the latest stable release on 2x 4090s.

python -m vllm.entrypoints.api_server --model TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ \
 --dtype auto --tokenizer mistralai/Mixtral-8x7B-Instruct-v0.1 \
 --quantization awq --trust-remote-code \
 --tensor-parallel-size 2 --gpu-memory-utilization 0.98 --enforce-eager

The server never fully loads; it just hangs on:

WARNING 01-05 17:01:33 config.py:175] awq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
2024-01-05 17:01:34,848	INFO worker.py:1724 -- Started a local Ray instance.
INFO 01-05 17:01:36 llm_engine.py:70] Initializing an LLM engine with config: model='TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ', tokenizer='mistralai/Mixtral-8x7B-Instruct-v0.1', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=32768, load_format=auto, tensor_parallel_size=2, quantization=awq, enforce_eager=True, seed=0)

eschmidbauer avatar Jan 05 '24 17:01 eschmidbauer

Hi @joennlae, thanks a lot for noticing this issue with the tokenizer and the prediction of the number 2. I am facing this issue too. Did you manage to find a fix?

thomasfloqs avatar Mar 18 '24 10:03 thomasfloqs

Hi @joennlae, thanks a lot for noticing this issue with the tokenizer and the prediction of the number 2. I am facing this issue too. Did you manage to find a fix?

It is not an issue with the tokenizer. I saw that there is a high chance of Mixtral generating an end token, especially when dates/numbers are involved. I did some investigation, but then stopped.

Some of the results from back then can be found here: https://github.com/joennlae/vllm/blob/019ee402923d43cb225afaf356d559556d615aef/write_up.md

Also, I was not able to reproduce this issue with TGI.
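
One way to probe this in vLLM, as a sketch only (not from the write-up; the model and prompt are just examples): generate with ignore_eos=True and logprobs enabled, and watch how strongly the EOS token competes around dates and numbers.

from vllm import LLM, SamplingParams

llm = LLM("casperhansen/mixtral-instruct-awq", quantization="awq", tensor_parallel_size=2)

# ignore_eos keeps decoding past the EOS token so the full stream stays visible;
# logprobs exposes the top candidates at every step.
params = SamplingParams(max_tokens=32, ignore_eos=True, logprobs=5)
out = llm.generate("The year 2023 was followed by", params)[0].outputs[0]
print(out.token_ids)
print(out.logprobs)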

joennlae avatar Mar 18 '24 11:03 joennlae