
[BUG] Qwen3-Omni-30B-A3B saved quantized model can't be loaded by vllm

Open allerou4 opened this issue 1 month ago • 3 comments

Describe the bug

1. ValueError: There is no module or parameter named 'audio_tower.positional_embedding.positional_embedding' in Qwen3OmniMoeThinkerForConditionalGeneration
(EngineCore_DP0 pid=18809) Process EngineCore_DP0:

I can work around this by removing audio_tower.positional_embedding.positional_embedding from the safetensors.
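For reference, a minimal sketch of that workaround (the shard filename is a placeholder, and model.safetensors.index.json's weight_map would also need the key removed if the checkpoint is sharded):

from safetensors.torch import load_file, save_file

# Strip the stray non-persistent buffer that vllm refuses to map. Replace the
# shard name with whichever file actually contains the offending key.
shard = "model-00001-of-00007.safetensors"
tensors = load_file(shard)
tensors.pop("audio_tower.positional_embedding.positional_embedding", None)
save_file(tensors, shard, metadata={"format": "pt"})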

2. vllm/vllm/model_executor/models/qwen3_moe.py", line 574, in load_weights
[rank0]: param = params_dict[name_mapped]
[rank0]: KeyError: 'layers.0.mlp.experts.w2_weight'

This happens because quantized and unquantized experts end up in the same layer, but vllm doesn't support mixed precision within a MoE layer. See:

layers.0.mlp.experts.102.down_proj.weight
layers.0.mlp.experts.1.down_proj.qweight
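A quick way to confirm which layers mix precisions (a sketch; the shard filename and key layout are assumptions based on the names above):

import re
from collections import defaultdict
from safetensors import safe_open

kinds_per_layer = defaultdict(set)
with safe_open("model.safetensors", framework="pt") as f:  # placeholder path
    for key in f.keys():
        # match e.g. ...layers.2.mlp.experts.99.down_proj.qweight / .weight
        m = re.search(r"layers\.(\d+)\.mlp\.experts\.\d+\.\w+_proj\.(qweight|weight)$", key)
        if m:
            kinds_per_layer[int(m.group(1))].add(m.group(2))

for layer, kinds in sorted(kinds_per_layer.items()):
    if kinds == {"weight", "qweight"}:
        print(f"layer {layer}: mixed quantized/unquantized experts")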

3. vllm/vllm/model_executor/models/qwen3_moe.py", line 624, in load_weights
[rank0]: param = params_dict[name]
[rank0]:         ~~~~~~~~~~~^^^^^^
[rank0]: KeyError: 'layers.0.self_attn.qkqkv_proj.g_idx'

name: layers.0.self_attn.k_proj.g_idx
name replace: layers.0.self_attn.qkv_proj.g_idx (line 539, continue)
name replace: layers.0.self_attn.qkqkv_proj.g_idx (line 539, continue)
last name: layers.0.self_attn.qkqkv_proj.g_idx
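One plausible mechanism for the double rewrite (a guess from the trace above, not verified against vllm's source): after k_proj is substituted with qkv_proj, the new name still contains the substring v_proj, so a later v_proj rule rewrites it again.

name = "layers.0.self_attn.k_proj.g_idx"
name = name.replace("k_proj", "qkv_proj")  # -> layers.0.self_attn.qkv_proj.g_idx
name = name.replace("v_proj", "qkv_proj")  # -> layers.0.self_attn.qkqkv_proj.g_idx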

I haven't verified exactly where in vllm this happens; more on this one later. In conclusion, the saved quantized model's keys and parameters don't match the latest vllm, but I found that this one works: https://huggingface.co/cpatonn/Qwen3-Omni-30B-A3B-Instruct-AWQ-4bit

GPU Info

Show output of nvidia-smi:

| 0 NVIDIA RTX A3000 Laptop GPU On | 00000000:01:00.0 On | N/A |
| N/A 55C P8 19W / 80W | 5902MiB / 6144MiB | 0% Default |
| | | N/A |

Software Info

OS: Ubuntu 22.04 (WSL)
Python: 3.10

Name: GPTQModel Version: 5.4.4

Name: torch Version: 2.9.1+cu126

Name: transformers Version: 4.57.1

Name: accelerate Version: 1.10.1

Name: triton Version: 3.5.1

Name: vllm Version: 0.11.2

If you are reporting an inference bug of a post-quantized model, please post the content of config.json and quantize_config.json.

config.json: { "architectures": [ "Qwen3OmniMoeForConditionalGeneration" ], "assistant_token_id": 77091, "code2wav_config": { "attention_bias": false, "attention_dropout": 0.0, "codebook_dim": 512, "codebook_size": 2048, "decoder_dim": 1536, "dtype": "bfloat16", "hidden_act": "silu", "hidden_size": 1024, "intermediate_size": 3072, "layer_scale_initial_scale": 0.01, "max_position_embeddings": 8000, "model_type": "", "num_attention_heads": 16, "num_hidden_layers": 1, "num_key_value_heads": 16, "num_quantizers": 16, "num_semantic_quantizers": 1, "rms_norm_eps": 1e-05, "rope_theta": 10000, "semantic_codebook_size": 4096, "sliding_window": 72, "upsample_rates": [ 8, 5, 4, 3 ], "upsampling_ratios": [ 2, 2 ], "vector_quantization_hidden_dimension": 512 }, "dtype": "bfloat16", "enable_audio_output": true, "im_end_token_id": 151645, "im_start_token_id": 151644, "model_type": "qwen3_omni_moe", "quantization_config": { "bits": 4, "checkpoint_format": "gptq", "desc_act": false, "group_size": 128, "lm_head": false, "meta": { "act_group_aware": true, "damp_auto_increment": 0.01, "damp_percent": 0.05, "mse": 0.0, "quantizer": [ "gptqmodel:5.1.0-dev" ], "static_groups": false, "true_sequential": true, "uri": "https://github.com/modelcloud/gptqmodel", "v2": false, "v2_alpha": 0.25 }, "pack_dtype": "int32", "pack_impl": "cpu", "quant_method": "gptq", "sym": true }, "system_token_id": 8948, "talker_config": { "accept_hidden_layer": 1, "audio_end_token_id": 151670, "audio_start_token_id": 151669, "audio_token_id": 151675, "code_predictor_config": { "_name_or_path": "", "add_cross_attention": false, "architectures": null, "attention_bias": false, "attention_dropout": 0, "bad_words_ids": null, "begin_suppress_tokens": null, "bos_token_id": null, "chunk_size_feed_forward": 0, "cross_attention_hidden_size": null, "decoder_start_token_id": null, "diversity_penalty": 0.0, "do_sample": false, "dtype": null, "early_stopping": false, "encoder_no_repeat_ngram_size": 0, "eos_token_id": null, "exponential_decay_length_penalty": null, "finetuning_task": null, "forced_bos_token_id": null, "forced_eos_token_id": null, "head_dim": 128, "hidden_act": "silu", "hidden_size": 1024, "id2label": { "0": "LABEL_0", "1": "LABEL_1" }, "initializer_range": 0.02, "intermediate_size": 3072, "is_decoder": false, "is_encoder_decoder": false, "label2id": { "LABEL_0": 0, "LABEL_1": 1 }, "layer_types": [ "full_attention" ], "length_penalty": 1.0, "max_length": 20, "max_position_embeddings": 32768, "max_window_layers": 1, "min_length": 0, "model_type": "qwen3_omni_moe_talker_code_predictor", "no_repeat_ngram_size": 0, "num_attention_heads": 16, "num_beam_groups": 1, "num_beams": 1, "num_code_groups": 16, "num_hidden_layers": 1, "num_key_value_heads": 8, "num_return_sequences": 1, "output_attentions": false, "output_hidden_states": false, "output_scores": false, "pad_token_id": null, "prefix": null, "problem_type": null, "pruned_heads": {}, "remove_invalid_values": false, "repetition_penalty": 1.0, "return_dict": true, "return_dict_in_generate": false, "rms_norm_eps": 1e-06, "rope_scaling": null, "rope_theta": 1000000, "sep_token_id": null, "sliding_window": null, "suppress_tokens": null, "task_specific_params": null, "temperature": 1.0, "tf_legacy_loss": false, "tie_encoder_decoder": false, "tie_word_embeddings": false, "tokenizer_class": null, "top_k": 50, "top_p": 1.0, "torchscript": false, "typical_p": 1.0, "use_bfloat16": false, "use_cache": true, "use_sliding_window": false, "vocab_size": 2048 }, "codec_bos_id": 2149, 
"codec_eos_token_id": 2150, "codec_nothink_id": 2155, "codec_pad_id": 2148, "codec_think_bos_id": 2156, "codec_think_eos_id": 2157, "dtype": "bfloat16", "image_token_id": 151655, "model_type": "", "num_code_groups": 16, "output_router_logits": false, "position_id_per_seconds": 13, "seconds_per_chunk": 2, "spatial_merge_size": 2, "speaker_id": { "aiden": 2303, "chelsie": 2301, "ethan": 2302 }, "text_config": { "_name_or_path": "", "add_cross_attention": false, "architectures": null, "attention_bias": false, "attention_dropout": 0, "bad_words_ids": null, "begin_suppress_tokens": null, "bos_token_id": null, "chunk_size_feed_forward": 0, "cross_attention_hidden_size": null, "decoder_sparse_step": 1, "decoder_start_token_id": null, "diversity_penalty": 0.0, "do_sample": false, "dtype": null, "early_stopping": false, "encoder_no_repeat_ngram_size": 0, "eos_token_id": null, "exponential_decay_length_penalty": null, "finetuning_task": null, "forced_bos_token_id": null, "forced_eos_token_id": null, "head_dim": 128, "hidden_act": "silu", "hidden_size": 1024, "id2label": { "0": "LABEL_0", "1": "LABEL_1" }, "initializer_range": 0.02, "intermediate_size": 2048, "is_decoder": false, "is_encoder_decoder": false, "label2id": { "LABEL_0": 0, "LABEL_1": 1 }, "length_penalty": 1.0, "max_length": 20, "max_position_embeddings": 65536, "min_length": 0, "mlp_only_layers": [], "model_type": "qwen3_omni_moe_talker_text", "moe_intermediate_size": 384, "no_repeat_ngram_size": 0, "norm_topk_prob": true, "num_attention_heads": 16, "num_beam_groups": 1, "num_beams": 1, "num_experts": 128, "num_experts_per_tok": 6, "num_hidden_layers": 1, "num_key_value_heads": 2, "num_return_sequences": 1, "output_attentions": false, "output_hidden_states": false, "output_router_logits": false, "output_scores": false, "pad_token_id": null, "prefix": null, "problem_type": null, "pruned_heads": {}, "remove_invalid_values": false, "repetition_penalty": 1.0, "return_dict": true, "return_dict_in_generate": false, "rms_norm_eps": 1e-06, "rope_scaling": { "interleaved": true, "mrope_section": [ 24, 20, 20 ], "rope_type": "default", "type": "default" }, "rope_theta": 1000000, "router_aux_loss_coef": 0.001, "sep_token_id": null, "shared_expert_intermediate_size": 768, "sliding_window": null, "suppress_tokens": null, "task_specific_params": null, "temperature": 1.0, "tf_legacy_loss": false, "tie_encoder_decoder": false, "tie_word_embeddings": false, "tokenizer_class": null, "top_k": 50, "top_p": 1.0, "torchscript": false, "typical_p": 1.0, "use_bfloat16": false, "use_cache": true, "use_sliding_window": false, "vocab_size": 3072 }, "thinker_hidden_size": 2048, "video_token_id": 151656, "vision_start_token_id": 151652 }, "thinker_config": { "audio_config": { "_name_or_path": "", "activation_dropout": 0, "activation_function": "gelu", "add_cross_attention": false, "architectures": null, "attention_dropout": 0, "bad_words_ids": null, "begin_suppress_tokens": null, "bos_token_id": null, "chunk_size_feed_forward": 0, "conv_chunksize": 500, "cross_attention_hidden_size": null, "d_model": 1280, "decoder_start_token_id": null, "diversity_penalty": 0.0, "do_sample": false, "downsample_hidden_size": 480, "dropout": 0, "dtype": null, "early_stopping": false, "encoder_attention_heads": 20, "encoder_ffn_dim": 5120, "encoder_layers": 1, "encoder_no_repeat_ngram_size": 0, "eos_token_id": null, "exponential_decay_length_penalty": null, "finetuning_task": null, "forced_bos_token_id": null, "forced_eos_token_id": null, "id2label": { "0": "LABEL_0", "1": "LABEL_1" 
}, "initializer_range": 0.02, "is_decoder": false, "is_encoder_decoder": false, "label2id": { "LABEL_0": 0, "LABEL_1": 1 }, "length_penalty": 1.0, "max_length": 20, "max_source_positions": 1500, "min_length": 0, "model_type": "qwen3_omni_moe_audio_encoder", "n_window": 50, "n_window_infer": 800, "no_repeat_ngram_size": 0, "num_beam_groups": 1, "num_beams": 1, "num_hidden_layers": 1, "num_mel_bins": 128, "num_return_sequences": 1, "output_attentions": false, "output_dim": 2048, "output_hidden_states": false, "output_scores": false, "pad_token_id": null, "prefix": null, "problem_type": null, "pruned_heads": {}, "remove_invalid_values": false, "repetition_penalty": 1.0, "return_dict": true, "return_dict_in_generate": false, "scale_embedding": false, "sep_token_id": null, "suppress_tokens": null, "task_specific_params": null, "temperature": 1.0, "tf_legacy_loss": false, "tie_encoder_decoder": false, "tie_word_embeddings": true, "tokenizer_class": null, "top_k": 50, "top_p": 1.0, "torchscript": false, "typical_p": 1.0, "use_bfloat16": false }, "audio_end_token_id": 151670, "audio_start_token_id": 151669, "audio_token_id": 151675, "dtype": "bfloat16", "image_token_id": 151655, "initializer_range": 0.02, "model_type": "qwen3_omni_moe_thinker", "position_id_per_seconds": 13, "seconds_per_chunk": 2, "text_config": { "_name_or_path": "", "add_cross_attention": false, "architectures": null, "attention_bias": false, "attention_dropout": 0.0, "bad_words_ids": null, "begin_suppress_tokens": null, "bos_token_id": null, "chunk_size_feed_forward": 0, "cross_attention_hidden_size": null, "decoder_sparse_step": 1, "decoder_start_token_id": null, "diversity_penalty": 0.0, "do_sample": false, "dtype": null, "early_stopping": false, "encoder_no_repeat_ngram_size": 0, "eos_token_id": null, "exponential_decay_length_penalty": null, "finetuning_task": null, "forced_bos_token_id": null, "forced_eos_token_id": null, "head_dim": 128, "hidden_act": "silu", "hidden_size": 2048, "id2label": { "0": "LABEL_0", "1": "LABEL_1" }, "initializer_range": 0.02, "intermediate_size": 768, "is_decoder": false, "is_encoder_decoder": false, "label2id": { "LABEL_0": 0, "LABEL_1": 1 }, "length_penalty": 1.0, "max_length": 20, "max_position_embeddings": 65536, "min_length": 0, "mlp_only_layers": [], "model_type": "qwen3_omni_moe_text", "moe_intermediate_size": 768, "no_repeat_ngram_size": 0, "norm_topk_prob": true, "num_attention_heads": 32, "num_beam_groups": 1, "num_beams": 1, "num_experts": 128, "num_experts_per_tok": 8, "num_hidden_layers": 1, "num_key_value_heads": 4, "num_return_sequences": 1, "output_attentions": false, "output_hidden_states": false, "output_router_logits": false, "output_scores": false, "pad_token_id": null, "prefix": null, "problem_type": null, "pruned_heads": {}, "remove_invalid_values": false, "repetition_penalty": 1.0, "return_dict": true, "return_dict_in_generate": false, "rms_norm_eps": 1e-06, "rope_scaling": { "interleaved": true, "mrope_interleaved": true, "mrope_section": [ 24, 20, 20 ], "rope_type": "default", "type": "default" }, "rope_theta": 1000000, "router_aux_loss_coef": 0.001, "sep_token_id": null, "shared_expert_intermediate_size": 0, "sliding_window": null, "suppress_tokens": null, "task_specific_params": null, "temperature": 1.0, "tf_legacy_loss": false, "tie_encoder_decoder": false, "tie_word_embeddings": false, "tokenizer_class": null, "top_k": 50, "top_p": 1.0, "torchscript": false, "typical_p": 1.0, "use_bfloat16": false, "use_cache": true, "use_qk_norm": true, "use_sliding_window": 
false, "vocab_size": 152064 }, "user_token_id": 872, "video_token_id": 151656, "vision_config": { "_name_or_path": "", "add_cross_attention": false, "apply_vit_abs_pos_embed": true, "architectures": null, "bad_words_ids": null, "begin_suppress_tokens": null, "bos_token_id": null, "chunk_size_feed_forward": 0, "cross_attention_hidden_size": null, "decoder_start_token_id": null, "deepstack_visual_indexes": [ 8, 16, 24 ], "depth": 27, "diversity_penalty": 0.0, "do_sample": false, "dtype": null, "early_stopping": false, "encoder_no_repeat_ngram_size": 0, "eos_token_id": null, "exponential_decay_length_penalty": null, "finetuning_task": null, "forced_bos_token_id": null, "forced_eos_token_id": null, "hidden_act": "gelu_pytorch_tanh", "hidden_size": 1152, "id2label": { "0": "LABEL_0", "1": "LABEL_1" }, "image_size": 768, "in_channels": 3, "in_chans": 3, "initializer_range": 0.02, "intermediate_size": 4304, "is_decoder": false, "is_encoder_decoder": false, "label2id": { "LABEL_0": 0, "LABEL_1": 1 }, "length_penalty": 1.0, "max_length": 20, "min_length": 0, "model_type": "qwen3_omni_moe_vision_encoder", "no_repeat_ngram_size": 0, "num_beam_groups": 1, "num_beams": 1, "num_heads": 16, "num_position_embeddings": 2304, "num_return_sequences": 1, "out_hidden_size": 2048, "output_attentions": false, "output_hidden_states": false, "output_scores": false, "pad_token_id": null, "patch_size": 16, "prefix": null, "problem_type": null, "pruned_heads": {}, "remove_invalid_values": false, "repetition_penalty": 1.0, "return_dict": true, "return_dict_in_generate": false, "sep_token_id": null, "spatial_merge_size": 2, "spatial_patch_size": 16, "suppress_tokens": null, "task_specific_params": null, "temperature": 1.0, "temporal_patch_size": 2, "tf_legacy_loss": false, "tie_encoder_decoder": false, "tie_word_embeddings": true, "tokenizer_class": null, "tokens_per_second": 2, "top_k": 50, "top_p": 1.0, "torchscript": false, "typical_p": 1.0, "use_bfloat16": false }, "vision_end_token_id": 151653, "vision_start_token_id": 151652 }, "transformers_version": "4.57.1", "tts_bos_token_id": 151672, "tts_eos_token_id": 151673, "tts_pad_token_id": 151671, "use_cache": false, "user_token_id": 872 }

quantize_config.json: { "bits": 4, "group_size": 128, "desc_act": false, "sym": true, "lm_head": false, "quant_method": "gptq", "checkpoint_format": "gptq", "pack_dtype": "int32", "meta": { "quantizer": [ "gptqmodel:5.4.4" ], "uri": "https://github.com/modelcloud/gptqmodel", "damp_percent": 0.05, "damp_auto_increment": 0.01, "static_groups": false, "true_sequential": true, "mse": 0.0, "v2": false, "v2_alpha": 0.25, "act_group_aware": true }, "pack_impl": "cpu" }

To Reproduce

My quant script:

from datasets import load_dataset
from gptqmodel import GPTQModel, QuantizeConfig

model_id = "qwen3-omni-layer1"
quant_path = "qwen3-omni-layer1-GPTQ-4bit"

calibration_dataset = load_dataset(
    "json",
    data_files="c4-train.00001-of-01024.json.gz",
    split="train",
)

calibration_dataset = calibration_dataset.filter(lambda x: len(x["text"]) <= 1024)
calibration_dataset = calibration_dataset.select(range(1))["text"]

quant_config = QuantizeConfig(bits=4, group_size=128, vram_strategy="balanced")

model = GPTQModel.load(model_id, quant_config)

# increase `batch_size` to match gpu/vram specs to speed up quantization
model.quantize(calibration_dataset, batch_size=1)

model.save(quant_path)

inference with vllm:

import os
import torch

from vllm import LLM, SamplingParams

os.environ["VLLM_ENABLE_V1_MULTIPROCESSING"] = "0"


if __name__ == '__main__':
    # vLLM engine v1 not supported yet
    os.environ['VLLM_USE_V1'] = '0'

    MODEL_PATH = "qwen3-omni-layer1-GPTQ-4bit-no-talker"

    llm = LLM(
            model=MODEL_PATH, trust_remote_code=True, gpu_memory_utilization=0.8,
            tensor_parallel_size=torch.cuda.device_count(),
            limit_mm_per_prompt={'image': 3, 'video': 3, 'audio': 3},
            max_num_seqs=8,
            max_model_len=32768,
            seed=1234,
    )

Expected behavior

vllm loads it successfully.

Model/Datasets

model: https://huggingface.co/Qwen/Qwen3-Omni-30B-A3B-Instruct
dataset: https://huggingface.co/datasets/allenai/c4/blob/main/en/c4-train.00001-of-01024.json.gz
workable quantized model: https://huggingface.co/cpatonn/Qwen3-Omni-30B-A3B-Instruct-AWQ-4bit

allerou4 avatar Nov 27 '25 12:11 allerou4

Regarding the third one, solved by adding this config:

"modules_in_block_to_quantize": [
  "self_attn.q_proj",
  "self_attn.k_proj",
  "self_attn.v_proj",
  "self_attn.o_proj",
  "self_attn.qkv_proj",
  "mlp.gate"
],
"packed_modules_mapping": {
  "self_attn.qkv_proj": [
    "self_attn.q_proj",
    "self_attn.k_proj",
    "self_attn.v_proj"
  ]
}

See: https://github.com/vllm-project/vllm/pull/25455#issuecomment-3343836290

allerou4 avatar Nov 28 '25 09:11 allerou4

@allerou4 Can you check if it works without the packed qkv_proj config? vllm should be doing its own qkv fusing and should not need us to declare this. This part of it seems strange to me. Thanks. So also remove the packed_modules_mapping entirely.

"modules_in_block_to_quantize": [
"self_attn.q_proj",
"self_attn.k_proj",
"self_attn.v_proj",
"self_attn.o_proj",
"mlp.gate"
],

Qubitium avatar Nov 28 '25 12:11 Qubitium

@allerou4 Can you check if it works without the packed qkv_proj config? vllm should be doing its own qkv fusing and should not need us to declare this. This part of it seems strange to me. Thanks. So also remove the packed_modules_mapping entirely.

"modules_in_block_to_quantize": [ "self_attn.q_proj", "self_attn.k_proj", "self_attn.v_proj", "self_attn.o_proj", "mlp.gate" ],

Hi, we need the following block for vllm to load; we don't need "packed_modules_mapping":

  "modules_in_block_to_quantize": [
    "self_attn.q_proj",
    "self_attn.k_proj",
    "self_attn.v_proj",
    "self_attn.o_proj",
    "self_attn.qkv_proj",
    "mlp.gate"
  ]
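For convenience, a sketch that patches this into the saved config.json (assuming the standard transformers layout where the key lives under quantization_config; the path is whatever quant_path was used):

import json

path = "qwen3-omni-layer1-GPTQ-4bit-no-talker/config.json"  # placeholder path
with open(path) as f:
    cfg = json.load(f)

cfg["quantization_config"]["modules_in_block_to_quantize"] = [
    "self_attn.q_proj", "self_attn.k_proj", "self_attn.v_proj",
    "self_attn.o_proj", "self_attn.qkv_proj", "mlp.gate",
]

with open(path, "w") as f:
    json.dump(cfg, f, indent=2)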

But how do we solve the mixed-precision problem? In layer 2, only expert 105 skipped quantization, but vllm doesn't support this:

thinker.model.layers.2.mlp.experts.99.down_proj.qweight
thinker.model.layers.2.mlp.experts.99.down_proj.qzeros
thinker.model.layers.2.mlp.experts.99.down_proj.scales
thinker.model.layers.2.mlp.experts.99.gate_proj.g_idx
thinker.model.layers.2.mlp.experts.99.gate_proj.qweight
thinker.model.layers.2.mlp.experts.99.gate_proj.qzeros
thinker.model.layers.2.mlp.experts.99.gate_proj.scales
thinker.model.layers.2.mlp.experts.99.up_proj.g_idx
thinker.model.layers.2.mlp.experts.99.up_proj.qweight
thinker.model.layers.2.mlp.experts.99.up_proj.qzeros
thinker.model.layers.2.mlp.experts.99.up_proj.scales
thinker.model.layers.2.mlp.experts.105.down_proj.weight
thinker.model.layers.2.mlp.experts.105.gate_proj.weight
thinker.model.layers.2.mlp.experts.105.up_proj.weight

allerou4 avatar Dec 01 '25 02:12 allerou4

Regarding bug 1:

positional_embedding is a non-persistent buffer, but it is being written to safetensors after calling gptqmodel.utils.model.get_state_dict_for_save(). This error should be related to offload_disk. I am checking the relevant code.
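To illustrate why the key should never be serialized: a buffer registered with persistent=False is excluded from state_dict() (a minimal PyTorch example, not GPTQModel code):

import torch
import torch.nn as nn

class Demo(nn.Module):
    def __init__(self):
        super().__init__()
        # persistent=False keeps the buffer out of state_dict(), so it should
        # never reach a saved safetensors file.
        self.register_buffer("positional_embedding", torch.zeros(4, 8), persistent=False)

print(list(Demo().state_dict().keys()))  # [] -> buffer excluded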

ZX-ModelCloud avatar Dec 09 '25 01:12 ZX-ModelCloud

Regarding bug 2/3:

It has been fixed in the main branch code of vllm.

https://github.com/vllm-project/vllm/pull/29896/files#diff-a65936ff683c1b4c8d7f3cdd49c28022f38d5e7cfbee857e7dc8c4f6731af0f9R1141-R1152

ZX-ModelCloud avatar Dec 09 '25 01:12 ZX-ModelCloud

Regarding bug 1:

positional_embedding is a non-persistent buffer, but it is being written to safetensors after calling gptqmodel.utils.model.get_state_dict_for_save(). This error should be related to offload_disk. I am checking the relevant code.

Bug 1 has been fixed: PR#2242

ZX-ModelCloud avatar Dec 09 '25 05:12 ZX-ModelCloud

Regarding bug 2/3:

It has been fixed in the main branch code of vllm.

https://github.com/vllm-project/vllm/pull/29896/files#diff-a65936ff683c1b4c8d7f3cdd49c28022f38d5e7cfbee857e7dc8c4f6731af0f9R1141-R1152

Hi, I think this PR only solves bug 2. For bug 3, I manually replaced the whole layer of MoE experts with bf16 when some experts escaped quantization, but we still have to apply this unmerged vllm PR: https://github.com/vllm-project/vllm/pull/27608

Maybe we should add a configuration option to skip quantization for all MoE experts in a layer whenever any of them would be skipped.
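Something like GPTQModel's dynamic override rules might already express this at quantization time (a sketch; the "-:" skip prefix and the regex against the thinker's expert module names are assumptions to check against GPTQModel's docs):

from gptqmodel import QuantizeConfig

# Hedged sketch: exclude every MoE expert from quantization so no layer ends
# up with mixed precisions; "-:" is assumed to mark modules to skip.
quant_config = QuantizeConfig(
    bits=4,
    group_size=128,
    dynamic={"-:.*mlp\\.experts\\..*": {}},
)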

allerou4 avatar Dec 10 '25 05:12 allerou4