Launching the new MoE model with vLLM fails with AttributeError: 'MergedColumnParallelLinear' object has no attribute 'weight'
Both transformers and vllm have already been built from source.
The full error output is as follows:
Traceback (most recent call last):
File "/home/yu/anaconda3/envs/pt22cu121/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/yu/anaconda3/envs/pt22cu121/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/yu/project/vllm/vllm/entrypoints/openai/api_server.py", line 156, in <module>
engine = AsyncLLMEngine.from_engine_args(
File "/home/yu/project/vllm/vllm/engine/async_llm_engine.py", line 348, in from_engine_args
engine = cls(
File "/home/yu/project/vllm/vllm/engine/async_llm_engine.py", line 311, in __init__
self.engine = self._init_engine(*args, **kwargs)
File "/home/yu/project/vllm/vllm/engine/async_llm_engine.py", line 422, in _init_engine
return engine_class(*args, **kwargs)
File "/home/yu/project/vllm/vllm/engine/llm_engine.py", line 111, in __init__
self.model_executor = executor_class(model_config, cache_config,
File "/home/yu/project/vllm/vllm/executor/gpu_executor.py", line 37, in __init__
self._init_worker()
File "/home/yu/project/vllm/vllm/executor/gpu_executor.py", line 66, in _init_worker
self.driver_worker.load_model()
File "/home/yu/project/vllm/vllm/worker/worker.py", line 107, in load_model
self.model_runner.load_model()
File "/home/yu/project/vllm/vllm/worker/model_runner.py", line 95, in load_model
self.model = get_model(
File "/home/yu/project/vllm/vllm/model_executor/model_loader.py", line 91, in get_model
model = model_class(model_config.hf_config, linear_method)
File "/home/yu/project/vllm/vllm/model_executor/models/qwen2_moe.py", line 378, in __init__
self.model = Qwen2MoeModel(config, linear_method)
File "/home/yu/project/vllm/vllm/model_executor/models/qwen2_moe.py", line 342, in __init__
self.layers = nn.ModuleList([
File "/home/yu/project/vllm/vllm/model_executor/models/qwen2_moe.py", line 343, in <listcomp>
Qwen2MoeDecoderLayer(config,
File "/home/yu/project/vllm/vllm/model_executor/models/qwen2_moe.py", line 284, in __init__
self.mlp = Qwen2MoeSparseMoeBlock(config=config,
File "/home/yu/project/vllm/vllm/model_executor/models/qwen2_moe.py", line 114, in __init__
self.pack_params()
File "/home/yu/project/vllm/vllm/model_executor/models/qwen2_moe.py", line 138, in pack_params
w1.append(expert.gate_up_proj.weight)
File "/home/yu/anaconda3/envs/pt22cu121/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1695, in __getattr__
raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'MergedColumnParallelLinear' object has no attribute 'weight'
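For context on the failing line: pack_params in qwen2_moe.py collects each expert's gate_up_proj weight tensor by attribute. If the linear layers were created by a linear_method that registers packed tensors (for example qweight and scales for GPTQ) rather than a parameter literally named weight, the lookup falls through to nn.Module.__getattr__ and raises exactly this error. Below is a minimal, self-contained sketch of that mechanism; the class is a stand-in for illustration, not vLLM's actual MergedColumnParallelLinear.

    import torch
    import torch.nn as nn

    class PackedLinear(nn.Module):
        """Stand-in for a quantized parallel linear layer: it has no plain `weight` attribute."""
        def __init__(self, in_features: int, out_features: int):
            super().__init__()
            # GPTQ-style packed tensors are registered instead of a dense `weight`
            self.register_buffer("qweight", torch.zeros(in_features // 8, out_features, dtype=torch.int32))
            self.register_buffer("scales", torch.ones(1, out_features, dtype=torch.float16))

    gate_up_proj = PackedLinear(64, 128)
    w1 = []
    try:
        # same attribute access pattern as pack_params(): w1.append(expert.gate_up_proj.weight)
        w1.append(gate_up_proj.weight)
    except AttributeError as err:
        # nn.Module.__getattr__ raises because no parameter or buffer named `weight` exists
        print(err)  # 'PackedLinear' object has no attribute 'weight'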
I ran into this problem too. Not sure whether it is a GPU issue; I am running two 2080 Ti cards.
I think I have more or less found the cause: it is a vLLM version issue. With the latest vLLM you get this error, but if you use vllm==0.3.0, the version pinned in the official Dockerfile, you instead get:
ValueError: Model architectures ['Qwen2MoeForCausalLM'] are not supported for now. Supported architectures: ['AquilaModel', 'AquilaForCausalLM', 'BaiChuanForCausalLM', 'BaichuanForCausalLM', 'BloomForCausalLM', 'ChatGLMModel', 'ChatGLMForConditionalGeneration', 'DeciLMForCausalLM', 'DeepseekForCausalLM', 'FalconForCausalLM', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GPTJForCausalLM', 'GPTNeoXForCausalLM', 'InternLMForCausalLM', 'InternLM2ForCausalLM', 'LlamaForCausalLM', 'LLaMAForCausalLM', 'MistralForCausalLM', 'MixtralForCausalLM', 'QuantMixtralForCausalLM', 'MptForCausalLM', 'MPTForCausalLM', 'OPTForCausalLM', 'PhiForCausalLM', 'QWenLMHeadModel', 'Qwen2ForCausalLM', 'RWForCausalLM', 'StableLMEpochForCausalLM']
So I hope this gets fixed upstream soon.
I hit the same error. Is there a solution yet?
+1. Hoping for a fix as well.
This also seems to happen on 0.4.0; as far as I remember, upgrading transformers to 4.40.0 fixed it.
+1. The full error output is below:
WARNING 06-07 15:49:56 config.py:208] gptq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
2024-06-07 15:49:58,873 INFO worker.py:1724 -- Started a local Ray instance.
INFO 06-07 15:50:00 llm_engine.py:75] Initializing an LLM engine (v0.4.0) with config: model='/opt/dtdream/models/Qwen2-57B-A14B-Instruct-GPTQ-Int4', tokenizer='/opt/dtdream/models/Qwen2-57B-A14B-Instruct-GPTQ-Int4', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=14336, download_dir=None, load_format=auto, tensor_parallel_size=4, disable_custom_all_reduce=False, quantization=gptq, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, seed=0)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 06-07 15:50:18 selector.py:45] Cannot use FlashAttention because the package is not found. Please install it for better performance.
INFO 06-07 15:50:18 selector.py:21] Using XFormers backend.
2024-06-07 15:50:18 | INFO | stdout | (RayWorkerVllm pid=615) INFO 06-07 15:50:18 selector.py:45] Cannot use FlashAttention because the package is not found. Please install it for better performance.
2024-06-07 15:50:18 | INFO | stdout | (RayWorkerVllm pid=615) INFO 06-07 15:50:18 selector.py:21] Using XFormers backend.
2024-06-07 15:50:20 | INFO | stdout | (RayWorkerVllm pid=615) INFO 06-07 15:50:19 pynccl_utils.py:45] vLLM is using nccl==2.18.1
INFO 06-07 15:50:20 pynccl_utils.py:45] vLLM is using nccl==2.18.1
WARNING 06-07 15:50:26 custom_all_reduce.py:45] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.
2024-06-07 15:50:26 | INFO | stdout | (RayWorkerVllm pid=615) WARNING 06-07 15:50:26 custom_all_reduce.py:149] Cannot test GPU P2P because not all GPUs are visible to the current process. This might be the case if 'CUDA_VISIBLE_DEVICES' is set.
2024-06-07 15:50:26 | INFO | stdout | (RayWorkerVllm pid=615) WARNING 06-07 15:50:26 custom_all_reduce.py:45] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.
2024-06-07 15:50:26 | INFO | stdout | (RayWorkerVllm pid=767) INFO 06-07 15:50:18 selector.py:45] Cannot use FlashAttention because the package is not found. Please install it for better performance. [repeated 2x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/ray-logging.html#log-deduplication for more options.)
2024-06-07 15:50:26 | INFO | stdout | (RayWorkerVllm pid=767) INFO 06-07 15:50:18 selector.py:21] Using XFormers backend. [repeated 2x across cluster]
2024-06-07 15:50:26 | INFO | stdout | (RayWorkerVllm pid=767) INFO 06-07 15:50:20 pynccl_utils.py:45] vLLM is using nccl==2.18.1 [repeated 2x across cluster]
2024-06-07 15:50:26 | INFO | stdout | (RayWorkerVllm pid=615) ERROR 06-07 15:50:26 ray_utils.py:44] Error executing method load_model. This might cause deadlock in distributed execution.
2024-06-07 15:50:26 | INFO | stdout | (RayWorkerVllm pid=615) ERROR 06-07 15:50:26 ray_utils.py:44] Traceback (most recent call last):
2024-06-07 15:50:26 | INFO | stdout | (RayWorkerVllm pid=615) ERROR 06-07 15:50:26 ray_utils.py:44] File "/root/conda/lib/python3.10/site-packages/vllm/engine/ray_utils.py", line 37, in execute_method
2024-06-07 15:50:26 | INFO | stdout | (RayWorkerVllm pid=615) ERROR 06-07 15:50:26 ray_utils.py:44] return executor(*args, **kwargs)
2024-06-07 15:50:26 | INFO | stdout | (RayWorkerVllm pid=615) ERROR 06-07 15:50:26 ray_utils.py:44] File "/root/conda/lib/python3.10/site-packages/vllm/worker/worker.py", line 107, in load_model
2024-06-07 15:50:26 | INFO | stdout | (RayWorkerVllm pid=615) ERROR 06-07 15:50:26 ray_utils.py:44] self.model_runner.load_model()
2024-06-07 15:50:26 | INFO | stdout | (RayWorkerVllm pid=615) ERROR 06-07 15:50:26 ray_utils.py:44] File "/root/conda/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 95, in load_model
2024-06-07 15:50:26 | INFO | stdout | (RayWorkerVllm pid=615) ERROR 06-07 15:50:26 ray_utils.py:44] self.model = get_model(
2024-06-07 15:50:26 | INFO | stdout | (RayWorkerVllm pid=615) ERROR 06-07 15:50:26 ray_utils.py:44] File "/root/conda/lib/python3.10/site-packages/vllm/model_executor/model_loader.py", line 60, in get_model
2024-06-07 15:50:26 | INFO | stdout | (RayWorkerVllm pid=615) ERROR 06-07 15:50:26 ray_utils.py:44] capability = torch.cuda.get_device_capability()
2024-06-07 15:50:26 | INFO | stdout | (RayWorkerVllm pid=615) ERROR 06-07 15:50:26 ray_utils.py:44] File "/root/conda/lib/python3.10/site-packages/torch/cuda/__init__.py", line 435, in get_device_capability
2024-06-07 15:50:26 | INFO | stdout | (RayWorkerVllm pid=615) ERROR 06-07 15:50:26 ray_utils.py:44] prop = get_device_properties(device)
2024-06-07 15:50:26 | INFO | stdout | (RayWorkerVllm pid=615) ERROR 06-07 15:50:26 ray_utils.py:44] File "/root/conda/lib/python3.10/site-packages/torch/cuda/__init__.py", line 452, in get_device_properties
2024-06-07 15:50:26 | INFO | stdout | (RayWorkerVllm pid=615) ERROR 06-07 15:50:26 ray_utils.py:44] raise AssertionError("Invalid device id")
2024-06-07 15:50:26 | INFO | stdout | (RayWorkerVllm pid=615) ERROR 06-07 15:50:26 ray_utils.py:44] AssertionError: Invalid device id
2024-06-07 15:50:26 | ERROR | stderr | Traceback (most recent call last):
2024-06-07 15:50:26 | ERROR | stderr | File "/root/conda/lib/python3.10/runpy.py", line 196, in _run_module_as_main
2024-06-07 15:50:26 | ERROR | stderr | return _run_code(code, main_globals, None,
2024-06-07 15:50:26 | ERROR | stderr | File "/root/conda/lib/python3.10/runpy.py", line 86, in _run_code
2024-06-07 15:50:26 | ERROR | stderr | exec(code, run_globals)
2024-06-07 15:50:26 | ERROR | stderr | File "/root/conda/lib/python3.10/site-packages/fastchat/serve/vllm_worker.py", line 290, in <module>
+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
Here is my error output. I am using the Qwen2-57B-A14B-Instruct-GPTQ-Int4 model; Qwen2-7B-Instruct starts up fine. Environment: vllm==0.4.1+cu118 and transformers==4.41.2.
kenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=gptq, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 06-07 17:06:47 utils.py:608] Found nccl from library /root/.config/vllm/nccl/cu11/libnccl.so.2.18.1
INFO 06-07 17:06:48 selector.py:28] Using FlashAttention backend.
Traceback (most recent call last):
File "/root/anaconda3/envs/llama_factory/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/root/anaconda3/envs/llama_factory/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/root/anaconda3/envs/llama_factory/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 159, in
+1, same problem here. How is Qwen2-57B-A14B-Instruct-GPTQ-Int4 supposed to be served with vLLM?
It looks like vLLM cannot load the quantized model; I tested loading the unquantized model and it works fine.
Personally tested transformers==4.40.1 + vllm==0.4.0 + CUDA 12.1: no problems.
With transformers==4.41.0, vllm==0.4.0.post1, torch==2.1.2 I can load Qwen2-72B-Instruct-AWQ, but Qwen2-57B-A14B-Instruct-GPTQ-Int4 still fails.
With transformers==4.41.2, vllm==0.5.0, torch==2.3.0, CUDA 12.0, the 57B model still raises AttributeError: 'MergedColumnParallelLinear' object has no attribute 'weight'. Did you mean: 'qweight'? The other models work. Is this because it is an MoE model?
Qwen2-57B-A14B-Instruct-GPTQ-Int4 fails for me as well.
Hi, guys! Quantized MoE models are not currently supported in vllm.
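If it helps with triage until quantized MoE support lands: a small pre-flight check can tell whether a local checkpoint is GPTQ-quantized before handing it to vLLM, so you can fall back to the unquantized Qwen2-57B-A14B-Instruct weights. This is only a sketch, assuming the usual Hugging Face layout where GPTQ checkpoints carry a quantization_config block in config.json.

    import json
    from pathlib import Path

    def is_gptq_checkpoint(model_dir: str) -> bool:
        """Return True if config.json declares a GPTQ quantization_config."""
        cfg = json.loads((Path(model_dir) / "config.json").read_text())
        quant_cfg = cfg.get("quantization_config") or {}
        return str(quant_cfg.get("quant_method", "")).lower() == "gptq"

    # Example with the local path from the logs above; if this prints True,
    # serve the unquantized checkpoint instead.
    print(is_gptq_checkpoint("/opt/dtdream/models/Qwen2-57B-A14B-Instruct-GPTQ-Int4"))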
This issue has been automatically marked as inactive due to lack of recent activity. Should you believe it remains unresolved and warrants attention, kindly leave a comment on this thread.
This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.