GLM-4
Can run THUDM/GLM-Z1-32B-0414 with --model-impl transformers on a single GPU, but it fails when --tensor-parallel-size 8 is added
System Info / 系統信息
vllm==0.8.4 transformers==4.51.3 torch==2.6.0 cuda==12.4
Who can help? / 谁可以帮助到您?
No response
Information / 问题信息
- [ ] The official example scripts / 官方的示例脚本
- [ ] My own modified scripts / 我自己修改的脚本和任务
Reproduction / 复现过程
If you run:
vllm serve THUDM/GLM-Z1-32B-0414 --model-impl transformers --tensor-parallel-size 8
you will get:
ERROR 04-16 12:59:58 [core.py:387] EngineCore hit an exception: Traceback (most recent call last):
ERROR 04-16 12:59:58 [core.py:387] File "/root/.miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 378, in run_engine_core
ERROR 04-16 12:59:58 [core.py:387] engine_core = EngineCoreProc(*args, **kwargs)
ERROR 04-16 12:59:58 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-16 12:59:58 [core.py:387] File "/root/.miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 320, in __init__
ERROR 04-16 12:59:58 [core.py:387] super().__init__(vllm_config, executor_class, log_stats)
ERROR 04-16 12:59:58 [core.py:387] File "/root/.miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 71, in __init__
ERROR 04-16 12:59:58 [core.py:387] self._initialize_kv_caches(vllm_config)
ERROR 04-16 12:59:58 [core.py:387] File "/root/.miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 133, in _initialize_kv_caches
ERROR 04-16 12:59:58 [core.py:387] available_gpu_memory = self.model_executor.determine_available_memory()
ERROR 04-16 12:59:58 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-16 12:59:58 [core.py:387] File "/root/.miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/v1/executor/abstract.py", line 66, in determine_available_memory
ERROR 04-16 12:59:58 [core.py:387] output = self.collective_rpc("determine_available_memory")
ERROR 04-16 12:59:58 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-16 12:59:58 [core.py:387] File "/root/.miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 133, in collective_rpc
ERROR 04-16 12:59:58 [core.py:387] raise e
ERROR 04-16 12:59:58 [core.py:387] File "/root/.miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 122, in collective_rpc
ERROR 04-16 12:59:58 [core.py:387] raise RuntimeError(
ERROR 04-16 12:59:58 [core.py:387] RuntimeError: ('Worker failed with error %s, please check the stack trace above for the root cause', 'Failed running call_method view(*(FakeTensor(..., device=\'cuda:0\', size=(1, s0, 32), dtype=torch.bfloat16), (1, s0, -1, 128)), **{}):\nshape \'[1, s0, -1, 128]\' is invalid for input of size 32*s0\n\nfrom user code:\n File "/root/.miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/model_executor/models/transformers.py", line 425, in forward\n model_output = self.model(input_ids, positions, intermediate_tensors,\n File "/root/.miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/model_executor/models/transformers.py", line 330, in forward\n hidden_states = self.model(\n File "/root/.miniconda3/envs/vllm/lib/python3.12/site-packages/transformers/utils/generic.py", line 965, in wrapper\n output = func(self, *args, **kwargs)\n File "/root/.miniconda3/envs/vllm/lib/python3.12/site-packages/transformers/models/glm4/modeling_glm4.py", line 589, in forward\n layer_outputs = decoder_layer(\n File "/root/.miniconda3/envs/vllm/lib/python3.12/site-packages/transformers/models/glm4/modeling_glm4.py", line 115, in forward\n hidden_states, self_attn_weights = self.self_attn(\n File "/root/.miniconda3/envs/vllm/lib/python3.12/site-packages/transformers/models/glm4/modeling_glm4.py", line 268, in forward\n key_states = self.k_proj(hidden_states).view(hidden_shape).transpose(1, 2)\n\nSet TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information\n\n\nYou can suppress this exception and fall back to eager by setting:\n import torch._dynamo\n torch._dynamo.config.suppress_errors = True\n')
ERROR 04-16 12:59:58 [core.py:387]
Expected behavior / 期待表现
vllm serve works on a single GPU but fails on multiple GPUs after the safetensors are loaded.
P.S. I tried --tensor-parallel-size 2 and it works, but it fails with a size of 4.
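The pattern above (TP=2 works, TP=4 and TP=8 fail) is consistent with the K projection width not dividing evenly across tensor-parallel ranks. Judging from the traceback alone (a per-rank tensor of width 32 being viewed as (-1, 128) at TP=8), the full K projection width would be 256 with a head_dim of 128; these numbers are inferred from the log, not checked against the model config. A minimal sketch of the reshape that fails:

```python
import torch

# Widths inferred from the traceback, not from the GLM-Z1-32B config:
# per-rank K width is 32 at TP=8, implying a full K projection width of 256,
# and the view target uses head_dim = 128.
head_dim = 128
kv_width = 256  # assumed total K projection width before tensor-parallel sharding

for tp in (1, 2, 4, 8):
    per_rank = kv_width // tp          # columns each TP rank holds after sharding k_proj
    x = torch.zeros(1, 3, per_rank)    # (batch, seq, per-rank width); seq=3 stands in for s0
    try:
        # The reshape from modeling_glm4.py: .view(hidden_shape).transpose(1, 2)
        x.view(1, 3, -1, head_dim)
        print(f"tp={tp}: view ok ({per_rank} -> {per_rank // head_dim} head(s) per rank)")
    except RuntimeError:
        print(f"tp={tp}: view fails ({per_rank} is not a multiple of {head_dim})")
```

Under these assumed widths the view succeeds for tp=1 and tp=2 and raises for tp=4 and tp=8, matching the behavior reported in this issue.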
You should update to the latest vLLM source code, apply https://github.com/vllm-project/vllm/pull/16618, and drop --model-impl transformers, since that PR has already fixed vLLM support for this model.
@zRzRzRzRzRzRzR Got it. I thought that approach only worked for the 9B model. I'll update my code first. Is there an estimated merge date?
Not certain yet. This PR was not opened by me, so the original author needs to meet vLLM's merge requirements, and the vLLM maintainers will handle the merge. Thanks for your understanding, and thanks to the community authors for their enthusiastic contributions.