
How to get API tool calling working with GLM-4-32B-0414 deployed via vLLM

Open jianpugh opened this issue 9 months ago • 16 comments

System Info / 系統信息

GLM-4-32B-0414 deployed with vLLM

请求体 { "model": "GLM-4-32B-0414", "top_p": 0.1, "temperature": 0.01, "tools": [ { "type": "function", "function": { "name": "realtime_aqi", "description": "天气预报。获取实时空气质量。当前空气质量,PM2.5,PM10信息", "parameters": { "type": "object", "properties": { "city": { "description": "城市名" } }, "required": [ "city" ] } } } ], "messages": [ { "role": "user", "content": "How's the weather in Hangzhou?" } ] }

返回为 { "object": "error", "message": "Hermes 2 Pro Tool parser could not locate tool call start/end tokens in the tokenizer!", "type": "BadRequestError", "param": null, "code": 400 }

Is this a problem with how I'm sending the request? Is there a reference example of how to do tool calling with vLLM?

Who can help? / 谁可以帮助到您?

No response

Information / 问题信息

  • [x] The official example scripts / 官方的示例脚本
  • [ ] My own modified scripts / 我自己修改的脚本和任务

Reproduction / 复现过程

Deploy GLM-4-32B-0414 with vLLM's OpenAI-compatible server, then send the request body pasted in the issue above.

Expected behavior / 期待表现

The tool call should be returned correctly.

jianpugh avatar Apr 15 '25 13:04 jianpugh

vllm: 0.8.1 CUDA: 12.0 A100*2

jianpugh avatar Apr 15 '25 13:04 jianpugh

I'd like to know this as well.

xbl916 avatar Apr 15 '25 14:04 xbl916

vLLM's --tool-call-parser is set to pythonic, and the tools prompt is being triggered, but tool_calls comes back as an empty list.


gaoming1227 avatar Apr 16 '25 03:04 gaoming1227

vLLM's --tool-call-parser is set to pythonic, and the tools prompt is being triggered, but tool_calls comes back as an empty list.


Is there any documentation that says to use --tool-call-parser pythonic?

jianpugh avatar Apr 16 '25 03:04 jianpugh

This is now partially solved. When calling via the openai library:

response = await self.client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=messages,
    tool_choice="required",
    tools=available_tools,
    temperature=0
)

Adding tool_choice="required" forces the model to select at least one tool; with that, tool_calls is populated.
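For anyone who wants a runnable reference, below is a minimal self-contained sketch of that workaround against a vLLM OpenAI-compatible endpoint. The base URL, API key, served model name ("glm4"), and the reuse of the realtime_aqi schema from the issue body are illustrative assumptions, not part of the original comment (the "city" property is also given an explicit "type" here, which the original request body omitted):

import asyncio
from openai import AsyncOpenAI

# Assumed: vLLM serving the model as "glm4" on localhost:8000 with
# --enable-auto-tool-choice and a tool-call parser configured.
client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

available_tools = [{
    "type": "function",
    "function": {
        "name": "realtime_aqi",
        "description": "天气预报。获取实时空气质量。当前空气质量,PM2.5,PM10信息",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string", "description": "城市名"}},
            "required": ["city"],
        },
    },
}]

async def main():
    response = await client.chat.completions.create(
        model="glm4",
        messages=[{"role": "user", "content": "How's the weather in Hangzhou?"}],
        tools=available_tools,
        tool_choice="required",  # force the model to pick at least one tool
        temperature=0,
    )
    # With tool_choice="required", message.tool_calls should contain at least one entry.
    for tc in response.choices[0].message.tool_calls or []:
        print(tc.id, tc.function.name, tc.function.arguments)

asyncio.run(main())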

gaoming1227 avatar Apr 16 '25 08:04 gaoming1227

Is this something you configure on the client side when making the call?

This is now partially solved. When calling via the openai library:

response = await self.client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=messages,
    tool_choice="required",
    tools=available_tools,
    temperature=0
)

Adding tool_choice="required" forces the model to select at least one tool; with that, tool_calls is populated.

xbl916 avatar Apr 16 '25 10:04 xbl916

It's configured on the MCP client side, when calling the model. tool_choice="required" means the model must select one or more tools.
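Once tool_calls comes back non-empty, the rest follows the standard OpenAI tool-calling round trip: append the assistant message with its tool_calls, run each tool locally, and append one "tool" message per call before asking the model again. A rough sketch of that step is below; the append_tool_results helper and the tool_impls dispatcher are hypothetical names, and whether GLM-4's chat template accepts the "tool" role as-is should be verified against the model's template:

import json

def append_tool_results(messages, assistant_message, tool_impls):
    """Append the assistant's tool calls plus one 'tool' message per call.

    tool_impls maps a tool name to a local Python callable (hypothetical here).
    """
    messages.append({
        "role": "assistant",
        "content": assistant_message.content or "",
        "tool_calls": [
            {
                "id": tc.id,
                "type": "function",
                "function": {"name": tc.function.name, "arguments": tc.function.arguments},
            }
            for tc in assistant_message.tool_calls
        ],
    })
    for tc in assistant_message.tool_calls:
        args = json.loads(tc.function.arguments)       # arguments arrive as a JSON string
        result = tool_impls[tc.function.name](**args)  # e.g. realtime_aqi(city="杭州")
        messages.append({
            "role": "tool",
            "tool_call_id": tc.id,
            "content": json.dumps(result, ensure_ascii=False),
        })
    return messages

# After appending, send `messages` back through chat.completions.create
# (this time without tool_choice="required") to get the final answer.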

gaoming1227 avatar Apr 16 '25 11:04 gaoming1227

CUDA_VISIBLE_DEVICES=0,1 VLLM_USE_V1=0 vllm serve /home/lc/work/models/GLM-4-32B-0414 \
  --port 8000 \
  --trust-remote-code \
  --max-model-len 32768 \
  --tensor-parallel-size 2 \
  --gpu_memory_utilization 0.8 \
  --served-model-name "glm4" \
  --enable-auto-tool-choice \
  --tool-call-parser pythonic \
  --trust-remote-code

This fails to start. vllm == 0.8.4, transformers == 4.51.3, 2x H100 80GB

(llm) (base) lc@ai-h100:~/work/vllm$ CUDA_VISIBLE_DEVICES=0,1 VLLM_USE_V1=0 vllm serve /home/lc/work/models/GLM-4-32B-0414   --port 8000   --trust-remote-code   --max-model-len 32768   --tensor-parallel-size 2   --gpu_memory_utilization 0.8   --s
erved-model-name "glm4"   --enable-auto-tool-choice   --tool-call-parser pythonic   --trust-remote-code 
INFO 04-17 02:04:28 [__init__.py:239] Automatically detected platform cuda.
WARNING 04-17 02:04:28 [cuda.py:409] Detected different devices in the system: NVIDIA H100 80GB HBM3, NVIDIA H100. Please make sure to set `CUDA_DEVICE_ORDER=PCI_BUS_ID` to avoid unexpected behavior.
INFO 04-17 02:04:30 [api_server.py:1034] vLLM API server version 0.8.4
INFO 04-17 02:04:30 [api_server.py:1035] args: Namespace(subparser='serve', model_tag='/home/lc/work/models/GLM-4-32B-0414', config='', host=None, port=8000, uvicorn_log_level='info', disable_uvicorn_access_log=False, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, enable_ssl_refresh=False, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=True, tool_call_parser='pythonic', tool_parser_plugin='', model='/home/lc/work/models/GLM-4-32B-0414', task='auto', tokenizer=None, hf_config_path=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, allowed_local_media_path=None, load_format='auto', download_dir=None, model_loader_extra_config=None, use_tqdm_on_load=True, config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', max_model_len=32768, guided_decoding_backend='xgrammar', logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=2, data_parallel_size=1, enable_expert_parallel=False, max_parallel_loading_workers=None, ray_workers_use_nsight=False, disable_custom_all_reduce=False, block_size=None, enable_prefix_caching=None, prefix_caching_hash_algo='builtin', disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=None, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.8, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_token=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=['glm4'], qlora_adapter_name_or_path=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', worker_extension_cls='', generation_config='auto', override_generation_config=None, enable_sleep_mode=False, calculate_kv_scales=False, additional_config=None, enable_reasoning=False, reasoning_parser=None, disable_cascade_attn=False, disable_chunked_mm_input=False, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, 
enable_server_load_tracking=False, dispatch_function=<function ServeSubcommand.cmd at 0x70a76c095260>)
INFO 04-17 02:04:37 [config.py:689] This model supports multiple tasks: {'score', 'classify', 'generate', 'reward', 'embed'}. Defaulting to 'generate'.
INFO 04-17 02:04:37 [config.py:1713] Defaulting to use mp for distributed inference
INFO 04-17 02:04:37 [api_server.py:246] Started engine process with PID 191738
INFO 04-17 02:04:41 [__init__.py:239] Automatically detected platform cuda.
WARNING 04-17 02:04:41 [cuda.py:409] Detected different devices in the system: NVIDIA H100 80GB HBM3, NVIDIA H100. Please make sure to set `CUDA_DEVICE_ORDER=PCI_BUS_ID` to avoid unexpected behavior.
INFO 04-17 02:04:43 [llm_engine.py:243] Initializing a V0 LLM engine (v0.8.4) with config: model='/home/lc/work/models/GLM-4-32B-0414', speculative_config=None, tokenizer='/home/lc/work/models/GLM-4-32B-0414', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=glm4, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=True, 
WARNING 04-17 02:04:43 [multiproc_worker_utils.py:306] Reducing Torch parallelism from 64 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 04-17 02:04:44 [cuda.py:292] Using Flash Attention backend.
INFO 04-17 02:04:47 [__init__.py:239] Automatically detected platform cuda.
WARNING 04-17 02:04:48 [cuda.py:409] Detected different devices in the system: NVIDIA H100 80GB HBM3, NVIDIA H100. Please make sure to set `CUDA_DEVICE_ORDER=PCI_BUS_ID` to avoid unexpected behavior.
(VllmWorkerProcess pid=191902) INFO 04-17 02:04:49 [multiproc_worker_utils.py:225] Worker ready; awaiting tasks
(VllmWorkerProcess pid=191902) INFO 04-17 02:04:50 [cuda.py:292] Using Flash Attention backend.
(VllmWorkerProcess pid=191902) INFO 04-17 02:04:52 [utils.py:993] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=191902) INFO 04-17 02:04:52 [pynccl.py:69] vLLM is using nccl==2.21.5
INFO 04-17 02:04:52 [utils.py:993] Found nccl from library libnccl.so.2
INFO 04-17 02:04:52 [pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorkerProcess pid=191902) INFO 04-17 02:04:53 [custom_all_reduce_utils.py:244] reading GPU P2P access cache from /home/lc/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
INFO 04-17 02:04:53 [custom_all_reduce_utils.py:244] reading GPU P2P access cache from /home/lc/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
INFO 04-17 02:04:53 [shm_broadcast.py:264] vLLM message queue communication handle: Handle(local_reader_ranks=[1], buffer_handle=(1, 4194304, 6, 'psm_838735c5'), local_subscribe_addr='ipc:///tmp/f2503ef8-ae25-4ae3-945c-1f520c747109', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 04-17 02:04:53 [parallel_state.py:959] rank 0 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 0
(VllmWorkerProcess pid=191902) INFO 04-17 02:04:53 [parallel_state.py:959] rank 1 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 1
INFO 04-17 02:04:53 [model_runner.py:1110] Starting to load model /home/lc/work/models/GLM-4-32B-0414...
(VllmWorkerProcess pid=191902) INFO 04-17 02:04:53 [model_runner.py:1110] Starting to load model /home/lc/work/models/GLM-4-32B-0414...
Loading safetensors checkpoint shards:   0% Completed | 0/14 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:   7% Completed | 1/14 [00:00<00:12,  1.08it/s]
Loading safetensors checkpoint shards:  14% Completed | 2/14 [00:01<00:11,  1.04it/s]
Loading safetensors checkpoint shards:  21% Completed | 3/14 [00:02<00:07,  1.41it/s]
Loading safetensors checkpoint shards:  29% Completed | 4/14 [00:03<00:08,  1.24it/s]
Loading safetensors checkpoint shards:  36% Completed | 5/14 [00:04<00:07,  1.19it/s]
Loading safetensors checkpoint shards:  43% Completed | 6/14 [00:05<00:06,  1.16it/s]
Loading safetensors checkpoint shards:  50% Completed | 7/14 [00:06<00:06,  1.10it/s]
Loading safetensors checkpoint shards:  57% Completed | 8/14 [00:07<00:05,  1.06it/s]
Loading safetensors checkpoint shards:  64% Completed | 9/14 [00:08<00:04,  1.04it/s]
Loading safetensors checkpoint shards:  71% Completed | 10/14 [00:09<00:03,  1.06it/s]
Loading safetensors checkpoint shards:  79% Completed | 11/14 [00:09<00:02,  1.10it/s]
Loading safetensors checkpoint shards:  86% Completed | 12/14 [00:10<00:01,  1.08it/s]
Loading safetensors checkpoint shards:  93% Completed | 13/14 [00:11<00:00,  1.06it/s]
Loading safetensors checkpoint shards: 100% Completed | 14/14 [00:12<00:00,  1.04it/s]
Loading safetensors checkpoint shards: 100% Completed | 14/14 [00:12<00:00,  1.09it/s]

INFO 04-17 02:05:06 [loader.py:458] Loading weights took 12.85 seconds
(VllmWorkerProcess pid=191902) INFO 04-17 02:05:06 [loader.py:458] Loading weights took 12.86 seconds
INFO 04-17 02:05:06 [model_runner.py:1146] Model loading took 30.4522 GiB and 13.050829 seconds
(VllmWorkerProcess pid=191902) INFO 04-17 02:05:06 [model_runner.py:1146] Model loading took 30.4522 GiB and 13.056882 seconds
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] Exception in worker VllmWorkerProcess while processing method determine_num_available_blocks.
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] Traceback (most recent call last):
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/executor/multiproc_worker_utils.py", line 232, in _run_worker_process
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]     output = run_method(worker, method, args, kwargs)
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/utils.py", line 2378, in run_method
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]     return func(*args, **kwargs)
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]            ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]     return func(*args, **kwargs)
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]            ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/worker/worker.py", line 229, in determine_num_available_blocks
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]     self.model_runner.profile_run()
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]     return func(*args, **kwargs)
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]            ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 1243, in profile_run
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]     self._dummy_run(max_num_batched_tokens, max_num_seqs)
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 1369, in _dummy_run
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]     self.execute_model(model_input, kv_caches, intermediate_tensors)
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]     return func(*args, **kwargs)
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]            ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 1770, in execute_model
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]     hidden_or_intermediate_states = model_executable(
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]                                     ^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/model_executor/models/glm4.py", line 285, in forward
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]     hidden_states = self.model(input_ids, positions, intermediate_tensors,
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/compilation/decorators.py", line 172, in __call__
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]     return self.forward(*args, **kwargs)
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/model_executor/models/llama.py", line 360, in forward
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]     hidden_states, residual = layer(positions, hidden_states, residual)
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/model_executor/models/glm4.py", line 204, in forward
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]     hidden_states = self.mlp(hidden_states)
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]                     ^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/model_executor/models/llama.py", line 92, in forward
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]     x, _ = self.gate_up_proj(x)
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]            ^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/model_executor/layers/linear.py", line 474, in forward
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]     output_parallel = self.quant_method.apply(self, input_, bias)
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/model_executor/layers/linear.py", line 191, in apply
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]     return F.linear(x, layer.weight, bias)
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] TypeError: linear(): argument 'input' (position 1) must be Tensor, not tuple
ERROR 04-17 02:05:08 [engine.py:448] linear(): argument 'input' (position 1) must be Tensor, not tuple
ERROR 04-17 02:05:08 [engine.py:448] Traceback (most recent call last):
ERROR 04-17 02:05:08 [engine.py:448]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 436, in run_mp_engine
ERROR 04-17 02:05:08 [engine.py:448]     engine = MQLLMEngine.from_vllm_config(
ERROR 04-17 02:05:08 [engine.py:448]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 128, in from_vllm_config
ERROR 04-17 02:05:08 [engine.py:448]     return cls(
ERROR 04-17 02:05:08 [engine.py:448]            ^^^^
ERROR 04-17 02:05:08 [engine.py:448]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 82, in __init__
ERROR 04-17 02:05:08 [engine.py:448]     self.engine = LLMEngine(*args, **kwargs)
ERROR 04-17 02:05:08 [engine.py:448]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 285, in __init__
ERROR 04-17 02:05:08 [engine.py:448]     self._initialize_kv_caches()
ERROR 04-17 02:05:08 [engine.py:448]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 434, in _initialize_kv_caches
ERROR 04-17 02:05:08 [engine.py:448]     self.model_executor.determine_num_available_blocks())
ERROR 04-17 02:05:08 [engine.py:448]     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/executor/executor_base.py", line 103, in determine_num_available_blocks
ERROR 04-17 02:05:08 [engine.py:448]     results = self.collective_rpc("determine_num_available_blocks")
ERROR 04-17 02:05:08 [engine.py:448]               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/executor/executor_base.py", line 331, in collective_rpc
ERROR 04-17 02:05:08 [engine.py:448]     return self._run_workers(method, *args, **(kwargs or {}))
ERROR 04-17 02:05:08 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/executor/mp_distributed_executor.py", line 185, in _run_workers
ERROR 04-17 02:05:08 [engine.py:448]     driver_worker_output = run_method(self.driver_worker, sent_method,
ERROR 04-17 02:05:08 [engine.py:448]                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/utils.py", line 2378, in run_method
ERROR 04-17 02:05:08 [engine.py:448]     return func(*args, **kwargs)
ERROR 04-17 02:05:08 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 04-17 02:05:08 [engine.py:448]     return func(*args, **kwargs)
ERROR 04-17 02:05:08 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/worker/worker.py", line 229, in determine_num_available_blocks
ERROR 04-17 02:05:08 [engine.py:448]     self.model_runner.profile_run()
ERROR 04-17 02:05:08 [engine.py:448]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 04-17 02:05:08 [engine.py:448]     return func(*args, **kwargs)
ERROR 04-17 02:05:08 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 1243, in profile_run
ERROR 04-17 02:05:08 [engine.py:448]     self._dummy_run(max_num_batched_tokens, max_num_seqs)
ERROR 04-17 02:05:08 [engine.py:448]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 1369, in _dummy_run
ERROR 04-17 02:05:08 [engine.py:448]     self.execute_model(model_input, kv_caches, intermediate_tensors)
ERROR 04-17 02:05:08 [engine.py:448]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 04-17 02:05:08 [engine.py:448]     return func(*args, **kwargs)
ERROR 04-17 02:05:08 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 1770, in execute_model
ERROR 04-17 02:05:08 [engine.py:448]     hidden_or_intermediate_states = model_executable(
ERROR 04-17 02:05:08 [engine.py:448]                                     ^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
ERROR 04-17 02:05:08 [engine.py:448]     return self._call_impl(*args, **kwargs)
ERROR 04-17 02:05:08 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
ERROR 04-17 02:05:08 [engine.py:448]     return forward_call(*args, **kwargs)
ERROR 04-17 02:05:08 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/model_executor/models/glm4.py", line 285, in forward
ERROR 04-17 02:05:08 [engine.py:448]     hidden_states = self.model(input_ids, positions, intermediate_tensors,
ERROR 04-17 02:05:08 [engine.py:448]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/compilation/decorators.py", line 172, in __call__
ERROR 04-17 02:05:08 [engine.py:448]     return self.forward(*args, **kwargs)
ERROR 04-17 02:05:08 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/model_executor/models/llama.py", line 360, in forward
ERROR 04-17 02:05:08 [engine.py:448]     hidden_states, residual = layer(positions, hidden_states, residual)
ERROR 04-17 02:05:08 [engine.py:448]                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
ERROR 04-17 02:05:08 [engine.py:448]     return self._call_impl(*args, **kwargs)
ERROR 04-17 02:05:08 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
ERROR 04-17 02:05:08 [engine.py:448]     return forward_call(*args, **kwargs)
ERROR 04-17 02:05:08 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/model_executor/models/glm4.py", line 204, in forward
ERROR 04-17 02:05:08 [engine.py:448]     hidden_states = self.mlp(hidden_states)
ERROR 04-17 02:05:08 [engine.py:448]                     ^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
ERROR 04-17 02:05:08 [engine.py:448]     return self._call_impl(*args, **kwargs)
ERROR 04-17 02:05:08 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
ERROR 04-17 02:05:08 [engine.py:448]     return forward_call(*args, **kwargs)
ERROR 04-17 02:05:08 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/model_executor/models/llama.py", line 92, in forward
ERROR 04-17 02:05:08 [engine.py:448]     x, _ = self.gate_up_proj(x)
ERROR 04-17 02:05:08 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
ERROR 04-17 02:05:08 [engine.py:448]     return self._call_impl(*args, **kwargs)
ERROR 04-17 02:05:08 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
ERROR 04-17 02:05:08 [engine.py:448]     return forward_call(*args, **kwargs)
ERROR 04-17 02:05:08 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/model_executor/layers/linear.py", line 474, in forward
ERROR 04-17 02:05:08 [engine.py:448]     output_parallel = self.quant_method.apply(self, input_, bias)
ERROR 04-17 02:05:08 [engine.py:448]                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/model_executor/layers/linear.py", line 191, in apply
ERROR 04-17 02:05:08 [engine.py:448]     return F.linear(x, layer.weight, bias)
ERROR 04-17 02:05:08 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448] TypeError: linear(): argument 'input' (position 1) must be Tensor, not tuple
Traceback (most recent call last):
  File "/home/lc/anaconda3/envs/llm/bin/vllm", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/entrypoints/cli/main.py", line 51, in main
    args.dispatch_function(args)
  File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/entrypoints/cli/serve.py", line 27, in cmd
    uvloop.run(run_server(args))
  File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/uvloop/__init__.py", line 105, in run
    return runner.run(wrapper())
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/lc/anaconda3/envs/llm/lib/python3.11/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
  File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/uvloop/__init__.py", line 61, in wrapper
    return await main
           ^^^^^^^^^^
  File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 1069, in run_server
    async with build_async_engine_client(args) as engine_client:
  File "/home/lc/anaconda3/envs/llm/lib/python3.11/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 146, in build_async_engine_client
    async with build_async_engine_client_from_engine_args(
  File "/home/lc/anaconda3/envs/llm/lib/python3.11/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 269, in build_async_engine_client_from_engine_args
    raise RuntimeError(
RuntimeError: Engine process failed to start. See stack trace for the root cause.
/home/lc/anaconda3/envs/llm/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
/home/lc/anaconda3/envs/llm/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
(llm) (base) lc@ai-h100:~/work/vllm$ 

iwaitu avatar Apr 16 '25 18:04 iwaitu

(Quoted from the previous comment: the same vllm serve command and the same TypeError traceback as above.)
ERROR 04-17 02:05:08 [engine.py:448]     results = self.collective_rpc("determine_num_available_blocks")
ERROR 04-17 02:05:08 [engine.py:448]               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/executor/executor_base.py", line 331, in collective_rpc
ERROR 04-17 02:05:08 [engine.py:448]     return self._run_workers(method, *args, **(kwargs or {}))
ERROR 04-17 02:05:08 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/executor/mp_distributed_executor.py", line 185, in _run_workers
ERROR 04-17 02:05:08 [engine.py:448]     driver_worker_output = run_method(self.driver_worker, sent_method,
ERROR 04-17 02:05:08 [engine.py:448]                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/utils.py", line 2378, in run_method
ERROR 04-17 02:05:08 [engine.py:448]     return func(*args, **kwargs)
ERROR 04-17 02:05:08 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 04-17 02:05:08 [engine.py:448]     return func(*args, **kwargs)
ERROR 04-17 02:05:08 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/worker/worker.py", line 229, in determine_num_available_blocks
ERROR 04-17 02:05:08 [engine.py:448]     self.model_runner.profile_run()
ERROR 04-17 02:05:08 [engine.py:448]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 04-17 02:05:08 [engine.py:448]     return func(*args, **kwargs)
ERROR 04-17 02:05:08 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 1243, in profile_run
ERROR 04-17 02:05:08 [engine.py:448]     self._dummy_run(max_num_batched_tokens, max_num_seqs)
ERROR 04-17 02:05:08 [engine.py:448]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 1369, in _dummy_run
ERROR 04-17 02:05:08 [engine.py:448]     self.execute_model(model_input, kv_caches, intermediate_tensors)
ERROR 04-17 02:05:08 [engine.py:448]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 04-17 02:05:08 [engine.py:448]     return func(*args, **kwargs)
ERROR 04-17 02:05:08 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 1770, in execute_model
ERROR 04-17 02:05:08 [engine.py:448]     hidden_or_intermediate_states = model_executable(
ERROR 04-17 02:05:08 [engine.py:448]                                     ^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
ERROR 04-17 02:05:08 [engine.py:448]     return self._call_impl(*args, **kwargs)
ERROR 04-17 02:05:08 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
ERROR 04-17 02:05:08 [engine.py:448]     return forward_call(*args, **kwargs)
ERROR 04-17 02:05:08 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/model_executor/models/glm4.py", line 285, in forward
ERROR 04-17 02:05:08 [engine.py:448]     hidden_states = self.model(input_ids, positions, intermediate_tensors,
ERROR 04-17 02:05:08 [engine.py:448]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/compilation/decorators.py", line 172, in __call__
ERROR 04-17 02:05:08 [engine.py:448]     return self.forward(*args, **kwargs)
ERROR 04-17 02:05:08 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/model_executor/models/llama.py", line 360, in forward
ERROR 04-17 02:05:08 [engine.py:448]     hidden_states, residual = layer(positions, hidden_states, residual)
ERROR 04-17 02:05:08 [engine.py:448]                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
ERROR 04-17 02:05:08 [engine.py:448]     return self._call_impl(*args, **kwargs)
ERROR 04-17 02:05:08 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
ERROR 04-17 02:05:08 [engine.py:448]     return forward_call(*args, **kwargs)
ERROR 04-17 02:05:08 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/model_executor/models/glm4.py", line 204, in forward
ERROR 04-17 02:05:08 [engine.py:448]     hidden_states = self.mlp(hidden_states)
ERROR 04-17 02:05:08 [engine.py:448]                     ^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
ERROR 04-17 02:05:08 [engine.py:448]     return self._call_impl(*args, **kwargs)
ERROR 04-17 02:05:08 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
ERROR 04-17 02:05:08 [engine.py:448]     return forward_call(*args, **kwargs)
ERROR 04-17 02:05:08 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/model_executor/models/llama.py", line 92, in forward
ERROR 04-17 02:05:08 [engine.py:448]     x, _ = self.gate_up_proj(x)
ERROR 04-17 02:05:08 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
ERROR 04-17 02:05:08 [engine.py:448]     return self._call_impl(*args, **kwargs)
ERROR 04-17 02:05:08 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
ERROR 04-17 02:05:08 [engine.py:448]     return forward_call(*args, **kwargs)
ERROR 04-17 02:05:08 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/model_executor/layers/linear.py", line 474, in forward
ERROR 04-17 02:05:08 [engine.py:448]     output_parallel = self.quant_method.apply(self, input_, bias)
ERROR 04-17 02:05:08 [engine.py:448]                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/model_executor/layers/linear.py", line 191, in apply
ERROR 04-17 02:05:08 [engine.py:448]     return F.linear(x, layer.weight, bias)
ERROR 04-17 02:05:08 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448] TypeError: linear(): argument 'input' (position 1) must be Tensor, not tuple
Traceback (most recent call last):
  File "/home/lc/anaconda3/envs/llm/bin/vllm", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/entrypoints/cli/main.py", line 51, in main
    args.dispatch_function(args)
  File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/entrypoints/cli/serve.py", line 27, in cmd
    uvloop.run(run_server(args))
  File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/uvloop/__init__.py", line 105, in run
    return runner.run(wrapper())
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/lc/anaconda3/envs/llm/lib/python3.11/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
  File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/uvloop/__init__.py", line 61, in wrapper
    return await main
           ^^^^^^^^^^
  File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 1069, in run_server
    async with build_async_engine_client(args) as engine_client:
  File "/home/lc/anaconda3/envs/llm/lib/python3.11/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 146, in build_async_engine_client
    async with build_async_engine_client_from_engine_args(
  File "/home/lc/anaconda3/envs/llm/lib/python3.11/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 269, in build_async_engine_client_from_engine_args
    raise RuntimeError(
RuntimeError: Engine process failed to start. See stack trace for the root cause.
/home/lc/anaconda3/envs/llm/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
/home/lc/anaconda3/envs/llm/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
(llm) (base) lc@ai-h100:~/work/vllm$ 

Don't upgrade vLLM to 0.8.4, it has a bug; only upgrade Transformers.

gaoming1227 avatar Apr 17 '25 00:04 gaoming1227

pip uninstall vllm
git clone https://github.com/vllm-project/vllm.git
cd vllm
git fetch origin pull/16618/head:pr-16618
VLLM_USE_PRECOMPILED=1 pip install --editable .

After uninstalling the previous version and rebuilding with PR #16618, it still fails:

(llm) (base) lc@ai-h100:~/work$ CUDA_VISIBLE_DEVICES=0,1 VLLM_USE_V1=0 vllm serve /home/lc/work/models/GLM-4-32B-0414 \
  --port 8000 \
  --trust-remote-code \
  --max-model-len 32768 \
  --tensor-parallel-size 2 \
  --gpu_memory_utilization 0.8 \
  --served-model-name "glm4" \
  --enable-auto-tool-choice \
  --tool-call-parser pythonic \
  --trust-remote-code 
INFO 04-17 20:51:43 [__init__.py:239] Automatically detected platform cuda.
WARNING 04-17 20:51:44 [cuda.py:413] Detected different devices in the system: NVIDIA H100 80GB HBM3, NVIDIA H100. Please make sure to set `CUDA_DEVICE_ORDER=PCI_BUS_ID` to avoid unexpected behavior.
INFO 04-17 20:51:45 [api_server.py:1034] vLLM API server version 0.8.3rc2.dev107+gdb95cbc1e.d20250417
INFO 04-17 20:51:45 [api_server.py:1035] args: Namespace(subparser='serve', model_tag='/home/lc/work/models/GLM-4-32B-0414', config='', host=None, port=8000, uvicorn_log_level='info', disable_uvicorn_access_log=False, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, enable_ssl_refresh=False, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=True, tool_call_parser='pythonic', tool_parser_plugin='', model='/home/lc/work/models/GLM-4-32B-0414', task='auto', tokenizer=None, hf_config_path=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', max_model_len=32768, guided_decoding_backend='xgrammar', logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=2, data_parallel_size=1, enable_expert_parallel=False, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=None, prefix_caching_hash_algo='builtin', disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=None, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.8, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_token=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, use_tqdm_on_load=True, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_config=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=['glm4'], qlora_adapter_name_or_path=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', worker_extension_cls='', generation_config='auto', override_generation_config=None, enable_sleep_mode=False, calculate_kv_scales=False, additional_config=None, enable_reasoning=False, reasoning_parser=None, disable_cascade_attn=False, disable_chunked_mm_input=False, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, 
enable_server_load_tracking=False, dispatch_function=<function ServeSubcommand.cmd at 0x7061827b4ea0>)
INFO 04-17 20:51:53 [config.py:604] This model supports multiple tasks: {'generate', 'classify', 'reward', 'score', 'embed'}. Defaulting to 'generate'.
INFO 04-17 20:51:53 [config.py:1609] Defaulting to use mp for distributed inference
INFO 04-17 20:51:53 [api_server.py:246] Started engine process with PID 202583
INFO 04-17 20:51:57 [__init__.py:239] Automatically detected platform cuda.
WARNING 04-17 20:51:57 [cuda.py:413] Detected different devices in the system: NVIDIA H100 80GB HBM3, NVIDIA H100. Please make sure to set `CUDA_DEVICE_ORDER=PCI_BUS_ID` to avoid unexpected behavior.
INFO 04-17 20:51:58 [llm_engine.py:243] Initializing a V0 LLM engine (v0.8.3rc2.dev107+gdb95cbc1e.d20250417) with config: model='/home/lc/work/models/GLM-4-32B-0414', speculative_config=None, tokenizer='/home/lc/work/models/GLM-4-32B-0414', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=glm4, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=True, 
WARNING 04-17 20:51:59 [multiproc_worker_utils.py:306] Reducing Torch parallelism from 64 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 04-17 20:52:00 [cuda.py:292] Using Flash Attention backend.
INFO 04-17 20:52:03 [__init__.py:239] Automatically detected platform cuda.
WARNING 04-17 20:52:03 [cuda.py:413] Detected different devices in the system: NVIDIA H100 80GB HBM3, NVIDIA H100. Please make sure to set `CUDA_DEVICE_ORDER=PCI_BUS_ID` to avoid unexpected behavior.
(VllmWorkerProcess pid=202747) INFO 04-17 20:52:04 [multiproc_worker_utils.py:225] Worker ready; awaiting tasks
(VllmWorkerProcess pid=202747) INFO 04-17 20:52:06 [cuda.py:292] Using Flash Attention backend.
(VllmWorkerProcess pid=202747) INFO 04-17 20:52:07 [utils.py:990] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=202747) INFO 04-17 20:52:07 [pynccl.py:69] vLLM is using nccl==2.21.5
INFO 04-17 20:52:07 [utils.py:990] Found nccl from library libnccl.so.2
INFO 04-17 20:52:07 [pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorkerProcess pid=202747) INFO 04-17 20:52:08 [custom_all_reduce_utils.py:244] reading GPU P2P access cache from /home/lc/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
INFO 04-17 20:52:08 [custom_all_reduce_utils.py:244] reading GPU P2P access cache from /home/lc/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
INFO 04-17 20:52:08 [shm_broadcast.py:264] vLLM message queue communication handle: Handle(local_reader_ranks=[1], buffer_handle=(1, 4194304, 6, 'psm_9ca69c7f'), local_subscribe_addr='ipc:///tmp/169495f2-aed7-4b97-ac35-75ab01eafaf5', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorkerProcess pid=202747) INFO 04-17 20:52:08 [parallel_state.py:957] rank 1 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 1
INFO 04-17 20:52:08 [parallel_state.py:957] rank 0 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 04-17 20:52:08 [model_runner.py:1110] Starting to load model /home/lc/work/models/GLM-4-32B-0414...
(VllmWorkerProcess pid=202747) INFO 04-17 20:52:08 [model_runner.py:1110] Starting to load model /home/lc/work/models/GLM-4-32B-0414...
Loading safetensors checkpoint shards:   0% Completed | 0/14 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:   7% Completed | 1/14 [00:00<00:11,  1.11it/s]
Loading safetensors checkpoint shards:  14% Completed | 2/14 [00:01<00:11,  1.06it/s]
Loading safetensors checkpoint shards:  21% Completed | 3/14 [00:02<00:07,  1.43it/s]
Loading safetensors checkpoint shards:  29% Completed | 4/14 [00:03<00:07,  1.25it/s]
Loading safetensors checkpoint shards:  36% Completed | 5/14 [00:04<00:07,  1.20it/s]
Loading safetensors checkpoint shards:  43% Completed | 6/14 [00:05<00:06,  1.17it/s]
Loading safetensors checkpoint shards:  50% Completed | 7/14 [00:05<00:06,  1.12it/s]
Loading safetensors checkpoint shards:  57% Completed | 8/14 [00:06<00:05,  1.08it/s]
Loading safetensors checkpoint shards:  64% Completed | 9/14 [00:07<00:04,  1.05it/s]
Loading safetensors checkpoint shards:  71% Completed | 10/14 [00:08<00:03,  1.07it/s]
Loading safetensors checkpoint shards:  79% Completed | 11/14 [00:09<00:02,  1.11it/s]
Loading safetensors checkpoint shards:  86% Completed | 12/14 [00:10<00:01,  1.09it/s]
Loading safetensors checkpoint shards:  93% Completed | 13/14 [00:11<00:00,  1.08it/s]
Loading safetensors checkpoint shards: 100% Completed | 14/14 [00:12<00:00,  1.06it/s]
Loading safetensors checkpoint shards: 100% Completed | 14/14 [00:12<00:00,  1.11it/s]

INFO 04-17 20:52:21 [loader.py:458] Loading weights took 12.65 seconds
(VllmWorkerProcess pid=202747) INFO 04-17 20:52:21 [loader.py:458] Loading weights took 12.66 seconds
INFO 04-17 20:52:21 [model_runner.py:1146] Model loading took 30.4522 GiB and 12.851758 seconds
(VllmWorkerProcess pid=202747) INFO 04-17 20:52:21 [model_runner.py:1146] Model loading took 30.4522 GiB and 12.857013 seconds
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238] Exception in worker VllmWorkerProcess while processing method determine_num_available_blocks.
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238] Traceback (most recent call last):
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238]   File "/home/lc/work/vllm/vllm/executor/multiproc_worker_utils.py", line 232, in _run_worker_process
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238]     output = run_method(worker, method, args, kwargs)
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238]   File "/home/lc/work/vllm/vllm/utils.py", line 2363, in run_method
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238]     return func(*args, **kwargs)
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238]            ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238]     return func(*args, **kwargs)
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238]            ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238]   File "/home/lc/work/vllm/vllm/worker/worker.py", line 229, in determine_num_available_blocks
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238]     self.model_runner.profile_run()
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238]     return func(*args, **kwargs)
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238]            ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238]   File "/home/lc/work/vllm/vllm/worker/model_runner.py", line 1243, in profile_run
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238]     self._dummy_run(max_num_batched_tokens, max_num_seqs)
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238]   File "/home/lc/work/vllm/vllm/worker/model_runner.py", line 1369, in _dummy_run
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238]     self.execute_model(model_input, kv_caches, intermediate_tensors)
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238]     return func(*args, **kwargs)
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238]            ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238]   File "/home/lc/work/vllm/vllm/worker/model_runner.py", line 1770, in execute_model
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238]     hidden_or_intermediate_states = model_executable(
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238]                                     ^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238]   File "/home/lc/work/vllm/vllm/model_executor/models/glm4.py", line 285, in forward
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238]     hidden_states = self.model(input_ids, positions, intermediate_tensors,
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238]   File "/home/lc/work/vllm/vllm/compilation/decorators.py", line 172, in __call__
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238]     return self.forward(*args, **kwargs)
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238]   File "/home/lc/work/vllm/vllm/model_executor/models/llama.py", line 369, in forward
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238]     hidden_states, residual = layer(positions, hidden_states, residual)
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238]                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238]   File "/home/lc/work/vllm/vllm/model_executor/models/glm4.py", line 204, in forward
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238]     hidden_states = self.mlp(hidden_states)
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238]                     ^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238]   File "/home/lc/work/vllm/vllm/model_executor/models/llama.py", line 92, in forward
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238]     x_out = self.gate_up_proj(x)
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238]             ^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238]   File "/home/lc/work/vllm/vllm/model_executor/layers/linear.py", line 474, in forward
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238]     output_parallel = self.quant_method.apply(self, input_, bias)
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238]                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238]   File "/home/lc/work/vllm/vllm/model_executor/layers/linear.py", line 191, in apply
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238]     return F.linear(x, layer.weight, bias)
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238] TypeError: linear(): argument 'input' (position 1) must be Tensor, not tuple
ERROR 04-17 20:52:23 [engine.py:448] linear(): argument 'input' (position 1) must be Tensor, not tuple
ERROR 04-17 20:52:23 [engine.py:448] Traceback (most recent call last):
ERROR 04-17 20:52:23 [engine.py:448]   File "/home/lc/work/vllm/vllm/engine/multiprocessing/engine.py", line 436, in run_mp_engine
ERROR 04-17 20:52:23 [engine.py:448]     engine = MQLLMEngine.from_vllm_config(
ERROR 04-17 20:52:23 [engine.py:448]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 20:52:23 [engine.py:448]   File "/home/lc/work/vllm/vllm/engine/multiprocessing/engine.py", line 128, in from_vllm_config
ERROR 04-17 20:52:23 [engine.py:448]     return cls(
ERROR 04-17 20:52:23 [engine.py:448]            ^^^^
ERROR 04-17 20:52:23 [engine.py:448]   File "/home/lc/work/vllm/vllm/engine/multiprocessing/engine.py", line 82, in __init__
ERROR 04-17 20:52:23 [engine.py:448]     self.engine = LLMEngine(*args, **kwargs)
ERROR 04-17 20:52:23 [engine.py:448]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 20:52:23 [engine.py:448]   File "/home/lc/work/vllm/vllm/engine/llm_engine.py", line 285, in __init__
ERROR 04-17 20:52:23 [engine.py:448]     self._initialize_kv_caches()
ERROR 04-17 20:52:23 [engine.py:448]   File "/home/lc/work/vllm/vllm/engine/llm_engine.py", line 434, in _initialize_kv_caches
ERROR 04-17 20:52:23 [engine.py:448]     self.model_executor.determine_num_available_blocks())
ERROR 04-17 20:52:23 [engine.py:448]     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 20:52:23 [engine.py:448]   File "/home/lc/work/vllm/vllm/executor/executor_base.py", line 103, in determine_num_available_blocks
ERROR 04-17 20:52:23 [engine.py:448]     results = self.collective_rpc("determine_num_available_blocks")
ERROR 04-17 20:52:23 [engine.py:448]               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 20:52:23 [engine.py:448]   File "/home/lc/work/vllm/vllm/executor/executor_base.py", line 331, in collective_rpc
ERROR 04-17 20:52:23 [engine.py:448]     return self._run_workers(method, *args, **(kwargs or {}))
ERROR 04-17 20:52:23 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 20:52:23 [engine.py:448]   File "/home/lc/work/vllm/vllm/executor/mp_distributed_executor.py", line 185, in _run_workers
ERROR 04-17 20:52:23 [engine.py:448]     driver_worker_output = run_method(self.driver_worker, sent_method,
ERROR 04-17 20:52:23 [engine.py:448]                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 20:52:23 [engine.py:448]   File "/home/lc/work/vllm/vllm/utils.py", line 2363, in run_method
ERROR 04-17 20:52:23 [engine.py:448]     return func(*args, **kwargs)
ERROR 04-17 20:52:23 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 20:52:23 [engine.py:448]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 04-17 20:52:23 [engine.py:448]     return func(*args, **kwargs)
ERROR 04-17 20:52:23 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 20:52:23 [engine.py:448]   File "/home/lc/work/vllm/vllm/worker/worker.py", line 229, in determine_num_available_blocks
ERROR 04-17 20:52:23 [engine.py:448]     self.model_runner.profile_run()
ERROR 04-17 20:52:23 [engine.py:448]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 04-17 20:52:23 [engine.py:448]     return func(*args, **kwargs)
ERROR 04-17 20:52:23 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 20:52:23 [engine.py:448]   File "/home/lc/work/vllm/vllm/worker/model_runner.py", line 1243, in profile_run
ERROR 04-17 20:52:23 [engine.py:448]     self._dummy_run(max_num_batched_tokens, max_num_seqs)
ERROR 04-17 20:52:23 [engine.py:448]   File "/home/lc/work/vllm/vllm/worker/model_runner.py", line 1369, in _dummy_run
ERROR 04-17 20:52:23 [engine.py:448]     self.execute_model(model_input, kv_caches, intermediate_tensors)
ERROR 04-17 20:52:23 [engine.py:448]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 04-17 20:52:23 [engine.py:448]     return func(*args, **kwargs)
ERROR 04-17 20:52:23 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 20:52:23 [engine.py:448]   File "/home/lc/work/vllm/vllm/worker/model_runner.py", line 1770, in execute_model
ERROR 04-17 20:52:23 [engine.py:448]     hidden_or_intermediate_states = model_executable(
ERROR 04-17 20:52:23 [engine.py:448]                                     ^^^^^^^^^^^^^^^^^
ERROR 04-17 20:52:23 [engine.py:448]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
ERROR 04-17 20:52:23 [engine.py:448]     return self._call_impl(*args, **kwargs)
ERROR 04-17 20:52:23 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 20:52:23 [engine.py:448]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
ERROR 04-17 20:52:23 [engine.py:448]     return forward_call(*args, **kwargs)
ERROR 04-17 20:52:23 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 20:52:23 [engine.py:448]   File "/home/lc/work/vllm/vllm/model_executor/models/glm4.py", line 285, in forward
ERROR 04-17 20:52:23 [engine.py:448]     hidden_states = self.model(input_ids, positions, intermediate_tensors,
ERROR 04-17 20:52:23 [engine.py:448]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 20:52:23 [engine.py:448]   File "/home/lc/work/vllm/vllm/compilation/decorators.py", line 172, in __call__
ERROR 04-17 20:52:23 [engine.py:448]     return self.forward(*args, **kwargs)
ERROR 04-17 20:52:23 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 20:52:23 [engine.py:448]   File "/home/lc/work/vllm/vllm/model_executor/models/llama.py", line 369, in forward
ERROR 04-17 20:52:23 [engine.py:448]     hidden_states, residual = layer(positions, hidden_states, residual)
ERROR 04-17 20:52:23 [engine.py:448]                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 20:52:23 [engine.py:448]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
ERROR 04-17 20:52:23 [engine.py:448]     return self._call_impl(*args, **kwargs)
ERROR 04-17 20:52:23 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 20:52:23 [engine.py:448]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
ERROR 04-17 20:52:23 [engine.py:448]     return forward_call(*args, **kwargs)
ERROR 04-17 20:52:23 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 20:52:23 [engine.py:448]   File "/home/lc/work/vllm/vllm/model_executor/models/glm4.py", line 204, in forward
ERROR 04-17 20:52:23 [engine.py:448]     hidden_states = self.mlp(hidden_states)
ERROR 04-17 20:52:23 [engine.py:448]                     ^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 20:52:23 [engine.py:448]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
ERROR 04-17 20:52:23 [engine.py:448]     return self._call_impl(*args, **kwargs)
ERROR 04-17 20:52:23 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 20:52:23 [engine.py:448]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
ERROR 04-17 20:52:23 [engine.py:448]     return forward_call(*args, **kwargs)
ERROR 04-17 20:52:23 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 20:52:23 [engine.py:448]   File "/home/lc/work/vllm/vllm/model_executor/models/llama.py", line 92, in forward
ERROR 04-17 20:52:23 [engine.py:448]     x_out = self.gate_up_proj(x)
ERROR 04-17 20:52:23 [engine.py:448]             ^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 20:52:23 [engine.py:448]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
ERROR 04-17 20:52:23 [engine.py:448]     return self._call_impl(*args, **kwargs)
ERROR 04-17 20:52:23 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 20:52:23 [engine.py:448]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
ERROR 04-17 20:52:23 [engine.py:448]     return forward_call(*args, **kwargs)
ERROR 04-17 20:52:23 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 20:52:23 [engine.py:448]   File "/home/lc/work/vllm/vllm/model_executor/layers/linear.py", line 474, in forward
ERROR 04-17 20:52:23 [engine.py:448]     output_parallel = self.quant_method.apply(self, input_, bias)
ERROR 04-17 20:52:23 [engine.py:448]                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 20:52:23 [engine.py:448]   File "/home/lc/work/vllm/vllm/model_executor/layers/linear.py", line 191, in apply
ERROR 04-17 20:52:23 [engine.py:448]     return F.linear(x, layer.weight, bias)
ERROR 04-17 20:52:23 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 20:52:23 [engine.py:448] TypeError: linear(): argument 'input' (position 1) must be Tensor, not tuple
INFO 04-17 20:52:23 [multiproc_worker_utils.py:124] Killing local vLLM worker processes
Process SpawnProcess-1:
Traceback (most recent call last):
  File "/home/lc/anaconda3/envs/llm/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/home/lc/anaconda3/envs/llm/lib/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/lc/work/vllm/vllm/engine/multiprocessing/engine.py", line 450, in run_mp_engine
    raise e
  File "/home/lc/work/vllm/vllm/engine/multiprocessing/engine.py", line 436, in run_mp_engine
    engine = MQLLMEngine.from_vllm_config(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lc/work/vllm/vllm/engine/multiprocessing/engine.py", line 128, in from_vllm_config
    return cls(
           ^^^^
  File "/home/lc/work/vllm/vllm/engine/multiprocessing/engine.py", line 82, in __init__
    self.engine = LLMEngine(*args, **kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lc/work/vllm/vllm/engine/llm_engine.py", line 285, in __init__
    self._initialize_kv_caches()
  File "/home/lc/work/vllm/vllm/engine/llm_engine.py", line 434, in _initialize_kv_caches
    self.model_executor.determine_num_available_blocks())
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lc/work/vllm/vllm/executor/executor_base.py", line 103, in determine_num_available_blocks
    results = self.collective_rpc("determine_num_available_blocks")
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lc/work/vllm/vllm/executor/executor_base.py", line 331, in collective_rpc
    return self._run_workers(method, *args, **(kwargs or {}))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lc/work/vllm/vllm/executor/mp_distributed_executor.py", line 185, in _run_workers
    driver_worker_output = run_method(self.driver_worker, sent_method,
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lc/work/vllm/vllm/utils.py", line 2363, in run_method
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/lc/work/vllm/vllm/worker/worker.py", line 229, in determine_num_available_blocks
    self.model_runner.profile_run()
  File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/lc/work/vllm/vllm/worker/model_runner.py", line 1243, in profile_run
    self._dummy_run(max_num_batched_tokens, max_num_seqs)
  File "/home/lc/work/vllm/vllm/worker/model_runner.py", line 1369, in _dummy_run
    self.execute_model(model_input, kv_caches, intermediate_tensors)
  File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/lc/work/vllm/vllm/worker/model_runner.py", line 1770, in execute_model
    hidden_or_intermediate_states = model_executable(
                                    ^^^^^^^^^^^^^^^^^
  File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lc/work/vllm/vllm/model_executor/models/glm4.py", line 285, in forward
    hidden_states = self.model(input_ids, positions, intermediate_tensors,
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lc/work/vllm/vllm/compilation/decorators.py", line 172, in __call__
    return self.forward(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lc/work/vllm/vllm/model_executor/models/llama.py", line 369, in forward
    hidden_states, residual = layer(positions, hidden_states, residual)
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lc/work/vllm/vllm/model_executor/models/glm4.py", line 204, in forward
    hidden_states = self.mlp(hidden_states)
                    ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lc/work/vllm/vllm/model_executor/models/llama.py", line 92, in forward
    x_out = self.gate_up_proj(x)
            ^^^^^^^^^^^^^^^^^^^^
  File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lc/work/vllm/vllm/model_executor/layers/linear.py", line 474, in forward
    output_parallel = self.quant_method.apply(self, input_, bias)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lc/work/vllm/vllm/model_executor/layers/linear.py", line 191, in apply
    return F.linear(x, layer.weight, bias)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: linear(): argument 'input' (position 1) must be Tensor, not tuple
Traceback (most recent call last):
  File "/home/lc/anaconda3/envs/llm/bin/vllm", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/lc/work/vllm/vllm/entrypoints/cli/main.py", line 51, in main
    args.dispatch_function(args)
  File "/home/lc/work/vllm/vllm/entrypoints/cli/serve.py", line 27, in cmd
    uvloop.run(run_server(args))
  File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/uvloop/__init__.py", line 105, in run
    return runner.run(wrapper())
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/lc/anaconda3/envs/llm/lib/python3.11/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
  File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/uvloop/__init__.py", line 61, in wrapper
    return await main
           ^^^^^^^^^^
  File "/home/lc/work/vllm/vllm/entrypoints/openai/api_server.py", line 1069, in run_server
    async with build_async_engine_client(args) as engine_client:
  File "/home/lc/anaconda3/envs/llm/lib/python3.11/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/lc/work/vllm/vllm/entrypoints/openai/api_server.py", line 146, in build_async_engine_client
    async with build_async_engine_client_from_engine_args(
  File "/home/lc/anaconda3/envs/llm/lib/python3.11/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/lc/work/vllm/vllm/entrypoints/openai/api_server.py", line 269, in build_async_engine_client_from_engine_args
    raise RuntimeError(
RuntimeError: Engine process failed to start. See stack trace for the root cause.
/home/lc/anaconda3/envs/llm/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
/home/lc/anaconda3/envs/llm/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

iwaitu avatar Apr 17 '25 12:04 iwaitu

Using 0.8.4 is not recommended.

Image The problem is still at this point; I'm using 0.8.3.

gaoming1227 avatar Apr 18 '25 06:04 gaoming1227

You need to build and install vLLM from source for it to work; last night's PR fixed it.

zRzRzRzRzRzRzR avatar Apr 18 '25 08:04 zRzRzRzRzRzRzR

The problem is now partially solved. When calling with the openai library: response = await self.client.chat.completions.create( model="gpt-3.5-turbo", messages=messages, tool_choice="required", tools=available_tools, temperature=0 ). Adding tool_choice="required" means at least one tool must be selected, so tool_calls now has a value.

Hi, with this change tool_calls does have values, but finish_reason is still "stop". Have you noticed this as well?
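
For reference, here is a minimal, untested sketch of that workaround against a vLLM OpenAI-compatible endpoint. It assumes the server from this thread (port 8000, served model name "glm4"); the base_url and the get_weather tool are placeholders for illustration. It reads message.tool_calls directly instead of relying on finish_reason:

# Sketch of the tool_choice="required" workaround discussed above.
# Assumes a vLLM OpenAI-compatible server on localhost:8000 serving "glm4".
# The get_weather tool is a hypothetical placeholder, not a real API from this thread.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool for illustration
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string", "description": "City name"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="glm4",
    messages=[{"role": "user", "content": "What's the weather in Hangzhou?"}],
    tools=tools,
    tool_choice="required",  # the workaround: force at least one tool call
    temperature=0,
)

message = response.choices[0].message
# finish_reason may still come back as "stop" even when a tool was selected,
# so check tool_calls directly rather than the finish reason.
if message.tool_calls:
    for call in message.tool_calls:
        print(call.function.name, call.function.arguments)
else:
    print(message.content)

With the default tool_choice="auto", the thread reports tool_calls coming back empty, which is exactly the behaviour under discussion here.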

zh-211 avatar Apr 23 '25 03:04 zh-211

This is configured on the MCP client side when calling the model. tool_choice="required" means the model must select one or more tools.

@zRzRzRzRzRzRzR Could you advise how this part can be solved? I verified it as well: with tool_choice changed to "required" the response comes back correctly, but ideally an agent is expected to decide on its own which function tools, if any, need to be executed.

Lbaiall avatar Apr 27 '25 14:04 Lbaiall

The problem is now partially solved. When calling with the openai library: response = await self.client.chat.completions.create( model="gpt-3.5-turbo", messages=messages, tool_choice="required", tools=available_tools, temperature=0 ). Adding tool_choice="required" means at least one tool must be selected, so tool_calls now has a value.

Is this a problem with the model itself, or an incompatibility in the framework?

Lbaiall avatar Apr 28 '25 02:04 Lbaiall

With vLLM, after adding the --tool-call-parser pythonic flag, I modified the prompt following https://github.com/vllm-project/vllm/blob/main/examples/tool_chat_template_llama3.2_pythonic.jinja. The returned tool calls should look like [func_name1(params_name1=params_value1, params_name2=params_value2...), func_name2(params)], because for the pythonic parser vLLM requires this format before anything is added to tool_calls. In https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/openai/tool_parsers/pythonic_tool_parser.py,

Image The output has to match this format before it gets into tool_calls. After modifying the prompt, the model still does not return output in this format. I'm not sure whether it is a prompt problem; it feels like one, and the official team probably needs to provide a standard prompt for this. That is my understanding.
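
To make the expected format concrete, here is a small sketch (not vLLM's actual parser code, and the tool names are made up) of how such a pythonic tool-call string can be interpreted once the model emits it in the right shape:

# Illustration of the pythonic tool-call format the parser expects:
# a Python list of call expressions. Tool names here are hypothetical.
import ast

model_output = '[get_weather(city="Hangzhou"), get_time(timezone="Asia/Shanghai")]'

tree = ast.parse(model_output, mode="eval")
calls = tree.body
assert isinstance(calls, ast.List), "the model must emit a Python list of call expressions"

for call in calls.elts:
    assert isinstance(call, ast.Call), "each list element must be a function call"
    name = call.func.id  # tool name, e.g. "get_weather"
    arguments = {kw.arg: ast.literal_eval(kw.value) for kw in call.keywords}
    print(name, arguments)  # e.g. get_weather {'city': 'Hangzhou'}

If the model's reply cannot be read as a list of call expressions like this, the parser has nothing to put into tool_calls, which matches the empty tool_calls described above.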

gaoming1227 avatar May 03 '25 11:05 gaoming1227