How to make API tool calls with GLM-4-32B-0414 deployed on vLLM
System Info / 系統信息
GLM-4-32B-0414 deployed with vLLM.
Request body:
```json
{
  "model": "GLM-4-32B-0414",
  "top_p": 0.1,
  "temperature": 0.01,
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "realtime_aqi",
        "description": "Weather forecast. Get real-time air quality: current AQI, PM2.5 and PM10.",
        "parameters": {
          "type": "object",
          "properties": {
            "city": { "description": "City name" }
          },
          "required": ["city"]
        }
      }
    }
  ],
  "messages": [
    { "role": "user", "content": "How's the weather in Hangzhou?" }
  ]
}
```
The response is:
```json
{
  "object": "error",
  "message": "Hermes 2 Pro Tool parser could not locate tool call start/end tokens in the tokenizer!",
  "type": "BadRequestError",
  "param": null,
  "code": 400
}
```
Is this a problem with how I'm making the request? Is there a reference for tool calling with vLLM?
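For reference, here is a minimal sketch of the same request sent through the openai Python client. The `base_url` and `api_key` are assumptions for a local vLLM OpenAI-compatible server, and an explicit `"type": "string"` is added for the `city` parameter, which the request body above omits:

```python
# Minimal sketch: send the tool-call request above via the openai client.
# Assumptions: vLLM's OpenAI-compatible server listens on localhost:8000
# and API-key checking is disabled (any placeholder string works).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

tools = [{
    "type": "function",
    "function": {
        "name": "realtime_aqi",
        "description": "Weather forecast. Get real-time air quality: current AQI, PM2.5 and PM10.",
        "parameters": {
            "type": "object",
            "properties": {
                # "type": "string" added here; the original request omits it
                "city": {"type": "string", "description": "City name"}
            },
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="GLM-4-32B-0414",
    messages=[{"role": "user", "content": "How's the weather in Hangzhou?"}],
    tools=tools,
    temperature=0.01,
    top_p=0.1,
)
print(response.choices[0].message.tool_calls)
```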
Who can help? / 谁可以帮助到您?
No response
Information / 问题信息
- [x] The official example scripts / 官方的示例脚本
- [ ] My own modified scripts / 我自己修改的脚本和任务
Reproduction / 复现过程
Deploy GLM-4-32B-0414 with vLLM's OpenAI-style server, then send the request body pasted above.
Expected behavior / 期待表现
Expect to correctly receive the tool call in the response.
vllm: 0.8.1, CUDA: 12.0, A100 x 2
I'd like to know this as well, hmm.
With vLLM's --tool-call-parser set to pythonic, the tools prompt is triggered, but tool_calls comes back as an empty list.
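For context, vLLM's pythonic parser expects the model to emit calls as a Python-style list, e.g. `[realtime_aqi(city="Hangzhou")]`; if no such span appears in the output, the parsed tool_calls list stays empty. A small debugging sketch (reusing the `response` object from the client sketch above):

```python
# Debugging sketch: inspect what the parser returned. `response` is the
# ChatCompletion object from the client sketch earlier in this thread.
def dump_tool_calls(response):
    msg = response.choices[0].message
    if msg.tool_calls:
        for call in msg.tool_calls:
            # function.arguments is a JSON string, e.g. '{"city": "Hangzhou"}'
            print(call.function.name, call.function.arguments)
    else:
        # Empty tool_calls: the pythonic parser only matches output shaped
        # like [realtime_aqi(city="Hangzhou")]; check the raw text instead.
        print("no tool calls, raw content:", msg.content)
```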
Is there any documentation saying to use --tool-call-parser pythonic?
The problem is now partially solved. When calling with the openai library, adding tool_choice="required" forces the model to select at least one tool, and tool_calls is then populated:

```python
response = await self.client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=messages,
    tool_choice="required",
    tools=available_tools,
    temperature=0
)
```
Is this configured on the client side when making the call?
It is configured on the MCP client side when calling the model. tool_choice="required" forces the model to select one or more tools.
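Putting the thread's findings together, a self-contained async sketch; the base_url, api_key, and the served model name `glm4` are assumptions taken from the serve command below:

```python
# Self-contained async sketch of the tool_choice="required" workaround.
# Assumptions: a local vLLM server started with --served-model-name "glm4",
# --enable-auto-tool-choice and --tool-call-parser pythonic.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

available_tools = [{
    "type": "function",
    "function": {
        "name": "realtime_aqi",
        "description": "Get real-time air quality for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string", "description": "City name"}},
            "required": ["city"],
        },
    },
}]

async def main():
    response = await client.chat.completions.create(
        model="glm4",
        messages=[{"role": "user", "content": "How's the weather in Hangzhou?"}],
        tools=available_tools,
        tool_choice="required",  # force the model to pick at least one tool
        temperature=0,
    )
    print(response.choices[0].message.tool_calls)

asyncio.run(main())
```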
```bash
CUDA_VISIBLE_DEVICES=0,1 VLLM_USE_V1=0 vllm serve /home/lc/work/models/GLM-4-32B-0414 \
    --port 8000 \
    --trust-remote-code \
    --max-model-len 32768 \
    --tensor-parallel-size 2 \
    --gpu_memory_utilization 0.8 \
    --served-model-name "glm4" \
    --enable-auto-tool-choice \
    --tool-call-parser pythonic
```
This fails at startup: vllm == 0.8.4, transformers == 4.51.3, 2x H100 80GB.
```
(llm) (base) lc@ai-h100:~/work/vllm$ CUDA_VISIBLE_DEVICES=0,1 VLLM_USE_V1=0 vllm serve /home/lc/work/models/GLM-4-32B-0414 --port 8000 --trust-remote-code --max-model-len 32768 --tensor-parallel-size 2 --gpu_memory_utilization 0.8 --served-model-name "glm4" --enable-auto-tool-choice --tool-call-parser pythonic --trust-remote-code
INFO 04-17 02:04:28 [__init__.py:239] Automatically detected platform cuda.
WARNING 04-17 02:04:28 [cuda.py:409] Detected different devices in the system: NVIDIA H100 80GB HBM3, NVIDIA H100. Please make sure to set `CUDA_DEVICE_ORDER=PCI_BUS_ID` to avoid unexpected behavior.
INFO 04-17 02:04:30 [api_server.py:1034] vLLM API server version 0.8.4
INFO 04-17 02:04:30 [api_server.py:1035] args: Namespace(subparser='serve', model_tag='/home/lc/work/models/GLM-4-32B-0414', config='', host=None, port=8000, uvicorn_log_level='info', disable_uvicorn_access_log=False, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, enable_ssl_refresh=False, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=True, tool_call_parser='pythonic', tool_parser_plugin='', model='/home/lc/work/models/GLM-4-32B-0414', task='auto', tokenizer=None, hf_config_path=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, allowed_local_media_path=None, load_format='auto', download_dir=None, model_loader_extra_config=None, use_tqdm_on_load=True, config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', max_model_len=32768, guided_decoding_backend='xgrammar', logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=2, data_parallel_size=1, enable_expert_parallel=False, max_parallel_loading_workers=None, ray_workers_use_nsight=False, disable_custom_all_reduce=False, block_size=None, enable_prefix_caching=None, prefix_caching_hash_algo='builtin', disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=None, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.8, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_token=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=['glm4'], qlora_adapter_name_or_path=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', worker_extension_cls='', generation_config='auto', override_generation_config=None, enable_sleep_mode=False, calculate_kv_scales=False, additional_config=None, enable_reasoning=False, reasoning_parser=None, disable_cascade_attn=False, disable_chunked_mm_input=False, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, 
enable_server_load_tracking=False, dispatch_function=<function ServeSubcommand.cmd at 0x70a76c095260>)
INFO 04-17 02:04:37 [config.py:689] This model supports multiple tasks: {'score', 'classify', 'generate', 'reward', 'embed'}. Defaulting to 'generate'.
INFO 04-17 02:04:37 [config.py:1713] Defaulting to use mp for distributed inference
INFO 04-17 02:04:37 [api_server.py:246] Started engine process with PID 191738
INFO 04-17 02:04:41 [__init__.py:239] Automatically detected platform cuda.
WARNING 04-17 02:04:41 [cuda.py:409] Detected different devices in the system: NVIDIA H100 80GB HBM3, NVIDIA H100. Please make sure to set `CUDA_DEVICE_ORDER=PCI_BUS_ID` to avoid unexpected behavior.
INFO 04-17 02:04:43 [llm_engine.py:243] Initializing a V0 LLM engine (v0.8.4) with config: model='/home/lc/work/models/GLM-4-32B-0414', speculative_config=None, tokenizer='/home/lc/work/models/GLM-4-32B-0414', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=glm4, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=True,
WARNING 04-17 02:04:43 [multiproc_worker_utils.py:306] Reducing Torch parallelism from 64 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 04-17 02:04:44 [cuda.py:292] Using Flash Attention backend.
INFO 04-17 02:04:47 [__init__.py:239] Automatically detected platform cuda.
WARNING 04-17 02:04:48 [cuda.py:409] Detected different devices in the system: NVIDIA H100 80GB HBM3, NVIDIA H100. Please make sure to set `CUDA_DEVICE_ORDER=PCI_BUS_ID` to avoid unexpected behavior.
(VllmWorkerProcess pid=191902) INFO 04-17 02:04:49 [multiproc_worker_utils.py:225] Worker ready; awaiting tasks
(VllmWorkerProcess pid=191902) INFO 04-17 02:04:50 [cuda.py:292] Using Flash Attention backend.
(VllmWorkerProcess pid=191902) INFO 04-17 02:04:52 [utils.py:993] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=191902) INFO 04-17 02:04:52 [pynccl.py:69] vLLM is using nccl==2.21.5
INFO 04-17 02:04:52 [utils.py:993] Found nccl from library libnccl.so.2
INFO 04-17 02:04:52 [pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorkerProcess pid=191902) INFO 04-17 02:04:53 [custom_all_reduce_utils.py:244] reading GPU P2P access cache from /home/lc/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
INFO 04-17 02:04:53 [custom_all_reduce_utils.py:244] reading GPU P2P access cache from /home/lc/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
INFO 04-17 02:04:53 [shm_broadcast.py:264] vLLM message queue communication handle: Handle(local_reader_ranks=[1], buffer_handle=(1, 4194304, 6, 'psm_838735c5'), local_subscribe_addr='ipc:///tmp/f2503ef8-ae25-4ae3-945c-1f520c747109', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 04-17 02:04:53 [parallel_state.py:959] rank 0 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 0
(VllmWorkerProcess pid=191902) INFO 04-17 02:04:53 [parallel_state.py:959] rank 1 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 1
INFO 04-17 02:04:53 [model_runner.py:1110] Starting to load model /home/lc/work/models/GLM-4-32B-0414...
(VllmWorkerProcess pid=191902) INFO 04-17 02:04:53 [model_runner.py:1110] Starting to load model /home/lc/work/models/GLM-4-32B-0414...
Loading safetensors checkpoint shards: 0% Completed | 0/14 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 7% Completed | 1/14 [00:00<00:12, 1.08it/s]
Loading safetensors checkpoint shards: 14% Completed | 2/14 [00:01<00:11, 1.04it/s]
Loading safetensors checkpoint shards: 21% Completed | 3/14 [00:02<00:07, 1.41it/s]
Loading safetensors checkpoint shards: 29% Completed | 4/14 [00:03<00:08, 1.24it/s]
Loading safetensors checkpoint shards: 36% Completed | 5/14 [00:04<00:07, 1.19it/s]
Loading safetensors checkpoint shards: 43% Completed | 6/14 [00:05<00:06, 1.16it/s]
Loading safetensors checkpoint shards: 50% Completed | 7/14 [00:06<00:06, 1.10it/s]
Loading safetensors checkpoint shards: 57% Completed | 8/14 [00:07<00:05, 1.06it/s]
Loading safetensors checkpoint shards: 64% Completed | 9/14 [00:08<00:04, 1.04it/s]
Loading safetensors checkpoint shards: 71% Completed | 10/14 [00:09<00:03, 1.06it/s]
Loading safetensors checkpoint shards: 79% Completed | 11/14 [00:09<00:02, 1.10it/s]
Loading safetensors checkpoint shards: 86% Completed | 12/14 [00:10<00:01, 1.08it/s]
Loading safetensors checkpoint shards: 93% Completed | 13/14 [00:11<00:00, 1.06it/s]
Loading safetensors checkpoint shards: 100% Completed | 14/14 [00:12<00:00, 1.04it/s]
Loading safetensors checkpoint shards: 100% Completed | 14/14 [00:12<00:00, 1.09it/s]
INFO 04-17 02:05:06 [loader.py:458] Loading weights took 12.85 seconds
(VllmWorkerProcess pid=191902) INFO 04-17 02:05:06 [loader.py:458] Loading weights took 12.86 seconds
INFO 04-17 02:05:06 [model_runner.py:1146] Model loading took 30.4522 GiB and 13.050829 seconds
(VllmWorkerProcess pid=191902) INFO 04-17 02:05:06 [model_runner.py:1146] Model loading took 30.4522 GiB and 13.056882 seconds
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] Exception in worker VllmWorkerProcess while processing method determine_num_available_blocks.
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] Traceback (most recent call last):
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/executor/multiproc_worker_utils.py", line 232, in _run_worker_process
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] output = run_method(worker, method, args, kwargs)
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/utils.py", line 2378, in run_method
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] return func(*args, **kwargs)
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] return func(*args, **kwargs)
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/worker/worker.py", line 229, in determine_num_available_blocks
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] self.model_runner.profile_run()
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] return func(*args, **kwargs)
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 1243, in profile_run
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] self._dummy_run(max_num_batched_tokens, max_num_seqs)
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 1369, in _dummy_run
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] self.execute_model(model_input, kv_caches, intermediate_tensors)
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] return func(*args, **kwargs)
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 1770, in execute_model
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] hidden_or_intermediate_states = model_executable(
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] ^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/model_executor/models/glm4.py", line 285, in forward
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] hidden_states = self.model(input_ids, positions, intermediate_tensors,
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/compilation/decorators.py", line 172, in __call__
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] return self.forward(*args, **kwargs)
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/model_executor/models/llama.py", line 360, in forward
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] hidden_states, residual = layer(positions, hidden_states, residual)
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/model_executor/models/glm4.py", line 204, in forward
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] hidden_states = self.mlp(hidden_states)
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] ^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/model_executor/models/llama.py", line 92, in forward
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] x, _ = self.gate_up_proj(x)
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] ^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/model_executor/layers/linear.py", line 474, in forward
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] output_parallel = self.quant_method.apply(self, input_, bias)
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/model_executor/layers/linear.py", line 191, in apply
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] return F.linear(x, layer.weight, bias)
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] TypeError: linear(): argument 'input' (position 1) must be Tensor, not tuple
ERROR 04-17 02:05:08 [engine.py:448] linear(): argument 'input' (position 1) must be Tensor, not tuple
ERROR 04-17 02:05:08 [engine.py:448] Traceback (most recent call last):
ERROR 04-17 02:05:08 [engine.py:448] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 436, in run_mp_engine
ERROR 04-17 02:05:08 [engine.py:448] engine = MQLLMEngine.from_vllm_config(
ERROR 04-17 02:05:08 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 128, in from_vllm_config
ERROR 04-17 02:05:08 [engine.py:448] return cls(
ERROR 04-17 02:05:08 [engine.py:448] ^^^^
ERROR 04-17 02:05:08 [engine.py:448] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 82, in __init__
ERROR 04-17 02:05:08 [engine.py:448] self.engine = LLMEngine(*args, **kwargs)
ERROR 04-17 02:05:08 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 285, in __init__
ERROR 04-17 02:05:08 [engine.py:448] self._initialize_kv_caches()
ERROR 04-17 02:05:08 [engine.py:448] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 434, in _initialize_kv_caches
ERROR 04-17 02:05:08 [engine.py:448] self.model_executor.determine_num_available_blocks())
ERROR 04-17 02:05:08 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/executor/executor_base.py", line 103, in determine_num_available_blocks
ERROR 04-17 02:05:08 [engine.py:448] results = self.collective_rpc("determine_num_available_blocks")
ERROR 04-17 02:05:08 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/executor/executor_base.py", line 331, in collective_rpc
ERROR 04-17 02:05:08 [engine.py:448] return self._run_workers(method, *args, **(kwargs or {}))
ERROR 04-17 02:05:08 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/executor/mp_distributed_executor.py", line 185, in _run_workers
ERROR 04-17 02:05:08 [engine.py:448] driver_worker_output = run_method(self.driver_worker, sent_method,
ERROR 04-17 02:05:08 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/utils.py", line 2378, in run_method
ERROR 04-17 02:05:08 [engine.py:448] return func(*args, **kwargs)
ERROR 04-17 02:05:08 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 04-17 02:05:08 [engine.py:448] return func(*args, **kwargs)
ERROR 04-17 02:05:08 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/worker/worker.py", line 229, in determine_num_available_blocks
ERROR 04-17 02:05:08 [engine.py:448] self.model_runner.profile_run()
ERROR 04-17 02:05:08 [engine.py:448] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 04-17 02:05:08 [engine.py:448] return func(*args, **kwargs)
ERROR 04-17 02:05:08 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 1243, in profile_run
ERROR 04-17 02:05:08 [engine.py:448] self._dummy_run(max_num_batched_tokens, max_num_seqs)
ERROR 04-17 02:05:08 [engine.py:448] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 1369, in _dummy_run
ERROR 04-17 02:05:08 [engine.py:448] self.execute_model(model_input, kv_caches, intermediate_tensors)
ERROR 04-17 02:05:08 [engine.py:448] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 04-17 02:05:08 [engine.py:448] return func(*args, **kwargs)
ERROR 04-17 02:05:08 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 1770, in execute_model
ERROR 04-17 02:05:08 [engine.py:448] hidden_or_intermediate_states = model_executable(
ERROR 04-17 02:05:08 [engine.py:448] ^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
ERROR 04-17 02:05:08 [engine.py:448] return self._call_impl(*args, **kwargs)
ERROR 04-17 02:05:08 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
ERROR 04-17 02:05:08 [engine.py:448] return forward_call(*args, **kwargs)
ERROR 04-17 02:05:08 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/model_executor/models/glm4.py", line 285, in forward
ERROR 04-17 02:05:08 [engine.py:448] hidden_states = self.model(input_ids, positions, intermediate_tensors,
ERROR 04-17 02:05:08 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/compilation/decorators.py", line 172, in __call__
ERROR 04-17 02:05:08 [engine.py:448] return self.forward(*args, **kwargs)
ERROR 04-17 02:05:08 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/model_executor/models/llama.py", line 360, in forward
ERROR 04-17 02:05:08 [engine.py:448] hidden_states, residual = layer(positions, hidden_states, residual)
ERROR 04-17 02:05:08 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
ERROR 04-17 02:05:08 [engine.py:448] return self._call_impl(*args, **kwargs)
ERROR 04-17 02:05:08 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
ERROR 04-17 02:05:08 [engine.py:448] return forward_call(*args, **kwargs)
ERROR 04-17 02:05:08 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/model_executor/models/glm4.py", line 204, in forward
ERROR 04-17 02:05:08 [engine.py:448] hidden_states = self.mlp(hidden_states)
ERROR 04-17 02:05:08 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
ERROR 04-17 02:05:08 [engine.py:448] return self._call_impl(*args, **kwargs)
ERROR 04-17 02:05:08 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
ERROR 04-17 02:05:08 [engine.py:448] return forward_call(*args, **kwargs)
ERROR 04-17 02:05:08 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/model_executor/models/llama.py", line 92, in forward
ERROR 04-17 02:05:08 [engine.py:448] x, _ = self.gate_up_proj(x)
ERROR 04-17 02:05:08 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
ERROR 04-17 02:05:08 [engine.py:448] return self._call_impl(*args, **kwargs)
ERROR 04-17 02:05:08 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
ERROR 04-17 02:05:08 [engine.py:448] return forward_call(*args, **kwargs)
ERROR 04-17 02:05:08 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/model_executor/layers/linear.py", line 474, in forward
ERROR 04-17 02:05:08 [engine.py:448] output_parallel = self.quant_method.apply(self, input_, bias)
ERROR 04-17 02:05:08 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/model_executor/layers/linear.py", line 191, in apply
ERROR 04-17 02:05:08 [engine.py:448] return F.linear(x, layer.weight, bias)
ERROR 04-17 02:05:08 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448] TypeError: linear(): argument 'input' (position 1) must be Tensor, not tuple
Traceback (most recent call last):
File "/home/lc/anaconda3/envs/llm/bin/vllm", line 8, in <module>
sys.exit(main())
^^^^^^
File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/entrypoints/cli/main.py", line 51, in main
args.dispatch_function(args)
File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/entrypoints/cli/serve.py", line 27, in cmd
uvloop.run(run_server(args))
File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/uvloop/__init__.py", line 105, in run
return runner.run(wrapper())
^^^^^^^^^^^^^^^^^^^^^
File "/home/lc/anaconda3/envs/llm/lib/python3.11/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/uvloop/__init__.py", line 61, in wrapper
return await main
^^^^^^^^^^
File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 1069, in run_server
async with build_async_engine_client(args) as engine_client:
File "/home/lc/anaconda3/envs/llm/lib/python3.11/contextlib.py", line 210, in __aenter__
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 146, in build_async_engine_client
async with build_async_engine_client_from_engine_args(
File "/home/lc/anaconda3/envs/llm/lib/python3.11/contextlib.py", line 210, in __aenter__
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 269, in build_async_engine_client_from_engine_args
raise RuntimeError(
RuntimeError: Engine process failed to start. See stack trace for the root cause.
/home/lc/anaconda3/envs/llm/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
/home/lc/anaconda3/envs/llm/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
(llm) (base) lc@ai-h100:~/work/vllm$
```
"/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 1069, in run_server async with build_async_engine_client(args) as engine_client: File "/home/lc/anaconda3/envs/llm/lib/python3.11/contextlib.py", line 210, in __aenter__ return await anext(self.gen) ^^^^^^^^^^^^^^^^^^^^^ File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 146, in build_async_engine_client async with build_async_engine_client_from_engine_args( File "/home/lc/anaconda3/envs/llm/lib/python3.11/contextlib.py", line 210, in __aenter__ return await anext(self.gen) ^^^^^^^^^^^^^^^^^^^^^ File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 269, in build_async_engine_client_from_engine_args raise RuntimeError( RuntimeError: Engine process failed to start. See stack trace for the root cause. /home/lc/anaconda3/envs/llm/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown warnings.warn('resource_tracker: There appear to be %d ' /home/lc/anaconda3/envs/llm/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown warnings.warn('resource_tracker: There appear to be %d ' (llm) (base) lc@ai-h100:~/work/vllm$
Don't upgrade vLLM to 0.8.4, it has a bug; only upgrade Transformers. Instead, install vLLM from source with the fix from PR #16618:
# remove the installed wheel, then build vLLM from source with the fix from PR #16618
pip uninstall vllm
git clone https://github.com/vllm-project/vllm.git
cd vllm
git fetch origin pull/16618/head:pr-16618
git checkout pr-16618   # check out the PR branch before building, otherwise you build main
VLLM_USE_PRECOMPILED=1 pip install --editable .
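A quick way to confirm the source build is the one actually being imported (a sanity check I am adding, not from the original thread; the dev-style version string matches the server log further down):

# Sanity check: the editable install should report a dev version built from the
# checked-out source tree, not the 0.8.4 wheel from PyPI.
import vllm

print(vllm.__version__)  # expect something like 0.8.3rc2.dev...+g..., not 0.8.4
print(vllm.__file__)     # expect a path under the cloned repo, not site-packages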
After uninstalling the old version and rebuilding with #16618, the error still occurs:
(llm) (base) lc@ai-h100:~/work$ CUDA_VISIBLE_DEVICES=0,1 VLLM_USE_V1=0 vllm serve /home/lc/work/models/GLM-4-32B-0414 \
--port 8000 \
--trust-remote-code \
--max-model-len 32768 \
--tensor-parallel-size 2 \
--gpu_memory_utilization 0.8 \
--served-model-name "glm4" \
--enable-auto-tool-choice \
--tool-call-parser pythonic \
--trust-remote-code
INFO 04-17 20:51:43 [__init__.py:239] Automatically detected platform cuda.
WARNING 04-17 20:51:44 [cuda.py:413] Detected different devices in the system: NVIDIA H100 80GB HBM3, NVIDIA H100. Please make sure to set `CUDA_DEVICE_ORDER=PCI_BUS_ID` to avoid unexpected behavior.
INFO 04-17 20:51:45 [api_server.py:1034] vLLM API server version 0.8.3rc2.dev107+gdb95cbc1e.d20250417
INFO 04-17 20:51:45 [api_server.py:1035] args: Namespace(subparser='serve', model_tag='/home/lc/work/models/GLM-4-32B-0414', config='', host=None, port=8000, uvicorn_log_level='info', disable_uvicorn_access_log=False, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, enable_ssl_refresh=False, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=True, tool_call_parser='pythonic', tool_parser_plugin='', model='/home/lc/work/models/GLM-4-32B-0414', task='auto', tokenizer=None, hf_config_path=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', max_model_len=32768, guided_decoding_backend='xgrammar', logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=2, data_parallel_size=1, enable_expert_parallel=False, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=None, prefix_caching_hash_algo='builtin', disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=None, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.8, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_token=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, use_tqdm_on_load=True, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_config=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=['glm4'], qlora_adapter_name_or_path=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', worker_extension_cls='', generation_config='auto', override_generation_config=None, enable_sleep_mode=False, calculate_kv_scales=False, additional_config=None, enable_reasoning=False, reasoning_parser=None, disable_cascade_attn=False, disable_chunked_mm_input=False, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, 
enable_server_load_tracking=False, dispatch_function=<function ServeSubcommand.cmd at 0x7061827b4ea0>)
INFO 04-17 20:51:53 [config.py:604] This model supports multiple tasks: {'generate', 'classify', 'reward', 'score', 'embed'}. Defaulting to 'generate'.
INFO 04-17 20:51:53 [config.py:1609] Defaulting to use mp for distributed inference
INFO 04-17 20:51:53 [api_server.py:246] Started engine process with PID 202583
INFO 04-17 20:51:57 [__init__.py:239] Automatically detected platform cuda.
WARNING 04-17 20:51:57 [cuda.py:413] Detected different devices in the system: NVIDIA H100 80GB HBM3, NVIDIA H100. Please make sure to set `CUDA_DEVICE_ORDER=PCI_BUS_ID` to avoid unexpected behavior.
INFO 04-17 20:51:58 [llm_engine.py:243] Initializing a V0 LLM engine (v0.8.3rc2.dev107+gdb95cbc1e.d20250417) with config: model='/home/lc/work/models/GLM-4-32B-0414', speculative_config=None, tokenizer='/home/lc/work/models/GLM-4-32B-0414', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=glm4, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=True,
WARNING 04-17 20:51:59 [multiproc_worker_utils.py:306] Reducing Torch parallelism from 64 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 04-17 20:52:00 [cuda.py:292] Using Flash Attention backend.
INFO 04-17 20:52:03 [__init__.py:239] Automatically detected platform cuda.
WARNING 04-17 20:52:03 [cuda.py:413] Detected different devices in the system: NVIDIA H100 80GB HBM3, NVIDIA H100. Please make sure to set `CUDA_DEVICE_ORDER=PCI_BUS_ID` to avoid unexpected behavior.
(VllmWorkerProcess pid=202747) INFO 04-17 20:52:04 [multiproc_worker_utils.py:225] Worker ready; awaiting tasks
(VllmWorkerProcess pid=202747) INFO 04-17 20:52:06 [cuda.py:292] Using Flash Attention backend.
(VllmWorkerProcess pid=202747) INFO 04-17 20:52:07 [utils.py:990] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=202747) INFO 04-17 20:52:07 [pynccl.py:69] vLLM is using nccl==2.21.5
INFO 04-17 20:52:07 [utils.py:990] Found nccl from library libnccl.so.2
INFO 04-17 20:52:07 [pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorkerProcess pid=202747) INFO 04-17 20:52:08 [custom_all_reduce_utils.py:244] reading GPU P2P access cache from /home/lc/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
INFO 04-17 20:52:08 [custom_all_reduce_utils.py:244] reading GPU P2P access cache from /home/lc/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
INFO 04-17 20:52:08 [shm_broadcast.py:264] vLLM message queue communication handle: Handle(local_reader_ranks=[1], buffer_handle=(1, 4194304, 6, 'psm_9ca69c7f'), local_subscribe_addr='ipc:///tmp/169495f2-aed7-4b97-ac35-75ab01eafaf5', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorkerProcess pid=202747) INFO 04-17 20:52:08 [parallel_state.py:957] rank 1 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 1
INFO 04-17 20:52:08 [parallel_state.py:957] rank 0 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 04-17 20:52:08 [model_runner.py:1110] Starting to load model /home/lc/work/models/GLM-4-32B-0414...
(VllmWorkerProcess pid=202747) INFO 04-17 20:52:08 [model_runner.py:1110] Starting to load model /home/lc/work/models/GLM-4-32B-0414...
Loading safetensors checkpoint shards: 0% Completed | 0/14 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 14/14 [00:12<00:00, 1.11it/s]
INFO 04-17 20:52:21 [loader.py:458] Loading weights took 12.65 seconds
(VllmWorkerProcess pid=202747) INFO 04-17 20:52:21 [loader.py:458] Loading weights took 12.66 seconds
INFO 04-17 20:52:21 [model_runner.py:1146] Model loading took 30.4522 GiB and 12.851758 seconds
(VllmWorkerProcess pid=202747) INFO 04-17 20:52:21 [model_runner.py:1146] Model loading took 30.4522 GiB and 12.857013 seconds
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238] Exception in worker VllmWorkerProcess while processing method determine_num_available_blocks.
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238] Traceback (most recent call last):
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238] File "/home/lc/work/vllm/vllm/executor/multiproc_worker_utils.py", line 232, in _run_worker_process
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238] output = run_method(worker, method, args, kwargs)
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238] File "/home/lc/work/vllm/vllm/utils.py", line 2363, in run_method
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238] return func(*args, **kwargs)
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238] ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238] return func(*args, **kwargs)
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238] ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238] File "/home/lc/work/vllm/vllm/worker/worker.py", line 229, in determine_num_available_blocks
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238] self.model_runner.profile_run()
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238] return func(*args, **kwargs)
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238] ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238] File "/home/lc/work/vllm/vllm/worker/model_runner.py", line 1243, in profile_run
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238] self._dummy_run(max_num_batched_tokens, max_num_seqs)
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238] File "/home/lc/work/vllm/vllm/worker/model_runner.py", line 1369, in _dummy_run
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238] self.execute_model(model_input, kv_caches, intermediate_tensors)
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238] return func(*args, **kwargs)
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238] ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238] File "/home/lc/work/vllm/vllm/worker/model_runner.py", line 1770, in execute_model
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238] hidden_or_intermediate_states = model_executable(
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238] ^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238] return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238] return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238] File "/home/lc/work/vllm/vllm/model_executor/models/glm4.py", line 285, in forward
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238] hidden_states = self.model(input_ids, positions, intermediate_tensors,
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238] File "/home/lc/work/vllm/vllm/compilation/decorators.py", line 172, in __call__
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238] return self.forward(*args, **kwargs)
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238] File "/home/lc/work/vllm/vllm/model_executor/models/llama.py", line 369, in forward
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238] hidden_states, residual = layer(positions, hidden_states, residual)
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238] return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238] return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238] File "/home/lc/work/vllm/vllm/model_executor/models/glm4.py", line 204, in forward
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238] hidden_states = self.mlp(hidden_states)
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238] ^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238] return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238] return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238] File "/home/lc/work/vllm/vllm/model_executor/models/llama.py", line 92, in forward
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238] x_out = self.gate_up_proj(x)
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238] ^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238] return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238] return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238] File "/home/lc/work/vllm/vllm/model_executor/layers/linear.py", line 474, in forward
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238] output_parallel = self.quant_method.apply(self, input_, bias)
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238] File "/home/lc/work/vllm/vllm/model_executor/layers/linear.py", line 191, in apply
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238] return F.linear(x, layer.weight, bias)
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=202747) ERROR 04-17 20:52:23 [multiproc_worker_utils.py:238] TypeError: linear(): argument 'input' (position 1) must be Tensor, not tuple
ERROR 04-17 20:52:23 [engine.py:448] linear(): argument 'input' (position 1) must be Tensor, not tuple
ERROR 04-17 20:52:23 [engine.py:448] Traceback (most recent call last):
ERROR 04-17 20:52:23 [engine.py:448] File "/home/lc/work/vllm/vllm/engine/multiprocessing/engine.py", line 436, in run_mp_engine
ERROR 04-17 20:52:23 [engine.py:448] engine = MQLLMEngine.from_vllm_config(
ERROR 04-17 20:52:23 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 20:52:23 [engine.py:448] File "/home/lc/work/vllm/vllm/engine/multiprocessing/engine.py", line 128, in from_vllm_config
ERROR 04-17 20:52:23 [engine.py:448] return cls(
ERROR 04-17 20:52:23 [engine.py:448] ^^^^
ERROR 04-17 20:52:23 [engine.py:448] File "/home/lc/work/vllm/vllm/engine/multiprocessing/engine.py", line 82, in __init__
ERROR 04-17 20:52:23 [engine.py:448] self.engine = LLMEngine(*args, **kwargs)
ERROR 04-17 20:52:23 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 20:52:23 [engine.py:448] File "/home/lc/work/vllm/vllm/engine/llm_engine.py", line 285, in __init__
ERROR 04-17 20:52:23 [engine.py:448] self._initialize_kv_caches()
ERROR 04-17 20:52:23 [engine.py:448] File "/home/lc/work/vllm/vllm/engine/llm_engine.py", line 434, in _initialize_kv_caches
ERROR 04-17 20:52:23 [engine.py:448] self.model_executor.determine_num_available_blocks())
ERROR 04-17 20:52:23 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 20:52:23 [engine.py:448] File "/home/lc/work/vllm/vllm/executor/executor_base.py", line 103, in determine_num_available_blocks
ERROR 04-17 20:52:23 [engine.py:448] results = self.collective_rpc("determine_num_available_blocks")
ERROR 04-17 20:52:23 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 20:52:23 [engine.py:448] File "/home/lc/work/vllm/vllm/executor/executor_base.py", line 331, in collective_rpc
ERROR 04-17 20:52:23 [engine.py:448] return self._run_workers(method, *args, **(kwargs or {}))
ERROR 04-17 20:52:23 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 20:52:23 [engine.py:448] File "/home/lc/work/vllm/vllm/executor/mp_distributed_executor.py", line 185, in _run_workers
ERROR 04-17 20:52:23 [engine.py:448] driver_worker_output = run_method(self.driver_worker, sent_method,
ERROR 04-17 20:52:23 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 20:52:23 [engine.py:448] File "/home/lc/work/vllm/vllm/utils.py", line 2363, in run_method
ERROR 04-17 20:52:23 [engine.py:448] return func(*args, **kwargs)
ERROR 04-17 20:52:23 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 20:52:23 [engine.py:448] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 04-17 20:52:23 [engine.py:448] return func(*args, **kwargs)
ERROR 04-17 20:52:23 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 20:52:23 [engine.py:448] File "/home/lc/work/vllm/vllm/worker/worker.py", line 229, in determine_num_available_blocks
ERROR 04-17 20:52:23 [engine.py:448] self.model_runner.profile_run()
ERROR 04-17 20:52:23 [engine.py:448] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 04-17 20:52:23 [engine.py:448] return func(*args, **kwargs)
ERROR 04-17 20:52:23 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 20:52:23 [engine.py:448] File "/home/lc/work/vllm/vllm/worker/model_runner.py", line 1243, in profile_run
ERROR 04-17 20:52:23 [engine.py:448] self._dummy_run(max_num_batched_tokens, max_num_seqs)
ERROR 04-17 20:52:23 [engine.py:448] File "/home/lc/work/vllm/vllm/worker/model_runner.py", line 1369, in _dummy_run
ERROR 04-17 20:52:23 [engine.py:448] self.execute_model(model_input, kv_caches, intermediate_tensors)
ERROR 04-17 20:52:23 [engine.py:448] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 04-17 20:52:23 [engine.py:448] return func(*args, **kwargs)
ERROR 04-17 20:52:23 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 20:52:23 [engine.py:448] File "/home/lc/work/vllm/vllm/worker/model_runner.py", line 1770, in execute_model
ERROR 04-17 20:52:23 [engine.py:448] hidden_or_intermediate_states = model_executable(
ERROR 04-17 20:52:23 [engine.py:448] ^^^^^^^^^^^^^^^^^
ERROR 04-17 20:52:23 [engine.py:448] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
ERROR 04-17 20:52:23 [engine.py:448] return self._call_impl(*args, **kwargs)
ERROR 04-17 20:52:23 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 20:52:23 [engine.py:448] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
ERROR 04-17 20:52:23 [engine.py:448] return forward_call(*args, **kwargs)
ERROR 04-17 20:52:23 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 20:52:23 [engine.py:448] File "/home/lc/work/vllm/vllm/model_executor/models/glm4.py", line 285, in forward
ERROR 04-17 20:52:23 [engine.py:448] hidden_states = self.model(input_ids, positions, intermediate_tensors,
ERROR 04-17 20:52:23 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 20:52:23 [engine.py:448] File "/home/lc/work/vllm/vllm/compilation/decorators.py", line 172, in __call__
ERROR 04-17 20:52:23 [engine.py:448] return self.forward(*args, **kwargs)
ERROR 04-17 20:52:23 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 20:52:23 [engine.py:448] File "/home/lc/work/vllm/vllm/model_executor/models/llama.py", line 369, in forward
ERROR 04-17 20:52:23 [engine.py:448] hidden_states, residual = layer(positions, hidden_states, residual)
ERROR 04-17 20:52:23 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 20:52:23 [engine.py:448] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
ERROR 04-17 20:52:23 [engine.py:448] return self._call_impl(*args, **kwargs)
ERROR 04-17 20:52:23 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 20:52:23 [engine.py:448] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
ERROR 04-17 20:52:23 [engine.py:448] return forward_call(*args, **kwargs)
ERROR 04-17 20:52:23 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 20:52:23 [engine.py:448] File "/home/lc/work/vllm/vllm/model_executor/models/glm4.py", line 204, in forward
ERROR 04-17 20:52:23 [engine.py:448] hidden_states = self.mlp(hidden_states)
ERROR 04-17 20:52:23 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 20:52:23 [engine.py:448] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
ERROR 04-17 20:52:23 [engine.py:448] return self._call_impl(*args, **kwargs)
ERROR 04-17 20:52:23 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 20:52:23 [engine.py:448] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
ERROR 04-17 20:52:23 [engine.py:448] return forward_call(*args, **kwargs)
ERROR 04-17 20:52:23 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 20:52:23 [engine.py:448] File "/home/lc/work/vllm/vllm/model_executor/models/llama.py", line 92, in forward
ERROR 04-17 20:52:23 [engine.py:448] x_out = self.gate_up_proj(x)
ERROR 04-17 20:52:23 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 20:52:23 [engine.py:448] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
ERROR 04-17 20:52:23 [engine.py:448] return self._call_impl(*args, **kwargs)
ERROR 04-17 20:52:23 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 20:52:23 [engine.py:448] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
ERROR 04-17 20:52:23 [engine.py:448] return forward_call(*args, **kwargs)
ERROR 04-17 20:52:23 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 20:52:23 [engine.py:448] File "/home/lc/work/vllm/vllm/model_executor/layers/linear.py", line 474, in forward
ERROR 04-17 20:52:23 [engine.py:448] output_parallel = self.quant_method.apply(self, input_, bias)
ERROR 04-17 20:52:23 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 20:52:23 [engine.py:448] File "/home/lc/work/vllm/vllm/model_executor/layers/linear.py", line 191, in apply
ERROR 04-17 20:52:23 [engine.py:448] return F.linear(x, layer.weight, bias)
ERROR 04-17 20:52:23 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 20:52:23 [engine.py:448] TypeError: linear(): argument 'input' (position 1) must be Tensor, not tuple
INFO 04-17 20:52:23 [multiproc_worker_utils.py:124] Killing local vLLM worker processes
Process SpawnProcess-1:
Traceback (most recent call last):
File "/home/lc/anaconda3/envs/llm/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/home/lc/anaconda3/envs/llm/lib/python3.11/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/home/lc/work/vllm/vllm/engine/multiprocessing/engine.py", line 450, in run_mp_engine
raise e
File "/home/lc/work/vllm/vllm/engine/multiprocessing/engine.py", line 436, in run_mp_engine
engine = MQLLMEngine.from_vllm_config(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/lc/work/vllm/vllm/engine/multiprocessing/engine.py", line 128, in from_vllm_config
return cls(
^^^^
File "/home/lc/work/vllm/vllm/engine/multiprocessing/engine.py", line 82, in __init__
self.engine = LLMEngine(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/lc/work/vllm/vllm/engine/llm_engine.py", line 285, in __init__
self._initialize_kv_caches()
File "/home/lc/work/vllm/vllm/engine/llm_engine.py", line 434, in _initialize_kv_caches
self.model_executor.determine_num_available_blocks())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/lc/work/vllm/vllm/executor/executor_base.py", line 103, in determine_num_available_blocks
results = self.collective_rpc("determine_num_available_blocks")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/lc/work/vllm/vllm/executor/executor_base.py", line 331, in collective_rpc
return self._run_workers(method, *args, **(kwargs or {}))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/lc/work/vllm/vllm/executor/mp_distributed_executor.py", line 185, in _run_workers
driver_worker_output = run_method(self.driver_worker, sent_method,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/lc/work/vllm/vllm/utils.py", line 2363, in run_method
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/lc/work/vllm/vllm/worker/worker.py", line 229, in determine_num_available_blocks
self.model_runner.profile_run()
File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/lc/work/vllm/vllm/worker/model_runner.py", line 1243, in profile_run
self._dummy_run(max_num_batched_tokens, max_num_seqs)
File "/home/lc/work/vllm/vllm/worker/model_runner.py", line 1369, in _dummy_run
self.execute_model(model_input, kv_caches, intermediate_tensors)
File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/lc/work/vllm/vllm/worker/model_runner.py", line 1770, in execute_model
hidden_or_intermediate_states = model_executable(
^^^^^^^^^^^^^^^^^
File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/lc/work/vllm/vllm/model_executor/models/glm4.py", line 285, in forward
hidden_states = self.model(input_ids, positions, intermediate_tensors,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/lc/work/vllm/vllm/compilation/decorators.py", line 172, in __call__
return self.forward(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/lc/work/vllm/vllm/model_executor/models/llama.py", line 369, in forward
hidden_states, residual = layer(positions, hidden_states, residual)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/lc/work/vllm/vllm/model_executor/models/glm4.py", line 204, in forward
hidden_states = self.mlp(hidden_states)
^^^^^^^^^^^^^^^^^^^^^^^
File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/lc/work/vllm/vllm/model_executor/models/llama.py", line 92, in forward
x_out = self.gate_up_proj(x)
^^^^^^^^^^^^^^^^^^^^
File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/lc/work/vllm/vllm/model_executor/layers/linear.py", line 474, in forward
output_parallel = self.quant_method.apply(self, input_, bias)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/lc/work/vllm/vllm/model_executor/layers/linear.py", line 191, in apply
return F.linear(x, layer.weight, bias)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: linear(): argument 'input' (position 1) must be Tensor, not tuple
Traceback (most recent call last):
File "/home/lc/anaconda3/envs/llm/bin/vllm", line 8, in <module>
sys.exit(main())
^^^^^^
File "/home/lc/work/vllm/vllm/entrypoints/cli/main.py", line 51, in main
args.dispatch_function(args)
File "/home/lc/work/vllm/vllm/entrypoints/cli/serve.py", line 27, in cmd
uvloop.run(run_server(args))
File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/uvloop/__init__.py", line 105, in run
return runner.run(wrapper())
^^^^^^^^^^^^^^^^^^^^^
File "/home/lc/anaconda3/envs/llm/lib/python3.11/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/uvloop/__init__.py", line 61, in wrapper
return await main
^^^^^^^^^^
File "/home/lc/work/vllm/vllm/entrypoints/openai/api_server.py", line 1069, in run_server
async with build_async_engine_client(args) as engine_client:
File "/home/lc/anaconda3/envs/llm/lib/python3.11/contextlib.py", line 210, in __aenter__
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File "/home/lc/work/vllm/vllm/entrypoints/openai/api_server.py", line 146, in build_async_engine_client
async with build_async_engine_client_from_engine_args(
File "/home/lc/anaconda3/envs/llm/lib/python3.11/contextlib.py", line 210, in __aenter__
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File "/home/lc/work/vllm/vllm/entrypoints/openai/api_server.py", line 269, in build_async_engine_client_from_engine_args
raise RuntimeError(
RuntimeError: Engine process failed to start. See stack trace for the root cause.
/home/lc/anaconda3/envs/llm/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
/home/lc/anaconda3/envs/llm/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
0.8.4 is not recommended. You need to build and install vLLM from source for it to work; last night's PR fixed this.
The problem is now partially solved: when calling with the openai library, response = await self.client.chat.completions.create(model="gpt-3.5-turbo", messages=messages, tool_choice="required", tools=available_tools, temperature=0). Adding tool_choice="required" means the model must select at least one tool; with that, tool_calls is populated.
Hi, with this change tool_calls does get a value, but finish_reason is still "stop". Have you noticed this?
This is configured on the MCP client side when calling the model. tool_choice="required" forces the model to select one or more tools.
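For reference, a minimal self-contained sketch of that configuration (my example, not from the thread: the base_url, the model name "glm4", and the get_weather tool schema are assumptions; "glm4" matches the --served-model-name used in the serve command above):

# Minimal sketch: call the vLLM OpenAI-compatible server with tool_choice="required".
# Assumes a server like the one above on localhost:8000 with --served-model-name "glm4";
# get_weather is a made-up example tool schema.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string", "description": "City name"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="glm4",
    messages=[{"role": "user", "content": "What's the weather in Hangzhou?"}],
    tools=tools,
    tool_choice="required",  # force the model to emit at least one tool call
    temperature=0,
)
print(response.choices[0].message.tool_calls)
print(response.choices[0].finish_reason)  # reportedly "stop" here instead of "tool_calls"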
@zRzRzRzRzRzRzR how can this be properly solved? I verified it too: with tool_choice set to "required" the call returns normally, but an ideal agent should decide autonomously whether, and which, function tools to invoke.
The problem is now partially solved: when calling with the openai library, response = await self.client.chat.completions.create(model="gpt-3.5-turbo", messages=messages, tool_choice="required", tools=available_tools, temperature=0). Adding tool_choice="required" means the model must select at least one tool; with that, tool_calls is populated.
Is this a problem with the model itself, or an incompatibility between the model and the serving framework?
After adding the --tool-call-parser pythonic flag to vLLM, I modified the prompt following https://github.com/vllm-project/vllm/blob/main/examples/tool_chat_template_llama3.2_pythonic.jinja. The returned tool calls should then look like [func_name1(params_name1=params_value1, params_name2=params_value2...), func_name2(params)], because vLLM's pythonic parser only populates tool_calls when the output matches this format. The parsing logic is in https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/openai/tool_parsers/pythonic_tool_parser.py.
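To make the expected format concrete, here is a small illustrative sketch (my own, not vLLM's actual parser implementation; get_weather and the ast-based extraction are assumptions for demonstration) of reading a pythonic-style reply:

# Illustrative only: parse a pythonic-format tool-call string of the shape
# [func_name1(arg1=value1, ...), func_name2(...)] with the standard library.
# vLLM's real logic lives in pythonic_tool_parser.py; this just shows the format.
import ast

model_output = '[get_weather(city="Hangzhou"), get_weather(city="Beijing")]'

tree = ast.parse(model_output, mode="eval")
assert isinstance(tree.body, ast.List), "reply must be a Python list literal"
for call in tree.body.elts:
    assert isinstance(call, ast.Call), "each element must be a function call"
    name = call.func.id                                      # tool name, e.g. get_weather
    args = {kw.arg: ast.literal_eval(kw.value) for kw in call.keywords}
    print(name, args)                                        # -> get_weather {'city': 'Hangzhou'}

If the model instead emits plain text or JSON-style tool calls, the pythonic parser finds nothing to extract, which would explain an empty tool_calls list.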
