inference 使用自定义模型，启动之后发起聊天会话失败

System Info / 系統信息

CUDA Version: 12.6

Running Xinference with Docker? / 是否使用 Docker 运行 Xinfernece？

[x] docker / docker
[ ] pip install / 通过 pip install 安装
[ ] installation from source / 从源码安装

Version info / 版本信息

xinference, version 1.3.1

The command used to start Xinference / 用以启动 xinference 的命令

docker-compose.yml文件如下：

services:

xinference: image: swr.cn-north-4.myhuaweicloud.com/ddn-k8s/docker.io/xprobe/xinference:v1.3.1 container_name: xinference ports: - "9997:9997" volumes: - /home/eddie/dev/docker-service/data/xinference/.xinference:/root/.xinference - /home/eddie/dev/docker-service/data/xinference/.cache/huggingface:/root/.cache/huggingface - /home/eddie/dev/docker-service/data/xinference/.cache/modelScope:/root/.cache/modelScope - /home/eddie/dev/docker-service/data/xinference/log:/xinference/logs environment: - XINFERENCE_HOME=/xinference - XINFERENCE_MODEL_SRC=modelscope restart: always command: xinference-local -H 0.0.0.0 --log-level debug deploy: resources: reservations: devices: - driver: nvidia capabilities: ["gpu"] count: all

Reproduction / 复现过程

操作过程见下： 1、注册本地模型，json格式如下： { "version": 1, "model_name": "Qwen2___5-14B-Instruct", "model_description": "Qwen2___5-14B-Instruct", "context_length": 8000, "model_lang": [ "en", "zh" ], "model_ability": [ "chat", "generate" ], "model_family": "qwen2.5-instruct", "model_specs": [ { "model_uri": "/root/.cache/modelScope/models/qwen/Qwen2___5-14B-Instruct", "model_size_in_billions": 14, "model_format": "pytorch", "quantizations": [ "none" ] } ], "chat_template": "{%- if tools %}\n {{- '<|im_start|>system\n' }}\n {%- if messages[0]['role'] == 'system' %}\n {{- messages[0]['content'] }}\n {%- else %}\n {{- 'You are Qwen, created by Alibaba Cloud. You are a helpful assistant.' }}\n {%- endif %}\n {{- "\n\n# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within XML tags:\n" }}\n {%- for tool in tools %}\n {{- "\n" }}\n {{- tool | tojson }}\n {%- endfor %}\n {{- "\n\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": , \"arguments\": }\n</tool_call><|im_end|>\n" }}\n{%- else %}\n {%- if messages[0]['role'] == 'system' %}\n {{- '<|im_start|>system\n' + messages[0]['content'] + '<|im_end|>\n' }}\n {%- else %}\n {{- '<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n' }}\n {%- endif %}\n{%- endif %}\n{%- for message in messages %}\n {%- if (message.role == "user") or (message.role == "system" and not loop.first) or (message.role == "assistant" and not message.tool_calls) %}\n {{- '<|im_start|>' + message.role + '\n' + message.content + '<|im_end|>' + '\n' }}\n {%- elif message.role == "assistant" %}\n {{- '<|im_start|>' + message.role }}\n {%- if message.content %}\n {{- '\n' + message.content }}\n {%- endif %}\n {%- for tool_call in message.tool_calls %}\n {%- if tool_call.function is defined %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '\n<tool_call>\n{"name": "' }}\n {{- tool_call.name }}\n {{- '", "arguments": ' }}\n {{- tool_call.arguments | tojson }}\n {{- '}\n</tool_call>' }}\n {%- endfor %}\n {{- '<|im_end|>\n' }}\n {%- elif message.role == "tool" %}\n {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != "tool") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\n<tool_response>\n' }}\n {{- message.content }}\n {{- '\n</tool_response>' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}\n {{- '<|im_end|>\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\n' }}\n{%- endif %}\n", "stop_token_ids": [ 151643, 151644, 151645 ], "stop": [ "<|endoftext|>", "<|im_start|>", "<|im_end|>" ] } 2、相关的配置如下：（以命令行形式体现） xinference launch --model-name Qwen2___5-14B-Instruct --model-type LLM --model-engine Transformers --model-format pytorch --size-in-billions 14 --quantization none --n-gpu 1 --replica 1 --n-worker 1 3、启动聊天窗口，输入“你好啊”，界面提示： Chat completion stream got an error: [address=0.0.0.0:46583, pid=230] can only concatenate str (not "NoneType") to str，后台详细运行日志如下： 2025-04-18 23:47:25,421 transformers.generation.configuration_utils 230 INFO loading configuration file /xinference/cache/Qwen2___5-14B-Instruct-pytorch-14b/generation_config.json loading configuration file /xinference/cache/Qwen2___5-14B-Instruct-pytorch-14b/generation_config.json 2025-04-18 23:47:25,422 transformers.generation.configuration_utils 230 INFO Generate config GenerationConfig { "bos_token_id": 151643, "do_sample": true, "eos_token_id": [ 151645, 151643 ], "pad_token_id": 151643, "repetition_penalty": 1.05, "temperature": 0.7, "top_k": 20, "top_p": 0.8 }

Generate config GenerationConfig { "bos_token_id": 151643, "do_sample": true, "eos_token_id": [ 151645, 151643 ], "pad_token_id": 151643, "repetition_penalty": 1.05, "temperature": 0.7, "top_k": 20, "top_p": 0.8 }

2025-04-18 23:48:47,980 xinference.core.supervisor 61 DEBUG [request 58693876-1cea-11f0-b69c-0242ac130003] Enter describe_model, args: <xinference.core.supervisor.SupervisorActor object at 0x7f435cbdde40>,Qwen2___5-14B-Instruct, kwargs: 2025-04-18 23:48:48,003 xinference.core.worker 61 DEBUG Enter describe_model, args: <xinference.core.worker.WorkerActor object at 0x7f435cc10130>, kwargs: model_uid=Qwen2___5-14B-Instruct-0 2025-04-18 23:48:48,008 xinference.core.worker 61 DEBUG Leave describe_model, elapsed time: 0 s 2025-04-18 23:48:48,009 xinference.core.supervisor 61 DEBUG [request 58693876-1cea-11f0-b69c-0242ac130003] Leave describe_model, elapsed time: 0 s 2025-04-18 23:48:48,088 xinference.core.supervisor 61 DEBUG [request 587b46ce-1cea-11f0-b69c-0242ac130003] Enter get_model, args: <xinference.core.supervisor.SupervisorActor object at 0x7f435cbdde40>,Qwen2___5-14B-Instruct, kwargs: 2025-04-18 23:48:48,089 xinference.core.worker 61 DEBUG Enter get_model, args: <xinference.core.worker.WorkerActor object at 0x7f435cc10130>, kwargs: model_uid=Qwen2___5-14B-Instruct-0 2025-04-18 23:48:48,090 xinference.core.worker 61 DEBUG Leave get_model, elapsed time: 0 s 2025-04-18 23:48:48,090 xinference.core.supervisor 61 DEBUG [request 587b46ce-1cea-11f0-b69c-0242ac130003] Leave get_model, elapsed time: 0 s 2025-04-18 23:48:48,099 xinference.core.supervisor 61 DEBUG [request 587d06c6-1cea-11f0-b69c-0242ac130003] Enter describe_model, args: <xinference.core.supervisor.SupervisorActor object at 0x7f435cbdde40>,Qwen2___5-14B-Instruct, kwargs: 2025-04-18 23:48:48,100 xinference.core.worker 61 DEBUG Enter describe_model, args: <xinference.core.worker.WorkerActor object at 0x7f435cc10130>, kwargs: model_uid=Qwen2___5-14B-Instruct-0 2025-04-18 23:48:48,100 xinference.core.worker 61 DEBUG Leave describe_model, elapsed time: 0 s 2025-04-18 23:48:48,100 xinference.core.supervisor 61 DEBUG [request 587d06c6-1cea-11f0-b69c-0242ac130003] Leave describe_model, elapsed time: 0 s 2025-04-18 23:48:48,165 xinference.core.model 230 DEBUG Request chat, current serve request count: 0, request limit: inf for the model Qwen2___5-14B-Instruct 2025-04-18 23:48:48,196 xinference.core.model 230 DEBUG [request 588af45c-1cea-11f0-820f-0242ac130003] Enter chat, args: ModelActor(Qwen2___5-14B-Instruct-0),[{'role': 'user', 'content': '你好啊'}, {'role': 'assistant', 'content': None}, {'role': 'user', 'conte...,{'max_tokens': 984, 'temperature': 0.3, 'stream': True, 'lora_name': ''}, kwargs: raw_params={'max_tokens': 984, 'temperature': 0.3, 'stream': True, 'lora_name': ''} 2025-04-18 23:48:48,225 xinference.core.model 230 DEBUG [request 588af45c-1cea-11f0-820f-0242ac130003] Leave chat, elapsed time: 0 s 2025-04-18 23:48:48,229 xinference.core.model 230 DEBUG After request chat, current serve request count: 0 for the model Qwen2___5-14B-Instruct 2025-04-18 23:51:32,435 xinference.model.llm.transformers.utils 230 DEBUG Average throughput for a step: 0.06882436018044197 token/s. 2025-04-18 23:51:32,697 xinference.model.llm.utils 230 WARNING tokenizer.apply_chat_template error. Maybe this is an old model: can only concatenate str (not "NoneType") to str 2025-04-18 23:51:32,767 xinference.model.llm.transformers.core 230 ERROR prepare inference error with can only concatenate str (not "NoneType") to str Traceback (most recent call last): File "/usr/local/lib/python3.10/dist-packages/xinference/model/llm/utils.py", line 139, in get_full_context full_context = tokenizer.apply_chat_template( File "/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py", line 1695, in apply_chat_template rendered_chat = compiled_template.render( File "/usr/local/lib/python3.10/dist-packages/jinja2/environment.py", line 1304, in render self.environment.handle_exception() File "/usr/local/lib/python3.10/dist-packages/jinja2/environment.py", line 939, in handle_exception raise rewrite_traceback_stack(source=source) File "

During handling of the above exception, another exception occurred:

Traceback (most recent call last): File "/usr/local/lib/python3.10/dist-packages/xinference/model/llm/transformers/core.py", line 713, in prepare_batch_inference r.full_prompt = self._get_full_prompt(r.prompt, tools) File "/usr/local/lib/python3.10/dist-packages/xinference/model/llm/transformers/core.py", line 699, in _get_full_prompt full_prompt = self.get_full_context( File "/usr/local/lib/python3.10/dist-packages/xinference/model/llm/utils.py", line 151, in get_full_context return self._build_from_raw_template(messages, chat_template, **kwargs) File "/usr/local/lib/python3.10/dist-packages/xinference/model/llm/utils.py", line 120, in _build_from_raw_template rendered = compiled_template.render( File "/usr/local/lib/python3.10/dist-packages/jinja2/environment.py", line 1304, in render self.environment.handle_exception() File "/usr/local/lib/python3.10/dist-packages/jinja2/environment.py", line 939, in handle_exception raise rewrite_traceback_stack(source=source) File "

Expected behavior / 期待表现

希望能帮忙看看这个问题到底是怎么回事，提供建议解决。

Apr 19 '25 07:04 leo-bot

输入nvidia-smi，发现显存确实被占用，但是调用模型却失败。资源占用情况： 0% 48C P3 41W / 220W | 9124MiB / 12282MiB | 10% Default 进程情况： 0 N/A N/A 230 C /python3.10 N/A

Apr 19 '25 07:04 leo-bot

This issue is stale because it has been open for 7 days with no activity.

Apr 26 '25 19:04 github-actions[bot]

This issue was closed because it has been inactive for 5 days since being marked as stale.

May 01 '25 19:05 github-actions[bot]