Starting a custom registered model succeeds, but initiating a chat session fails
System Info
CUDA Version: 12.6
Running Xinference with Docker?
- [x] docker
- [ ] pip install
- [ ] installation from source
Version info
xinference, version 1.3.1
The command used to start Xinference
The docker-compose.yml file is as follows:
```yaml
services:
  xinference:
    image: swr.cn-north-4.myhuaweicloud.com/ddn-k8s/docker.io/xprobe/xinference:v1.3.1
    container_name: xinference
    ports:
      - "9997:9997"
    volumes:
      - /home/eddie/dev/docker-service/data/xinference/.xinference:/root/.xinference
      - /home/eddie/dev/docker-service/data/xinference/.cache/huggingface:/root/.cache/huggingface
      - /home/eddie/dev/docker-service/data/xinference/.cache/modelScope:/root/.cache/modelScope
      - /home/eddie/dev/docker-service/data/xinference/log:/xinference/logs
    environment:
      - XINFERENCE_HOME=/xinference
      - XINFERENCE_MODEL_SRC=modelscope
    restart: always
    command: xinference-local -H 0.0.0.0 --log-level debug
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              capabilities: ["gpu"]
              count: all
```
Reproduction
The steps I performed are as follows:
1. Register the local model with the following JSON:
```json
{
  "version": 1,
  "model_name": "Qwen2___5-14B-Instruct",
  "model_description": "Qwen2___5-14B-Instruct",
  "context_length": 8000,
  "model_lang": [
    "en",
    "zh"
  ],
  "model_ability": [
    "chat",
    "generate"
  ],
  "model_family": "qwen2.5-instruct",
  "model_specs": [
    {
      "model_uri": "/root/.cache/modelScope/models/qwen/Qwen2___5-14B-Instruct",
      "model_size_in_billions": 14,
      "model_format": "pytorch",
      "quantizations": [
        "none"
      ]
    }
  ],
  "chat_template": "{%- if tools %}\n {{- '<|im_start|>system\n' }}\n {%- if messages[0]['role'] == 'system' %}\n {{- messages[0]['content'] }}\n {%- else %}\n {{- 'You are Qwen, created by Alibaba Cloud. You are a helpful assistant.' }}\n {%- endif %}\n {{- "\n\n# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within
```
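For reference, the non-template fields of the registration above can be sanity-checked locally before registering. This is a minimal sketch, not part of Xinference: the `chat_template` is omitted here because it is truncated in the paste above, and should be filled in from the model's own `tokenizer_config.json`.

```python
import json

# Minimal copy of the registration payload above, with chat_template left out
# (it is truncated in the pasted issue; copy it from tokenizer_config.json).
registration = {
    "version": 1,
    "model_name": "Qwen2___5-14B-Instruct",
    "model_description": "Qwen2___5-14B-Instruct",
    "context_length": 8000,
    "model_lang": ["en", "zh"],
    "model_ability": ["chat", "generate"],
    "model_family": "qwen2.5-instruct",
    "model_specs": [
        {
            "model_uri": "/root/.cache/modelScope/models/qwen/Qwen2___5-14B-Instruct",
            "model_size_in_billions": 14,
            "model_format": "pytorch",
            "quantizations": ["none"],
        }
    ],
}

# Round-trip through json to confirm the payload serializes to valid JSON.
payload = json.dumps(registration, ensure_ascii=False)
assert json.loads(payload)["model_family"] == "qwen2.5-instruct"
print("registration JSON is well-formed")
```

If the file parses cleanly, it can then be registered (per the Xinference custom-model docs, which may differ across versions) with something like `xinference register --model-type LLM --file <your-file>.json --persist`.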
2. Launch the model. Loading succeeds and the generation config is printed:
```
Generate config GenerationConfig {
  "bos_token_id": 151643,
  "do_sample": true,
  "eos_token_id": [
    151645,
    151643
  ],
  "pad_token_id": 151643,
  "repetition_penalty": 1.05,
  "temperature": 0.7,
  "top_k": 20,
  "top_p": 0.8
}
```
3. Start a chat session. The server log shows the request entering and then failing during chat-template rendering:
```
2025-04-18 23:48:47,980 xinference.core.supervisor 61 DEBUG [request 58693876-1cea-11f0-b69c-0242ac130003] Enter describe_model, args: <xinference.core.supervisor.SupervisorActor object at 0x7f435cbdde40>,Qwen2___5-14B-Instruct, kwargs:
2025-04-18 23:48:48,003 xinference.core.worker 61 DEBUG Enter describe_model, args: <xinference.core.worker.WorkerActor object at 0x7f435cc10130>, kwargs: model_uid=Qwen2___5-14B-Instruct-0
2025-04-18 23:48:48,008 xinference.core.worker 61 DEBUG Leave describe_model, elapsed time: 0 s
2025-04-18 23:48:48,009 xinference.core.supervisor 61 DEBUG [request 58693876-1cea-11f0-b69c-0242ac130003] Leave describe_model, elapsed time: 0 s
2025-04-18 23:48:48,088 xinference.core.supervisor 61 DEBUG [request 587b46ce-1cea-11f0-b69c-0242ac130003] Enter get_model, args: <xinference.core.supervisor.SupervisorActor object at 0x7f435cbdde40>,Qwen2___5-14B-Instruct, kwargs:
2025-04-18 23:48:48,089 xinference.core.worker 61 DEBUG Enter get_model, args: <xinference.core.worker.WorkerActor object at 0x7f435cc10130>, kwargs: model_uid=Qwen2___5-14B-Instruct-0
2025-04-18 23:48:48,090 xinference.core.worker 61 DEBUG Leave get_model, elapsed time: 0 s
2025-04-18 23:48:48,090 xinference.core.supervisor 61 DEBUG [request 587b46ce-1cea-11f0-b69c-0242ac130003] Leave get_model, elapsed time: 0 s
2025-04-18 23:48:48,099 xinference.core.supervisor 61 DEBUG [request 587d06c6-1cea-11f0-b69c-0242ac130003] Enter describe_model, args: <xinference.core.supervisor.SupervisorActor object at 0x7f435cbdde40>,Qwen2___5-14B-Instruct, kwargs:
2025-04-18 23:48:48,100 xinference.core.worker 61 DEBUG Enter describe_model, args: <xinference.core.worker.WorkerActor object at 0x7f435cc10130>, kwargs: model_uid=Qwen2___5-14B-Instruct-0
2025-04-18 23:48:48,100 xinference.core.worker 61 DEBUG Leave describe_model, elapsed time: 0 s
2025-04-18 23:48:48,100 xinference.core.supervisor 61 DEBUG [request 587d06c6-1cea-11f0-b69c-0242ac130003] Leave describe_model, elapsed time: 0 s
2025-04-18 23:48:48,165 xinference.core.model 230 DEBUG Request chat, current serve request count: 0, request limit: inf for the model Qwen2___5-14B-Instruct
2025-04-18 23:48:48,196 xinference.core.model 230 DEBUG [request 588af45c-1cea-11f0-820f-0242ac130003] Enter chat, args: ModelActor(Qwen2___5-14B-Instruct-0),[{'role': 'user', 'content': '你好啊'}, {'role': 'assistant', 'content': None}, {'role': 'user', 'conte...,{'max_tokens': 984, 'temperature': 0.3, 'stream': True, 'lora_name': ''}, kwargs: raw_params={'max_tokens': 984, 'temperature': 0.3, 'stream': True, 'lora_name': ''}
2025-04-18 23:48:48,225 xinference.core.model 230 DEBUG [request 588af45c-1cea-11f0-820f-0242ac130003] Leave chat, elapsed time: 0 s
2025-04-18 23:48:48,229 xinference.core.model 230 DEBUG After request chat, current serve request count: 0 for the model Qwen2___5-14B-Instruct
2025-04-18 23:51:32,435 xinference.model.llm.transformers.utils 230 DEBUG Average throughput for a step: 0.06882436018044197 token/s.
2025-04-18 23:51:32,697 xinference.model.llm.utils 230 WARNING tokenizer.apply_chat_template error.
```
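Note that the request history logged above contains an assistant message with `'content': None`; that null content is what the Jinja chat template later trips over. As a hedged client-side workaround (a hypothetical helper, not part of Xinference), null contents can be blanked out before the request is sent:

```python
def sanitize_messages(messages):
    """Replace None content with an empty string so that string
    concatenation inside the chat template never sees NoneType."""
    return [{**m, "content": m.get("content") or ""} for m in messages]

# History mirroring the failing request from the log above.
history = [
    {"role": "user", "content": "你好啊"},
    {"role": "assistant", "content": None},  # the problematic entry
    {"role": "user", "content": "..."},
]

clean = sanitize_messages(history)
assert clean[1]["content"] == ""  # None replaced, other fields untouched
```

Whether the client should send `None` content at all, or whether Xinference should tolerate it, is part of what this issue is asking about.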
```
Maybe this is an old model: can only concatenate str (not "NoneType") to str
2025-04-18 23:51:32,767 xinference.model.llm.transformers.core 230 ERROR prepare inference error with can only concatenate str (not "NoneType") to str
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/xinference/model/llm/utils.py", line 139, in get_full_context
    full_context = tokenizer.apply_chat_template(
  File "/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py", line 1695, in apply_chat_template
    rendered_chat = compiled_template.render(
  File "/usr/local/lib/python3.10/dist-packages/jinja2/environment.py", line 1304, in render
    self.environment.handle_exception()
  File "/usr/local/lib/python3.10/dist-packages/jinja2/environment.py", line 939, in handle_exception
    raise rewrite_traceback_stack(source=source)
  File "<template>", line 44, in top-level template code
TypeError: can only concatenate str (not "NoneType") to str

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/xinference/model/llm/transformers/core.py", line 713, in prepare_batch_inference
    r.full_prompt = self._get_full_prompt(r.prompt, tools)
  File "/usr/local/lib/python3.10/dist-packages/xinference/model/llm/transformers/core.py", line 699, in _get_full_prompt
    full_prompt = self.get_full_context(
  File "/usr/local/lib/python3.10/dist-packages/xinference/model/llm/utils.py", line 151, in get_full_context
    return self._build_from_raw_template(messages, chat_template, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/xinference/model/llm/utils.py", line 120, in _build_from_raw_template
    rendered = compiled_template.render(
  File "/usr/local/lib/python3.10/dist-packages/jinja2/environment.py", line 1304, in render
    self.environment.handle_exception()
  File "/usr/local/lib/python3.10/dist-packages/jinja2/environment.py", line 939, in handle_exception
    raise rewrite_traceback_stack(source=source)
  File "<template>", line 44, in top-level template code
TypeError: can only concatenate str (not "NoneType") to str
Destroy generator 589074681cea11f0820f0242ac130003 due to an error encountered.
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/xoscar/api.py", line 419, in __xoscar_next__
    r = await asyncio.create_task(_async_wrapper(gen))
  File "/usr/local/lib/python3.10/dist-packages/xoscar/api.py", line 409, in _async_wrapper
    return await _gen.__anext__()  # noqa: F821
  File "/usr/local/lib/python3.10/dist-packages/xinference/core/model.py", line 569, in _to_async_gen
    async for v in gen:
  File "/usr/local/lib/python3.10/dist-packages/xinference/core/model.py", line 762, in _queue_consumer
    raise RuntimeError(res[len(XINFERENCE_STREAMING_ERROR_FLAG) :])
RuntimeError: can only concatenate str (not "NoneType") to str
2025-04-18 23:51:33,091 xinference.api.restful_api 1 ERROR Chat completion stream got an error: [address=0.0.0.0:46583, pid=230] can only concatenate str (not "NoneType") to str
```
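The underlying `TypeError` is easy to reproduce outside Jinja: the chat template concatenates a message's content with a string literal, and a `None` content hits Python's `str + None` restriction. A minimal stdlib sketch (not Xinference's actual template code):

```python
content = None  # mirrors {'role': 'assistant', 'content': None} in the request

try:
    # Roughly what the Jinja '+' / '{{- ... }}' concatenation compiles to.
    rendered = "<|im_start|>assistant\n" + content
except TypeError as exc:
    message = str(exc)

# Exactly the error surfaced in the logs above.
assert message == 'can only concatenate str (not "NoneType") to str'
```

This suggests either the client should not send messages with null content, or the template/server side should coerce `None` to `""` before rendering.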
Expected behavior
I hope someone can help figure out what is actually going wrong here and suggest a fix.
Running `nvidia-smi` shows that GPU memory is indeed occupied, yet calling the model fails.
Resource usage:
```
0% 48C P3 41W / 220W | 9124MiB / 12282MiB | 10% Default
```
Process:
```
0 N/A N/A 230 C /python3.10 N/A
```
This issue is stale because it has been open for 7 days with no activity.
This issue was closed because it has been inactive for 5 days since being marked as stale.