
DeepSeek-R1-Distill-Qwen-14B-GGUF and deepseek-r1-distill-qwen-14b-awq both fail to load

Open worm128 opened this issue 10 months ago • 9 comments

System Info

Both DeepSeek-R1-Distill-Qwen-14B-GGUF and deepseek-r1-distill-qwen-14b-awq fail to load. OS: Windows 11. GPU: 4070 Ti Super.

Running Xinference with Docker?

- [x] docker
- [ ] pip install
- [ ] installation from source

Version info

Xinference version: v1.2.1. Driver: 560.94. CUDA: cuda_12.3.r12.3.

The command used to start Xinference

DeepSeek-R1-Distill-Qwen-14B-GGUF configuration:

{ "version": 1, "context_length": 12800, "model_name": "DeepSeek-R1-Distill-Qwen-14B-GGUF", "model_lang": [ "en", "zh", "ch" ], "model_ability": [ "generate", "chat" ], "model_description": "This is a custom model description.", "model_family": "deepseek-r1-distill-qwen", "model_specs": [ { "model_format": "ggufv2", "model_size_in_billions": 14, "quantizations": [ "Q6_K" ], "model_id": null, "model_file_name_template": "DeepSeek-R1-Distill-Qwen-14B-GGUF", "model_file_name_split_template": null, "quantization_parts": null, "model_hub": "huggingface", "model_uri": "/data", "model_revision": null } ], "chat_template": "{% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}{% set ns = namespace(is_first=false, is_tool=false, is_output_first=true, system_prompt='') %}{%- for message in messages %}{%- if message['role'] == 'system' %}{% set ns.system_prompt = message['content'] %}{%- endif %}{%- endfor %}{{bos_token}}{{ns.system_prompt}}{%- for message in messages %}{%- if message['role'] == 'user' %}{%- set ns.is_tool = false -%}{{'<|User|>' + message['content']}}{%- endif %}{%- if message['role'] == 'assistant' and message['content'] is none %}{%- set ns.is_tool = false -%}{%- for tool in message['tool_calls']%}{%- if not ns.is_first %}{{'<|Assistant|><|tool▁calls▁begin|><|tool▁call▁begin|>' + tool['type'] + '<|tool▁sep|>' + tool['function']['name'] + '\n' + 'json' + '\n' + tool['function']['arguments'] + '\n' + '' + '<|tool▁call▁end|>'}}{%- set ns.is_first = true -%}{%- else %}{{'\n' + '<|tool▁call▁begin|>' + tool['type'] + '<|tool▁sep|>' + tool['function']['name'] + '\n' + 'json' + '\n' + tool['function']['arguments'] + '\n' + '' + '<|tool▁call▁end|>'}}{{'<|tool▁calls▁end|><|end▁of▁sentence|>'}}{%- endif %}{%- endfor %}{%- endif %}{%- if message['role'] == 'assistant' and message['content'] is not none %}{%- if ns.is_tool %}{{'<|tool▁outputs▁end|>' + message['content'] + '<|end▁of▁sentence|>'}}{%- set ns.is_tool = false -%}{%- else %}{% set content = message['content'] %}{% if '' in content %}{% set content = content.split('')[-1] %}{% endif %}{{'<|Assistant|>' + content + '<|end▁of▁sentence|>'}}{%- endif %}{%- endif %}{%- if message['role'] == 'tool' %}{%- set ns.is_tool = true -%}{%- if ns.is_output_first %}{{'<|tool▁outputs▁begin|><|tool▁output▁begin|>' + message['content'] + '<|tool▁output▁end|>'}}{%- set ns.is_output_first = false %}{%- else %}{{'\n<|tool▁output▁begin|>' + message['content'] + '<|tool▁output▁end|>'}}{%- endif %}{%- endif %}{%- endfor -%}{% if ns.is_tool %}{{'<|tool▁outputs▁end|>'}}{% endif %}{% if add_generation_prompt and not ns.is_tool %}{{'<|Assistant|>'}}{% endif %}", "stop_token_ids": [ 151643 ], "stop": [ "<|end▁of▁sentence|>" ], "is_builtin": false }

deepseek-r1-distill-qwen-14b-awq configuration:

{ "version": 1, "context_length": 12800, "model_name": "deepseek-r1-distill-qwen-14b-awq", "model_lang": [ "en", "zh" ], "model_ability": [ "generate", "chat" ], "model_description": "This is a custom model description.", "model_family": "deepseek-r1-distill-qwen", "model_specs": [ { "model_format": "awq", "model_size_in_billions": 14, "quantizations": [ "Int4" ], "model_id": null, "model_hub": "huggingface", "model_uri": "/data/deepseek-r1-distill-qwen-14b-awq", "model_revision": null } ], "chat_template": "{% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}{% set ns = namespace(is_first=false, is_tool=false, is_output_first=true, system_prompt='') %}{%- for message in messages %}{%- if message['role'] == 'system' %}{% set ns.system_prompt = message['content'] %}{%- endif %}{%- endfor %}{{bos_token}}{{ns.system_prompt}}{%- for message in messages %}{%- if message['role'] == 'user' %}{%- set ns.is_tool = false -%}{{'<|User|>' + message['content']}}{%- endif %}{%- if message['role'] == 'assistant' and message['content'] is none %}{%- set ns.is_tool = false -%}{%- for tool in message['tool_calls']%}{%- if not ns.is_first %}{{'<|Assistant|><|tool▁calls▁begin|><|tool▁call▁begin|>' + tool['type'] + '<|tool▁sep|>' + tool['function']['name'] + '\n' + 'json' + '\n' + tool['function']['arguments'] + '\n' + '' + '<|tool▁call▁end|>'}}{%- set ns.is_first = true -%}{%- else %}{{'\n' + '<|tool▁call▁begin|>' + tool['type'] + '<|tool▁sep|>' + tool['function']['name'] + '\n' + 'json' + '\n' + tool['function']['arguments'] + '\n' + '' + '<|tool▁call▁end|>'}}{{'<|tool▁calls▁end|><|end▁of▁sentence|>'}}{%- endif %}{%- endfor %}{%- endif %}{%- if message['role'] == 'assistant' and message['content'] is not none %}{%- if ns.is_tool %}{{'<|tool▁outputs▁end|>' + message['content'] + '<|end▁of▁sentence|>'}}{%- set ns.is_tool = false -%}{%- else %}{% set content = message['content'] %}{% if '' in content %}{% set content = content.split('')[-1] %}{% endif %}{{'<|Assistant|>' + content + '<|end▁of▁sentence|>'}}{%- endif %}{%- endif %}{%- if message['role'] == 'tool' %}{%- set ns.is_tool = true -%}{%- if ns.is_output_first %}{{'<|tool▁outputs▁begin|><|tool▁output▁begin|>' + message['content'] + '<|tool▁output▁end|>'}}{%- set ns.is_output_first = false %}{%- else %}{{'\n<|tool▁output▁begin|>' + message['content'] + '<|tool▁output▁end|>'}}{%- endif %}{%- endif %}{%- endfor -%}{% if ns.is_tool %}{{'<|tool▁outputs▁end|>'}}{% endif %}{% if add_generation_prompt and not ns.is_tool %}{{'<|Assistant|>'}}{% endif %}", "stop_token_ids": [ 151643 ], "stop": [ "<|end▁of▁sentence|>" ], "is_builtin": false }

Reproduction

DeepSeek-R1-Distill-Qwen-14B-GGUF error message:

For GGUF I'm using the llama.cpp engine:

```
2025-02-04 04:37:52,016 xinference.core.worker 45 INFO [request d977912c-e2f4-11ef-bd47-0242ac110003] Enter launch_builtin_model, args: <xinference.core.worker.WorkerActor object at 0x7f7baf8f8270>, kwargs: model_uid=DeepSeek-R1-Distill-Qwen-14B-GGUF-0,model_name=DeepSeek-R1-Distill-Qwen-14B-GGUF,model_size_in_billions=14,model_format=ggufv2,quantization=Q6_K,model_engine=llama.cpp,model_type=LLM,n_gpu=auto,request_limits=None,peft_model_config=None,gpu_idx=[0],download_hub=None,model_path=None,xavier_config=None
2025-02-04 04:37:52,017 xinference.core.worker 45 INFO You specify to launch the model: DeepSeek-R1-Distill-Qwen-14B-GGUF on GPU index: [0] of the worker: 0.0.0.0:52371, xinference will automatically ignore the n_gpu option.
2025-02-04 04:37:52,581 xinference.model.llm.llm_family 45 INFO Caching from URI: /data
2025-02-04 04:37:52,586 xinference.model.llm.llm_family 45 INFO Cache /data exists
WARNING 02-04 04:37:55 cuda.py:81] Detected different devices in the system:
WARNING 02-04 04:37:55 cuda.py:81] NVIDIA GeForce RTX 2080 Ti
WARNING 02-04 04:37:55 cuda.py:81] NVIDIA GeForce RTX 3090 Ti
WARNING 02-04 04:37:55 cuda.py:81] Please make sure to set CUDA_DEVICE_ORDER=PCI_BUS_ID to avoid unexpected behavior.
2025-02-04 04:38:42,145 xinference.core.model 66 INFO Start requests handler.
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: yes
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes
llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 3090 Ti) - 23287 MiB free
gguf_init_from_file: invalid magic characters ''
llama_model_load: error loading model: llama_model_loader: failed to load model from /data/DeepSeek-R1-Distill-Qwen-14B-GGUF
llama_load_model_from_file: failed to load model
2025-02-04 04:38:42,417 xinference.core.worker 45 ERROR Failed to load model DeepSeek-R1-Distill-Qwen-14B-GGUF-0
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/xinference/core/worker.py", line 908, in launch_builtin_model
    await model_ref.load()
  File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/context.py", line 231, in send
    return self._process_result_message(result)
  File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/context.py", line 102, in _process_result_message
    raise message.as_instanceof_cause()
  File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/pool.py", line 667, in send
    result = await self._run_coro(message.message_id, coro)
  File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/pool.py", line 370, in _run_coro
    return await coro
  File "/usr/local/lib/python3.10/dist-packages/xoscar/api.py", line 384, in on_receive
    return await super().on_receive(message)  # type: ignore
  File "xoscar/core.pyx", line 558, in on_receive
    raise ex
  File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.on_receive
    async with self._lock:
  File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.on_receive
    with debug_async_timeout('actor_lock_timeout',
  File "xoscar/core.pyx", line 526, in xoscar.core._BaseActor.on_receive
    result = await result
  File "/usr/local/lib/python3.10/dist-packages/xinference/core/model.py", line 457, in load
    self._model.load()
  File "/usr/local/lib/python3.10/dist-packages/xinference/model/llm/llama_cpp/core.py", line 140, in load
    self._llm = Llama(
  File "/usr/local/lib/python3.10/dist-packages/llama_cpp/llama.py", line 369, in __init__
    internals.LlamaModel(
  File "/usr/local/lib/python3.10/dist-packages/llama_cpp/_internals.py", line 56, in __init__
    raise ValueError(f"Failed to load model from file: {path_model}")
ValueError: [address=0.0.0.0:40445, pid=66] Failed to load model from file: /data/DeepSeek-R1-Distill-Qwen-14B-GGUF
2025-02-04 04:38:42,479 xinference.core.worker 45 ERROR [request d977912c-e2f4-11ef-bd47-0242ac110003] Leave launch_builtin_model, error: [address=0.0.0.0:40445, pid=66] Failed to load model from file: /data/DeepSeek-R1-Distill-Qwen-14B-GGUF, elapsed time: 50 s
```
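The `gguf_init_from_file: invalid magic characters` line means llama.cpp did not find a GGUF file at the given path: a real GGUF file starts with the 4-byte magic `GGUF`, and here the path resolves to `/data/DeepSeek-R1-Distill-Qwen-14B-GGUF`, which is not such a file. A quick sanity check, with a made-up `.gguf` file name:

```python
# Verify that the target really is a GGUF file: its first four bytes must be b"GGUF".
# The file name below is hypothetical -- substitute the actual .gguf file on disk.
path = "/data/DeepSeek-R1-Distill-Qwen-14B-Q6_K.gguf"
with open(path, "rb") as f:
    print(f.read(4))  # expected output: b'GGUF'
```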

deepseek-r1-distill-qwen-14b-awq error message:

For AWQ I'm using the vLLM engine:

```
2025-02-04 04:45:21,157 transformers.models.auto.image_processing_auto 114 INFO Could not locate the image processor configuration file, will try to use the model config instead.
Could not locate the image processor configuration file, will try to use the model config instead.
INFO 02-04 04:45:26 config.py:350] This model supports multiple tasks: {'embedding', 'generate'}. Defaulting to 'generate'.
WARNING 02-04 04:45:26 config.py:428] awq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
WARNING 02-04 04:45:26 arg_utils.py:1013] Chunked prefill is enabled by default for models with max_model_len > 32K. Currently, chunked prefill might not work with some features or models. If you encounter any issues, please disable chunked prefill by setting --enable-chunked-prefill=False.
INFO 02-04 04:45:26 config.py:1136] Chunked prefill is enabled with max_num_batched_tokens=512.
2025-02-04 04:45:26,599 xinference.core.worker 45 ERROR Failed to load model deepseek-r1-distill-qwen-14b-awq-0
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/xinference/core/worker.py", line 908, in launch_builtin_model
    await model_ref.load()
  File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/context.py", line 231, in send
    return self._process_result_message(result)
  File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/context.py", line 102, in _process_result_message
    raise message.as_instanceof_cause()
  File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/pool.py", line 667, in send
    result = await self._run_coro(message.message_id, coro)
  File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/pool.py", line 370, in _run_coro
    return await coro
  File "/usr/local/lib/python3.10/dist-packages/xoscar/api.py", line 384, in on_receive
    return await super().on_receive(message)  # type: ignore
  File "xoscar/core.pyx", line 558, in on_receive
    raise ex
  File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.on_receive
    async with self._lock:
  File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.on_receive
    with debug_async_timeout('actor_lock_timeout',
  File "xoscar/core.pyx", line 526, in xoscar.core._BaseActor.on_receive
    result = await result
  File "/usr/local/lib/python3.10/dist-packages/xinference/core/model.py", line 457, in load
    self._model.load()
  File "/usr/local/lib/python3.10/dist-packages/xinference/model/llm/vllm/core.py", line 304, in load
    self._engine = AsyncLLMEngine.from_engine_args(engine_args)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 683, in from_engine_args
    engine_config = engine_args.create_engine_config()
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/arg_utils.py", line 1143, in create_engine_config
    return VllmConfig(
  File "<string>", line 15, in __init__
  File "/usr/local/lib/python3.10/dist-packages/vllm/config.py", line 2135, in __post_init__
    self.quant_config = VllmConfig._get_quantization_config(
  File "/usr/local/lib/python3.10/dist-packages/vllm/config.py", line 2100, in _get_quantization_config
    raise ValueError(
ValueError: [address=0.0.0.0:37339, pid=114] torch.bfloat16 is not supported for quantization method awq. Supported dtypes: [torch.float16]
2025-02-04 04:45:26,628 xinference.core.worker 45 ERROR [request c7133e5e-e2f5-11ef-bd47-0242ac110003] Leave launch_builtin_model, error: [address=0.0.0.0:37339, pid=114] torch.bfloat16 is not supported for quantization method awq. Supported dtypes: [torch.float16], elapsed time: 55 s
```
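The final ValueError is raised by vLLM's quantization check: AWQ kernels only support float16, and the engine's default `dtype="auto"` presumably picked up bfloat16 from the checkpoint's config. A minimal sketch of the same constraint at the vLLM level, assuming a local model path; this is not the exact call Xinference makes:

```python
# Hypothetical sketch: load the AWQ checkpoint with vLLM directly.
# With the default dtype ("auto" -> bfloat16 for this model) the ValueError above
# is reproduced; forcing float16 satisfies the AWQ requirement.
from vllm import LLM

llm = LLM(
    model="/data/deepseek-r1-distill-qwen-14b-awq",
    quantization="awq",
    dtype="float16",
)
```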

Expected behavior

The model launches successfully.

worm128 · Feb 04, 2025

I'm running into the same issue.

bleakie · Feb 06, 2025

This issue is stale because it has been open for 7 days with no activity.

github-actions[bot] · Feb 13, 2025

> I'm running into the same issue.

My problem is partly solved: AWQ and GPTQ quantized models launched with the vLLM engine start as long as the parameter dtype=float16 is set. My GPU is a 3090 Ti; I don't know why this parameter is needed, maybe the Int4 quantized model doesn't support bfloat16 or float32. However, loading the GGUF model with the llama.cpp engine still fails, with the error gguf_init_from_file: invalid magic characters '', Failed to load model from file: /data/DeepSeek-R1-Distill-Qwen-14B-GGUF.

[screenshots]
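For reference, a minimal sketch of that dtype workaround through the Xinference Python client; the endpoint and the assumption that extra keyword arguments such as `dtype` are forwarded to the vLLM engine are mine, not confirmed in this thread:

```python
# Hypothetical sketch: launch the AWQ model on the vLLM engine and force float16.
from xinference.client import Client

client = Client("http://localhost:9997")  # adjust to your Xinference endpoint
model_uid = client.launch_model(
    model_name="deepseek-r1-distill-qwen-14b-awq",
    model_engine="vllm",
    model_format="awq",
    quantization="Int4",
    dtype="float16",  # AWQ kernels only accept float16, not bfloat16/float32
)
print(model_uid)
```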

worm128 · Feb 14, 2025

Solved: DeepSeek-R1-Distill-Llama-8B-GGUF now loads. For GGUF models loaded by the llama.cpp engine, the model path must end with the model file name.

[screenshots]
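To illustrate, a hypothetical `model_specs` fragment where the path ends with the GGUF file itself; the exact `.gguf` file name is invented and must match the file that actually exists on disk, and depending on the Xinference version the file name may belong in `model_uri`, `model_file_name_template`, or the "model path" field of the web UI:

```json
"model_specs": [
  {
    "model_format": "ggufv2",
    "model_size_in_billions": 14,
    "quantizations": ["Q6_K"],
    "model_uri": "/data/DeepSeek-R1-Distill-Qwen-14B-Q6_K.gguf",
    "model_file_name_template": "DeepSeek-R1-Distill-Qwen-14B-Q6_K.gguf"
  }
]
```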

worm128 · Feb 16, 2025

> Solved: DeepSeek-R1-Distill-Llama-8B-GGUF now loads. For GGUF models loaded by the llama.cpp engine, the model path must end with the model file name.

> [screenshots]

Is the llama.cpp engine faster than the transformers engine?

Alan-zhong · Feb 22, 2025

> Solved: DeepSeek-R1-Distill-Llama-8B-GGUF now loads. For GGUF models loaded by the llama.cpp engine, the model path must end with the model file name.

> [screenshots]

> Is the llama.cpp engine faster than the transformers engine?

Yes, it's fast, roughly on par with AWQ, GPTQ and the like.

worm128 · Feb 25, 2025

> Solved: DeepSeek-R1-Distill-Llama-8B-GGUF now loads. For GGUF models loaded by the llama.cpp engine, the model path must end with the model file name. [screenshots]

> Is the llama.cpp engine faster than the transformers engine?

> Yes, it's fast, roughly on par with AWQ, GPTQ and the like.

I appended the file name for DeepSeek-R1-Distill-qwen-7B-q4km GGUF, but it still doesn't load.

ccly1996 · Feb 25, 2025

> Solved: DeepSeek-R1-Distill-Llama-8B-GGUF now loads. For GGUF models loaded by the llama.cpp engine, the model path must end with the model file name. [screenshots]

> Is the llama.cpp engine faster than the transformers engine?

> Yes, it's fast, roughly on par with AWQ, GPTQ and the like.

> I appended the file name for DeepSeek-R1-Distill-qwen-7B-q4km GGUF, but it still doesn't load.

Could it be that a config file is missing?

worm128 · Feb 27, 2025

Has this been solved? I tried registering a GGUF of qwq32b and hit the same problem.

pamdla · Apr 10, 2025