Both DeepSeek-R1-Distill-Qwen-14B-GGUF and deepseek-r1-distill-qwen-14b-awq fail to load
System Info / 系統信息
Both DeepSeek-R1-Distill-Qwen-14B-GGUF and deepseek-r1-distill-qwen-14b-awq fail to load. OS: Windows 11. GPU: 4070 Ti SUPER.
Running Xinference with Docker? / 是否使用 Docker 运行 Xinfernece?
- [x] docker / docker
- [ ] pip install / 通过 pip install 安装
- [ ] installation from source / 从源码安装
Version info / 版本信息
Xinference version: v1.2.1; driver: 560.94; CUDA: cuda_12.3.r12.3
The command used to start Xinference / 用以启动 xinference 的命令
DeepSeek-R1-Distill-Qwen-14B-GGUF configuration:

```json
{
"version": 1,
"context_length": 12800,
"model_name": "DeepSeek-R1-Distill-Qwen-14B-GGUF",
"model_lang": [
"en",
"zh",
"ch"
],
"model_ability": [
"generate",
"chat"
],
"model_description": "This is a custom model description.",
"model_family": "deepseek-r1-distill-qwen",
"model_specs": [
{
"model_format": "ggufv2",
"model_size_in_billions": 14,
"quantizations": [
"Q6_K"
],
"model_id": null,
"model_file_name_template": "DeepSeek-R1-Distill-Qwen-14B-GGUF",
"model_file_name_split_template": null,
"quantization_parts": null,
"model_hub": "huggingface",
"model_uri": "/data",
"model_revision": null
}
],
"chat_template": "{% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}{% set ns = namespace(is_first=false, is_tool=false, is_output_first=true, system_prompt='') %}{%- for message in messages %}{%- if message['role'] == 'system' %}{% set ns.system_prompt = message['content'] %}{%- endif %}{%- endfor %}{{bos_token}}{{ns.system_prompt}}{%- for message in messages %}{%- if message['role'] == 'user' %}{%- set ns.is_tool = false -%}{{'<|User|>' + message['content']}}{%- endif %}{%- if message['role'] == 'assistant' and message['content'] is none %}{%- set ns.is_tool = false -%}{%- for tool in message['tool_calls']%}{%- if not ns.is_first %}{{'<|Assistant|><|tool▁calls▁begin|><|tool▁call▁begin|>' + tool['type'] + '<|tool▁sep|>' + tool['function']['name'] + '\n' + 'json' + '\n' + tool['function']['arguments'] + '\n' + '' + '<|tool▁call▁end|>'}}{%- set ns.is_first = true -%}{%- else %}{{'\n' + '<|tool▁call▁begin|>' + tool['type'] + '<|tool▁sep|>' + tool['function']['name'] + '\n' + 'json' + '\n' + tool['function']['arguments'] + '\n' + '' + '<|tool▁call▁end|>'}}{{'<|tool▁calls▁end|><|end▁of▁sentence|>'}}{%- endif %}{%- endfor %}{%- endif %}{%- if message['role'] == 'assistant' and message['content'] is not none %}{%- if ns.is_tool %}{{'<|tool▁outputs▁end|>' + message['content'] + '<|end▁of▁sentence|>'}}{%- set ns.is_tool = false -%}{%- else %}{% set content = message['content'] %}{% if '' in content %}{% set content = content.split('')[-1] %}{% endif %}{{'<|Assistant|>' + content + '<|end▁of▁sentence|>'}}{%- endif %}{%- endif %}{%- if message['role'] == 'tool' %}{%- set ns.is_tool = true -%}{%- if ns.is_output_first %}{{'<|tool▁outputs▁begin|><|tool▁output▁begin|>' + message['content'] + '<|tool▁output▁end|>'}}{%- set ns.is_output_first = false %}{%- else %}{{'\n<|tool▁output▁begin|>' + message['content'] + '<|tool▁output▁end|>'}}{%- endif %}{%- endif %}{%- endfor -%}{% if ns.is_tool %}{{'<|tool▁outputs▁end|>'}}{% endif %}{% if add_generation_prompt and not ns.is_tool %}{{'<|Assistant|>'}}{% endif %}",
"stop_token_ids": [
151643
],
"stop": [
"<|end▁of▁sentence|>"
],
"is_builtin": false
}
```
deepseek-r1-distill-qwen-14b-awq configuration:

```json
{
"version": 1,
"context_length": 12800,
"model_name": "deepseek-r1-distill-qwen-14b-awq",
"model_lang": [
"en",
"zh"
],
"model_ability": [
"generate",
"chat"
],
"model_description": "This is a custom model description.",
"model_family": "deepseek-r1-distill-qwen",
"model_specs": [
{
"model_format": "awq",
"model_size_in_billions": 14,
"quantizations": [
"Int4"
],
"model_id": null,
"model_hub": "huggingface",
"model_uri": "/data/deepseek-r1-distill-qwen-14b-awq",
"model_revision": null
}
],
"chat_template": "{% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}{% set ns = namespace(is_first=false, is_tool=false, is_output_first=true, system_prompt='') %}{%- for message in messages %}{%- if message['role'] == 'system' %}{% set ns.system_prompt = message['content'] %}{%- endif %}{%- endfor %}{{bos_token}}{{ns.system_prompt}}{%- for message in messages %}{%- if message['role'] == 'user' %}{%- set ns.is_tool = false -%}{{'<|User|>' + message['content']}}{%- endif %}{%- if message['role'] == 'assistant' and message['content'] is none %}{%- set ns.is_tool = false -%}{%- for tool in message['tool_calls']%}{%- if not ns.is_first %}{{'<|Assistant|><|tool▁calls▁begin|><|tool▁call▁begin|>' + tool['type'] + '<|tool▁sep|>' + tool['function']['name'] + '\n' + 'json' + '\n' + tool['function']['arguments'] + '\n' + '' + '<|tool▁call▁end|>'}}{%- set ns.is_first = true -%}{%- else %}{{'\n' + '<|tool▁call▁begin|>' + tool['type'] + '<|tool▁sep|>' + tool['function']['name'] + '\n' + 'json' + '\n' + tool['function']['arguments'] + '\n' + '' + '<|tool▁call▁end|>'}}{{'<|tool▁calls▁end|><|end▁of▁sentence|>'}}{%- endif %}{%- endfor %}{%- endif %}{%- if message['role'] == 'assistant' and message['content'] is not none %}{%- if ns.is_tool %}{{'<|tool▁outputs▁end|>' + message['content'] + '<|end▁of▁sentence|>'}}{%- set ns.is_tool = false -%}{%- else %}{% set content = message['content'] %}{% if '' in content %}{% set content = content.split('')[-1] %}{% endif %}{{'<|Assistant|>' + content + '<|end▁of▁sentence|>'}}{%- endif %}{%- endif %}{%- if message['role'] == 'tool' %}{%- set ns.is_tool = true -%}{%- if ns.is_output_first %}{{'<|tool▁outputs▁begin|><|tool▁output▁begin|>' + message['content'] + '<|tool▁output▁end|>'}}{%- set ns.is_output_first = false %}{%- else %}{{'\n<|tool▁output▁begin|>' + message['content'] + '<|tool▁output▁end|>'}}{%- endif %}{%- endif %}{%- endfor -%}{% if ns.is_tool %}{{'<|tool▁outputs▁end|>'}}{% endif %}{% if add_generation_prompt and not ns.is_tool %}{{'<|Assistant|>'}}{% endif %}",
"stop_token_ids": [
151643
],
"stop": [
"<|end▁of▁sentence|>"
],
"is_builtin": false
}
```
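For reference, a minimal sketch of how a custom model definition like the ones above can be registered and then launched through the Xinference Python client. This is an assumption about the workflow, not necessarily the exact steps used in this report; the endpoint URL and the local `model.json` path are placeholders.

```python
# Minimal sketch (assumptions: endpoint URL, local model.json path) of
# registering a custom model definition and launching it via the
# Xinference Python client.
from xinference.client import Client

client = Client("http://localhost:9997")  # assumed Xinference endpoint

# Register the custom model JSON shown above, saved locally as model.json.
with open("model.json", "r", encoding="utf-8") as f:
    client.register_model(model_type="LLM", model=f.read(), persist=True)

# Launch the registered AWQ model on the vLLM engine.
model_uid = client.launch_model(
    model_name="deepseek-r1-distill-qwen-14b-awq",
    model_engine="vllm",
    model_format="awq",
    model_size_in_billions=14,
    quantization="Int4",
)
print(model_uid)
```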
Reproduction / 复现过程
DeepSeek-R1-Distill-Qwen-14B-GGUF error message:
For GGUF I used the llama.cpp engine:

```
2025-02-04 20:37:52 2025-02-04 04:37:52,016 xinference.core.worker 45 INFO [request d977912c-e2f4-11ef-bd47-0242ac110003] Enter launch_builtin_model, args: <xinference.core.worker.WorkerActor object at 0x7f7baf8f8270>, kwargs: model_uid=DeepSeek-R1-Distill-Qwen-14B-GGUF-0,model_name=DeepSeek-R1-Distill-Qwen-14B-GGUF,model_size_in_billions=14,model_format=ggufv2,quantization=Q6_K,model_engine=llama.cpp,model_type=LLM,n_gpu=auto,request_limits=None,peft_model_config=None,gpu_idx=[0],download_hub=None,model_path=None,xavier_config=None
2025-02-04 20:37:52 2025-02-04 04:37:52,017 xinference.core.worker 45 INFO You specify to launch the model: DeepSeek-R1-Distill-Qwen-14B-GGUF on GPU index: [0] of the worker: 0.0.0.0:52371, xinference will automatically ignore the n_gpu option.
2025-02-04 20:37:52 2025-02-04 04:37:52,581 xinference.model.llm.llm_family 45 INFO Caching from URI: /data
2025-02-04 20:37:52 2025-02-04 04:37:52,586 xinference.model.llm.llm_family 45 INFO Cache /data exists
2025-02-04 20:37:55 WARNING 02-04 04:37:55 cuda.py:81] Detected different devices in the system:
2025-02-04 20:37:55 WARNING 02-04 04:37:55 cuda.py:81] NVIDIA GeForce RTX 2080 Ti
2025-02-04 20:37:55 WARNING 02-04 04:37:55 cuda.py:81] NVIDIA GeForce RTX 3090 Ti
2025-02-04 20:37:55 WARNING 02-04 04:37:55 cuda.py:81] Please make sure to set CUDA_DEVICE_ORDER=PCI_BUS_ID to avoid unexpected behavior.
2025-02-04 20:38:42 2025-02-04 04:38:42,145 xinference.core.model 66 INFO Start requests handler.
2025-02-04 20:38:42 ggml_cuda_init: GGML_CUDA_FORCE_MMQ: yes
2025-02-04 20:38:42 ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
2025-02-04 20:38:42 ggml_cuda_init: found 1 CUDA devices:
2025-02-04 20:38:42 Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes
2025-02-04 20:38:42 llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 3090 Ti) - 23287 MiB free
2025-02-04 20:38:42 gguf_init_from_file: invalid magic characters ''
2025-02-04 20:38:42 llama_model_load: error loading model: llama_model_loader: failed to load model from /data/DeepSeek-R1-Distill-Qwen-14B-GGUF
2025-02-04 20:38:42
2025-02-04 20:38:42 llama_load_model_from_file: failed to load model
2025-02-04 20:38:42 2025-02-04 04:38:42,417 xinference.core.worker 45 ERROR Failed to load model DeepSeek-R1-Distill-Qwen-14B-GGUF-0
2025-02-04 20:38:42 Traceback (most recent call last):
2025-02-04 20:38:42 File "/usr/local/lib/python3.10/dist-packages/xinference/core/worker.py", line 908, in launch_builtin_model
2025-02-04 20:38:42 await model_ref.load()
2025-02-04 20:38:42 File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/context.py", line 231, in send
2025-02-04 20:38:42 return self._process_result_message(result)
2025-02-04 20:38:42 File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/context.py", line 102, in _process_result_message
2025-02-04 20:38:42 raise message.as_instanceof_cause()
2025-02-04 20:38:42 File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/pool.py", line 667, in send
2025-02-04 20:38:42 result = await self._run_coro(message.message_id, coro)
2025-02-04 20:38:42 File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/pool.py", line 370, in _run_coro
2025-02-04 20:38:42 return await coro
2025-02-04 20:38:42 File "/usr/local/lib/python3.10/dist-packages/xoscar/api.py", line 384, in on_receive
2025-02-04 20:38:42 return await super().on_receive(message) # type: ignore
2025-02-04 20:38:42 File "xoscar/core.pyx", line 558, in on_receive
2025-02-04 20:38:42 raise ex
2025-02-04 20:38:42 File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.on_receive
2025-02-04 20:38:42 async with self._lock:
2025-02-04 20:38:42 File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.on_receive
2025-02-04 20:38:42 with debug_async_timeout('actor_lock_timeout',
2025-02-04 20:38:42 File "xoscar/core.pyx", line 526, in xoscar.core._BaseActor.on_receive
2025-02-04 20:38:42 result = await result
2025-02-04 20:38:42 File "/usr/local/lib/python3.10/dist-packages/xinference/core/model.py", line 457, in load
2025-02-04 20:38:42 self._model.load()
2025-02-04 20:38:42 File "/usr/local/lib/python3.10/dist-packages/xinference/model/llm/llama_cpp/core.py", line 140, in load
2025-02-04 20:38:42 self._llm = Llama(
2025-02-04 20:38:42 File "/usr/local/lib/python3.10/dist-packages/llama_cpp/llama.py", line 369, in init
2025-02-04 20:38:42 internals.LlamaModel(
2025-02-04 20:38:42 File "/usr/local/lib/python3.10/dist-packages/llama_cpp/_internals.py", line 56, in init
2025-02-04 20:38:42 raise ValueError(f"Failed to load model from file: {path_model}")
2025-02-04 20:38:42 ValueError: [address=0.0.0.0:40445, pid=66] Failed to load model from file: /data/DeepSeek-R1-Distill-Qwen-14B-GGUF
2025-02-04 20:38:42 2025-02-04 04:38:42,479 xinference.core.worker 45 ERROR [request d977912c-e2f4-11ef-bd47-0242ac110003] Leave launch_builtin_model, error: [address=0.0.0.0:40445, pid=66] Failed to load model from file: /data/DeepSeek-R1-Distill-Qwen-14B-GGUF, elapsed time: 50 s
```
deepseek-r1-distill-qwen-14b-awq error message:
For AWQ I used the vLLM engine:

```
2025-02-04 20:45:21 2025-02-04 04:45:21,157 transformers.models.auto.image_processing_auto 114 INFO Could not locate the image processor configuration file, will try to use the model config instead.
2025-02-04 20:45:21 Could not locate the image processor configuration file, will try to use the model config instead.
2025-02-04 20:45:26 INFO 02-04 04:45:26 config.py:350] This model supports multiple tasks: {'embedding', 'generate'}. Defaulting to 'generate'.
2025-02-04 20:45:26 WARNING 02-04 04:45:26 config.py:428] awq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
2025-02-04 20:45:26 WARNING 02-04 04:45:26 arg_utils.py:1013] Chunked prefill is enabled by default for models with max_model_len > 32K. Currently, chunked prefill might not work with some features or models. If you encounter any issues, please disable chunked prefill by setting --enable-chunked-prefill=False.
2025-02-04 20:45:26 INFO 02-04 04:45:26 config.py:1136] Chunked prefill is enabled with max_num_batched_tokens=512.
2025-02-04 20:45:26 2025-02-04 04:45:26,599 xinference.core.worker 45 ERROR Failed to load model deepseek-r1-distill-qwen-14b-awq-0
2025-02-04 20:45:26 Traceback (most recent call last):
2025-02-04 20:45:26 File "/usr/local/lib/python3.10/dist-packages/xinference/core/worker.py", line 908, in launch_builtin_model
2025-02-04 20:45:26 await model_ref.load()
2025-02-04 20:45:26 File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/context.py", line 231, in send
2025-02-04 20:45:26 return self._process_result_message(result)
2025-02-04 20:45:26 File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/context.py", line 102, in _process_result_message
2025-02-04 20:45:26 raise message.as_instanceof_cause()
2025-02-04 20:45:26 File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/pool.py", line 667, in send
2025-02-04 20:45:26 result = await self._run_coro(message.message_id, coro)
2025-02-04 20:45:26 File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/pool.py", line 370, in _run_coro
2025-02-04 20:45:26 return await coro
2025-02-04 20:45:26 File "/usr/local/lib/python3.10/dist-packages/xoscar/api.py", line 384, in on_receive
2025-02-04 20:45:26 return await super().on_receive(message) # type: ignore
2025-02-04 20:45:26 File "xoscar/core.pyx", line 558, in on_receive
2025-02-04 20:45:26 raise ex
2025-02-04 20:45:26 File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.on_receive
2025-02-04 20:45:26 async with self._lock:
2025-02-04 20:45:26 File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.on_receive
2025-02-04 20:45:26 with debug_async_timeout('actor_lock_timeout',
2025-02-04 20:45:26 File "xoscar/core.pyx", line 526, in xoscar.core._BaseActor.on_receive
2025-02-04 20:45:26 result = await result
2025-02-04 20:45:26 File "/usr/local/lib/python3.10/dist-packages/xinference/core/model.py", line 457, in load
2025-02-04 20:45:26 self._model.load()
2025-02-04 20:45:26 File "/usr/local/lib/python3.10/dist-packages/xinference/model/llm/vllm/core.py", line 304, in load
2025-02-04 20:45:26 self._engine = AsyncLLMEngine.from_engine_args(engine_args)
2025-02-04 20:45:26 File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 683, in from_engine_args
2025-02-04 20:45:26 engine_config = engine_args.create_engine_config()
2025-02-04 20:45:26 File "/usr/local/lib/python3.10/dist-packages/vllm/engine/arg_utils.py", line 1143, in create_engine_config
2025-02-04 20:45:26 return VllmConfig(
2025-02-04 20:45:26 File "
```
Expected behavior / 期待表现
The models start successfully.
I'm running into the same problem.
Same problem here.
My problem is partially solved: AWQ and GPTQ quantized models launched with the vllm engine will start as long as I set the parameter dtype=float16. My GPU is a 3090 Ti; I don't know why this parameter is required, perhaps these int4 quantized models don't support bfloat16 or float32. But loading the GGUF model with the llama engine still fails, with the error gguf_init_from_file: invalid magic characters '', Failed to load model from file: /data/DeepSeek-R1-Distill-Qwen-14B-GGUF.
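In Xinference client terms, a minimal sketch of this workaround is shown below, assuming extra keyword arguments such as dtype are forwarded to the vLLM engine; the endpoint URL is an assumption.

```python
# Minimal sketch of the workaround above: force dtype=float16 when launching
# the AWQ model on the vLLM engine. Assumes extra kwargs are passed through to
# vLLM; the endpoint URL is an assumption.
from xinference.client import Client

client = Client("http://localhost:9997")
model_uid = client.launch_model(
    model_name="deepseek-r1-distill-qwen-14b-awq",
    model_engine="vllm",
    model_format="awq",
    model_size_in_billions=14,
    quantization="Int4",
    dtype="float16",  # int4 AWQ/GPTQ kernels typically expect float16 activations
)
```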
Solved loading DeepSeek-R1-Distill-Llama-8B-GGUF: for GGUF models loaded with the llama engine, the model path must end with the model file name.
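Concretely, per that fix, the spec from the original report would change roughly as in the sketch below; the exact `.gguf` file name is hypothetical and must match the file that is actually present under `/data`.

```python
# Hedged sketch of the corrected GGUF spec: the model path must point at the
# actual .gguf file rather than its parent directory. The file name below is
# hypothetical; use the name of the file on disk.
corrected_gguf_spec = {
    "model_format": "ggufv2",
    "model_size_in_billions": 14,
    "quantizations": ["Q6_K"],
    "model_file_name_template": "DeepSeek-R1-Distill-Qwen-14B-Q6_K.gguf",  # hypothetical
    "model_uri": "/data/DeepSeek-R1-Distill-Qwen-14B-Q6_K.gguf",  # hypothetical
}
```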
> Solved loading DeepSeek-R1-Distill-Llama-8B-GGUF: for GGUF models loaded with the llama engine, the model path must end with the model file name.

Is the llama engine faster than the transformers engine?
> Is the llama engine faster than the transformers engine?

Yes, it's quite fast, roughly on par with awq and gptq.
> For GGUF models loaded with the llama engine, the model path must end with the model file name.

I appended the model file name for DeepSeek-R1-Distill-qwen-7B-q4km GGUF, but it still fails.
> I appended the model file name for DeepSeek-R1-Distill-qwen-7B-q4km GGUF, but it still fails.

Could it be that a config file is missing?
Has this been resolved? I tried registering a qwq32b GGUF and ran into the same problem.