Problems deploying DeepSeek distilled and quantized models with Xinference

System Info

xinference 1.4.1, llama_cpp_python 0.3.8, vllm 0.7.2, CUDA 12.4. The operating system is Linux.

Running Xinference with Docker?

[ ] docker
[x] pip install
[ ] installation from source

Version info

I tried both 1.4.1 and 1.3.1.post1.

The command used to start Xinference

1. xinference launch --model-engine llama.cpp --model-name deepseek-r1 --size-in-billions 671 --model-format ggufv2 --quantization UD-IQ1_M --n-gpu 4
2. Deployed by selecting the model in the web UI.

Also, how do I pass extra vLLM parameters on the command line?

Reproduction

1. Deploying the 1.73-bit quantized DeepSeek model fails: with llama.cpp the model is not loaded onto the GPUs and ends up on the CPU instead: xinference launch --model-engine llama.cpp --model-name deepseek-r1 --size-in-billions 671 --model-format ggufv2 --quantization UD-IQ1_M --n-gpu 4

How can I fix this? I have already installed the NVIDIA build of llama.cpp by following the documented steps: CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python
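One thing worth checking here is whether the installed llama-cpp-python wheel was actually built with GPU support. As far as I know, recent llama-cpp-python releases enable CUDA with -DGGML_CUDA=on, and the older -DLLAMA_CUBLAS=on option may be silently ignored, which would produce a CPU-only build. The sketch below is a minimal check and not part of Xinference. The helper name gpu_offload_status is hypothetical, and it assumes the installed binding exposes llama_supports_gpu_offload.

```python
import importlib


def gpu_offload_status():
    """Report whether the installed llama-cpp-python build can offload to GPU.

    Returns True or False when the check is available, and None when
    llama_cpp cannot be imported or does not expose the check.
    """
    try:
        llama_cpp = importlib.import_module("llama_cpp")
    except (ImportError, OSError, RuntimeError):
        # Package missing, or its shared library failed to load.
        return None

    # Assumption: the binding re-exports llama.cpp's llama_supports_gpu_offload.
    check = getattr(llama_cpp, "llama_supports_gpu_offload", None)
    if check is None:
        return None
    return bool(check())


if __name__ == "__main__":
    status = gpu_offload_status()
    if status is None:
        print("llama_cpp not importable, or the GPU check is unavailable")
    elif status:
        print("This llama-cpp-python build supports GPU offload")
    else:
        print("CPU-only build: reinstall with the CUDA CMake flag enabled")
```

If this reports a CPU-only build, a forced rebuild with the current CUDA flag (and pip's --no-cache-dir so a cached CPU wheel is not reused) is probably the first thing to try.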

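2. Deploying the distilled DeepSeek model (Llama-70B) fails: vLLM on its own can serve deepseek-r1-distill-llama-70B, but Xinference cannot, even with the same configuration as vLLM. My configuration is as follows: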

Distributed-related output keeps appearing:

I hope someone experienced can help. Thank you very much!

Expected behavior

Both models above deploy normally.

Apr 16 '25 02:04 Longleaves

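It feels like Xinference's model support has been poor ever since deepseek-r1 was released: models either don't work at all or run painfully slowly, as if purely on the CPU. I have switched back to ollama again. Its features are limited, but at least every deployment just works, without all these quirks.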

Apr 16 '25 13:04 sunisstar

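Judging by the error message, I had a similar problem when using sglang. My fix for a distributed launch was to set the environment variable VLLM_HOST_IP to the first worker's IP on the first worker's node, and then start the worker. That should do it.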
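The workaround above can be sketched roughly as follows. This is only an illustration: the helper name worker_launch_spec and the IP addresses are hypothetical placeholders, and the -e / -H flags for xinference-worker are assumed from the distributed deployment docs, so please verify them against your installed version.

```python
import os


def worker_launch_spec(worker_ip, supervisor_endpoint, base_env=None):
    """Build the command and environment for starting a worker.

    VLLM_HOST_IP is set to the worker's own IP before the worker starts,
    as suggested in the comment above.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["VLLM_HOST_IP"] = worker_ip

    # Assumed CLI shape: xinference-worker -e <supervisor endpoint> -H <worker host>
    cmd = ["xinference-worker", "-e", supervisor_endpoint, "-H", worker_ip]
    return cmd, env


if __name__ == "__main__":
    # Placeholder addresses from the documentation range; replace with real ones.
    cmd, env = worker_launch_spec("192.0.2.10", "http://192.0.2.1:9997")
    print("VLLM_HOST_IP=" + env["VLLM_HOST_IP"], " ".join(cmd))
    # To actually start the worker:
    # import subprocess; subprocess.run(cmd, env=env, check=True)
```

The equivalent in a shell is to export VLLM_HOST_IP on the first worker's node before running the worker command.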

Apr 22 '25 12:04 opopnhwth

This issue is stale because it has been open for 7 days with no activity.

Apr 29 '25 19:04 github-actions[bot]

This issue was closed because it has been inactive for 5 days since being marked as stale.

May 04 '25 19:05 github-actions[bot]