
InternVL3_5-30B-A3B deployment fails

Open CHNCZL opened this issue 3 months ago • 3 comments

I am trying to deploy the InternVL3_5-30B-A3B model with the lmdeploy framework on a server with five RTX 4090D GPUs, running the following commands in order:

  1. docker run -itd --name imdeploy --runtime nvidia --gpus '"device=1,2,3,4"' -e NCCL_P2P_DISABLE=1 -e NCCL_IB_DISABLE=1 -v /llm/models:/models -p 23333:23333 --ipc=host openmmlab/lmdeploy:latest-cu12
  2. docker exec -it imdeploy bash
  3. lmdeploy serve api_server /models/InternVL3_5-30B-A3B --backend pytorch --tp 4

The third command fails with the following output:

2025-09-01 09:20:38,450 ERROR worker.py:429 -- Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::RayWorkerWrapper.warmup_dist() (pid=614, ip=10.11.0.5, actor_id=6543c39add6fefe972f1dad601000000, repr=<lmdeploy.pytorch.engine.executor.ray_executor.RayWorkerWrapper object at 0x7fd83643d1e0>)
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 451, in result
    return self.__get_result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/opt/py3/lib/python3.10/site-packages/lmdeploy/pytorch/engine/executor/ray_executor.py", line 206, in warmup_dist
    all_reduce(tmp)
  File "/opt/py3/lib/python3.10/site-packages/lmdeploy/pytorch/distributed.py", line 248, in all_reduce
    return dist.all_reduce(tensor, op, group, async_op)
  File "/opt/py3/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
    return func(*args, **kwargs)
  File "/opt/py3/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2810, in all_reduce
    work = group.allreduce([tensor], opts)
torch.distributed.DistBackendError: NCCL error in: /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:3356, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.26.2
ncclUnhandledCudaError: Call to CUDA function failed. Last error: Cuda failure 217 'peer access is not supported between these two devices'

GPT suggested the failure is caused by P2P communication not being supported between the cards, but even after setting the relevant variables inside the container, the same error occurs on the next run:

export NCCL_P2P_DISABLE=1
export NCCL_IGNORE_DISABLED_P2P=1
export NCCL_IB_DISABLE=1
export NCCL_NET=Socket
export NCCL_SHM_DISABLE=0
export NCCL_P2P_LEVEL=SYS
export NCCL_NVLS_ENABLE=0
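A minimal sketch of an alternative worth trying: instead of exporting the variables in an `exec` shell after the container is up, pass every NCCL override with `-e` at `docker run` time, so the server process and any Ray workers it spawns inherit them from the start. This reuses the image, mount, and port from the commands above; `NCCL_DEBUG=INFO` is added only because the error message itself suggests it for diagnostics.

```shell
# Sketch (not a confirmed fix): supply all NCCL overrides at container start
# so every process inside the container, including Ray workers, sees them.
docker run -itd --name imdeploy --runtime nvidia --gpus '"device=1,2,3,4"' \
  -e NCCL_P2P_DISABLE=1 \
  -e NCCL_IGNORE_DISABLED_P2P=1 \
  -e NCCL_IB_DISABLE=1 \
  -e NCCL_NET=Socket \
  -e NCCL_P2P_LEVEL=SYS \
  -e NCCL_NVLS_ENABLE=0 \
  -e NCCL_DEBUG=INFO \
  -v /llm/models:/models -p 23333:23333 --ipc=host \
  openmmlab/lmdeploy:latest-cu12
```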

I'm at my wits' end. Has anyone run into the same problem?

CHNCZL avatar Sep 01 '25 10:09 CHNCZL

Environment

Please run lmdeploy check_env to collect necessary environment information and paste it here. You may add addition that may be helpful for locating the problem, such as Which model are you using? How you installed PyTorch [e.g., pip, conda, source] Other environment variables that may be related (such as $PATH, $LD_LIBRARY_PATH, $PYTHONPATH, etc.)

grimoire avatar Sep 01 '25 12:09 grimoire


The environment is listed below:

sys.platform: linux
Python: 3.10.12 (main, Aug 15 2025, 14:32:43) [GCC 11.4.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 2147483648
GPU 0,1,2,3: NVIDIA GeForce RTX 4090 D
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.4, V12.4.131
GCC: x86_64-linux-gnu-gcc (Ubuntu 11.4.0-1ubuntu1~22.04.2) 11.4.0
PyTorch: 2.7.1+cu126
PyTorch compiling details: PyTorch built with:

  • GCC 11.2
  • C++ Version: 201703
  • Intel(R) oneAPI Math Kernel Library Version 2024.2-Product Build 20240605 for Intel(R) 64 architecture applications
  • Intel(R) MKL-DNN v3.7.1 (Git Hash 8d263e693366ef8db40acc569cc7d8edf644556d)
  • OpenMP 201511 (a.k.a. OpenMP 4.5)
  • LAPACK is enabled (usually provided by MKL)
  • NNPACK is enabled
  • CPU capability usage: AVX512
  • CUDA Runtime 12.6
  • NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90
  • CuDNN 9.5.1
  • Magma 2.6.1
  • Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, COMMIT_SHA=e2d141dbde55c2a4370fac5165b0561b6af4798b, CUDA_VERSION=12.6, CUDNN_VERSION=9.5.1, CXX_COMPILER=/opt/rh/gcc-toolset-11/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=1 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DLIBKINETO_NOXPUPTI=ON -DUSE_FBGEMM -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=range-loop-construct -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-unknown-pragmas -Wno-unused-parameter -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, TORCH_VERSION=2.7.1, USE_CUDA=ON, USE_CUDNN=ON, USE_CUSPARSELT=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_GLOO=ON, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, USE_ROCM_KERNEL_ASSERT=OFF,

TorchVision: 0.22.1+cu126
LMDeploy: 0.9.2+
transformers: 4.53.3
fastapi: 0.116.1
pydantic: 2.11.7
triton: 3.3.1

NVIDIA Topology:

          GPU0  GPU1  GPU2  GPU3  CPU Affinity   NUMA Affinity  GPU NUMA ID
    GPU0   X    PIX   SYS   SYS   0-27,56-83     0              N/A
    GPU1  PIX    X    SYS   SYS   0-27,56-83     0              N/A
    GPU2  SYS   SYS    X    PIX   28-55,84-111   1              N/A
    GPU3  SYS   SYS   PIX    X    28-55,84-111   1              N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
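The topology above shows no NVLink between any GPU pair, which is consistent with the 'Cuda failure 217 (peer access is not supported)' in the traceback. A minimal sketch for confirming this directly, using PyTorch's `torch.cuda.can_device_access_peer` (the exact output depends on the machine it runs on, so no result is asserted here):

```python
# Sketch: probe CUDA peer (P2P) access between every ordered GPU pair.
# On consumer cards such as the RTX 4090 D, peer access is typically
# unsupported, which would match the NCCL error in this thread.
import torch

def p2p_matrix():
    """Return {(i, j): bool} for all ordered GPU pairs; {} if < 2 GPUs."""
    if not torch.cuda.is_available() or torch.cuda.device_count() < 2:
        return {}
    n = torch.cuda.device_count()
    return {(i, j): torch.cuda.can_device_access_peer(i, j)
            for i in range(n) for j in range(n) if i != j}

matrix = p2p_matrix()
if not matrix:
    print("fewer than 2 CUDA devices; cannot probe P2P here")
for (i, j), ok in sorted(matrix.items()):
    print(f"GPU{i} -> GPU{j}: peer access {'supported' if ok else 'NOT supported'}")
```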

CHNCZL avatar Sep 01 '25 14:09 CHNCZL

Please update your LMDeploy to v0.10.0+ for InternVL3_5.

kerlion avatar Sep 19 '25 05:09 kerlion