
[Bug] RuntimeError: CUDA error: an illegal memory access was encountered

Open github-eliviate opened this issue 9 months ago • 14 comments

Checklist

  • [x] 1. I have searched related issues but cannot get the expected help.
  • [x] 2. The bug has not been fixed in the latest version.
  • [ ] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.

Describe the bug

RuntimeError: CUDA error: an illegal memory access was encountered. I have found that this error tends to occur when several users call the API concurrently; the client code is given under Reproduction. Runtime environment (nvidia-smi reports GPU driver 550.144.03; CUDA Version: 12.4):

Package                           Version       Editable project location
--------------------------------- ------------- -------------------------
accelerate                        1.4.0
addict                            2.4.0
aiohappyeyeballs                  2.4.6
aiohttp                           3.11.12
aiohttp-cors                      0.7.0
aiosignal                         1.3.2
airportsdata                      20241001
annotated-types                   0.7.0
anthropic                         0.46.0
anyio                             4.8.0
astor                             0.8.1
asttokens                         3.0.0
async-timeout                     5.0.1
attrs                             25.1.0
baidu-aip                         4.16.13
bcrypt                            4.3.0
beautifulsoup4                    4.13.3
bitsandbytes                      0.45.3
blake3                            1.0.4
blinker                           1.9.0
Brotli                            1.1.0
cachetools                        5.5.2
certifi                           2025.1.31
cffi                              1.17.1
cfgv                              3.4.0
chardet                           5.2.0
charset-normalizer                3.4.1
click                             8.1.8
cloudpickle                       3.1.1
colorful                          0.5.6
colossalai                        0.4.9
compressed-tensors                0.9.1
contexttimer                      0.3.3
contourpy                         1.3.1
cryptography                      44.0.2
cuda-bindings                     12.8.0
cuda-python                       12.8.0
cycler                            0.12.1
datasets                          3.3.2
decorator                         5.1.1
decord                            0.6.0
deepspeed                         0.15.4
Deprecated                        1.2.18
depyf                             0.18.0
diffusers                         0.29.0
dill                              0.3.8
diskcache                         5.6.3
distlib                           0.3.9
distro                            1.9.0
docstring_parser                  0.16
einops                            0.8.1
exceptiongroup                    1.2.2
executing                         2.2.0
fabric                            3.2.2
fastapi                           0.115.8
filelock                          3.17.0
fire                              0.7.0
flashinfer-python                 0.2.1.post2
Flask                             3.1.0
Flask-Cors                        5.0.0
fonttools                         4.56.0
frozenlist                        1.5.0
fsspec                            2024.12.0
galore-torch                      1.0
gevent                            24.11.1
gguf                              0.10.0
gmpy2                             2.1.5
google                            3.0.0
google-api-core                   2.24.1
google-auth                       2.38.0
googleapis-common-protos          1.68.0
greenlet                          3.1.1
grpcio                            1.70.0
h11                               0.14.0
h2                                4.2.0
hf_transfer                       0.1.9
hjson                             3.1.0
hpack                             4.1.0
httpcore                          1.0.7
httptools                         0.6.4
httpx                             0.28.1
huggingface-hub                   0.29.1
hyperframe                        6.1.0
identify                          2.6.9
idna                              3.10
importlib_metadata                8.6.1
iniconfig                         2.0.0
interegular                       0.3.3
invoke                            2.2.0
ipdb                              0.13.13
ipython                           8.32.0
itsdangerous                      2.2.0
jedi                              0.19.2
Jinja2                            3.1.5
jiter                             0.8.2
jsonschema                        4.23.0
jsonschema-specifications         2024.10.1
kiwisolver                        1.4.8
lark                              1.2.2
litellm                           1.61.13
llvmlite                          0.44.0
lm-format-enforcer                0.10.10
lmdeploy                          0.6.5         /disk2/elivate/lmdeploy
loguru                            0.7.3
markdown-it-py                    3.0.0
MarkupSafe                        3.0.2
matplotlib                        3.10.0
matplotlib-inline                 0.1.7
mdurl                             0.1.2
mistral_common                    1.5.3
mmengine-lite                     0.10.6
modelscope                        1.23.1
mpmath                            1.3.0
msgpack                           1.1.0
msgspec                           0.19.0
multidict                         6.1.0
multiprocess                      0.70.16
nest-asyncio                      1.6.0
networkx                          3.4.2
ninja                             1.11.1.3
nodeenv                           1.9.1
numba                             0.61.0
numpy                             1.26.4
nvidia-cublas-cu12                12.4.5.8
nvidia-cuda-cupti-cu12            12.4.127
nvidia-cuda-nvrtc-cu12            12.4.127
nvidia-cuda-runtime-cu12          12.4.127
nvidia-cudnn-cu12                 9.1.0.70
nvidia-cufft-cu12                 11.2.1.3
nvidia-curand-cu12                10.3.5.147
nvidia-cusolver-cu12              11.6.1.9
nvidia-cusparse-cu12              12.3.1.170
nvidia-ml-py                      12.570.86
nvidia-nccl-cu12                  2.21.5
nvidia-nvjitlink-cu12             12.4.127
nvidia-nvtx-cu12                  12.4.127
openai                            1.63.2
opencensus                        0.11.4
opencensus-context                0.1.3
opencv-python-headless            4.11.0.86
orjson                            3.10.15
outlines                          0.1.11
outlines_core                     0.1.26
packaging                         24.2
pandas                            2.2.3
paramiko                          3.5.1
parso                             0.8.4
partial-json-parser               0.2.1.1.post5
peft                              0.11.1
pexpect                           4.9.0
pillow                            11.1.0
pip                               25.0.1
platformdirs                      4.3.6
pluggy                            1.5.0
plumbum                           1.9.0
pre_commit                        4.1.0
prometheus_client                 0.21.1
prometheus-fastapi-instrumentator 7.0.2
prompt_toolkit                    3.0.50
propcache                         0.3.0
proto-plus                        1.26.0
protobuf                          5.29.3
psutil                            7.0.0
ptyprocess                        0.7.0
pure_eval                         0.2.3
py-cpuinfo                        9.0.0
py-spy                            0.4.0
pyairports                        2.1.1
pyarrow                           19.0.1
pyasn1                            0.6.1
pyasn1_modules                    0.4.1
pybind11                          2.13.6
pycountry                         24.6.1
pycparser                         2.22
pydantic                          2.10.6
pydantic_core                     2.27.2
Pygments                          2.19.1
PyNaCl                            1.5.0
pynvml                            12.0.0
pyparsing                         3.2.1
PySocks                           1.7.1
pytest                            8.3.4
python-dateutil                   2.9.0.post0
python-dotenv                     1.0.1
python-multipart                  0.0.20
pytz                              2025.1
PyYAML                            6.0.2
pyzmq                             26.2.1
ray                               2.42.1
referencing                       0.36.2
regex                             2024.11.6
requests                          2.32.3
rich                              13.9.4
rpds-py                           0.23.1
rpyc                              6.0.0
rsa                               4.9
safetensors                       0.4.5
sentencepiece                     0.2.0
setproctitle                      1.3.4
setuptools                        75.8.0
sgl-kernel                        0.0.3.post6
shortuuid                         1.0.13
shtab                             1.7.1
six                               1.17.0
smart-open                        7.1.0
sniffio                           1.3.1
soupsieve                         2.6
stack-data                        0.6.3
starlette                         0.45.3
sympy                             1.13.1
termcolor                         2.5.0
tiktoken                          0.9.0
tokenizers                        0.20.3
tomli                             2.2.1
torch                             2.5.1
torchao                           0.8.0
torchaudio                        2.5.1
torchvision                       0.20.1
tqdm                              4.67.1
traitlets                         5.14.3
transformers                      4.46.3
triton                            3.0.0
trl                               0.8.6
typeguard                         4.4.2
typing_extensions                 4.12.2
tyro                              0.9.16
tzdata                            2025.1
urllib3                           2.3.0
uvicorn                           0.29.0
uvloop                            0.21.0
virtualenv                        20.29.2
vllm                              0.7.2
watchfiles                        1.0.4
wcwidth                           0.2.13
websockets                        15.0
Werkzeug                          3.1.3
wheel                             0.45.1
wrapt                             1.17.2
xformers                          0.0.28.post3
xgrammar                          0.1.10
xxhash                            3.5.0
yapf                              0.43.0
yarl                              1.18.3
zipp                              3.21.0
zope.event                        5.0
zope.interface                    7.2
zstandard                         0.19.0

Reproduction

Server launch command:

lmdeploy serve api_server /disk2/elivate/DeepSeek/DeepSeek-R1 --tp 8 --backend pytorch --chat-template deepseek \
        --cache-max-entry-count 0.5 --server-name 0.0.0.0 --server-port 23333
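
To help isolate whether the crash depends on the HTTP server at all, the same PyTorch-engine options can be exercised offline. This is a hedged sketch, not part of the original report; it assumes the standard lmdeploy Python API (pipeline, PytorchEngineConfig, ChatTemplateConfig) and simply mirrors the flags above:

# Hedged sketch (not from the original report): offline inference with the same
# engine options as the serve command, to check whether api_server is required
# to trigger the crash.
from lmdeploy import ChatTemplateConfig, PytorchEngineConfig, pipeline

pipe = pipeline(
    '/disk2/elivate/DeepSeek/DeepSeek-R1',
    backend_config=PytorchEngineConfig(tp=8, cache_max_entry_count=0.5),
    chat_template_config=ChatTemplateConfig(model_name='deepseek'),
)

# A small concurrent batch, since the issue is reported under multi-user load.
prompts = ['请介绍一下你自己。'] * 8
for resp in pipe(prompts):
    print(resp.text)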

Client code for calling the API:

#coding:utf-8
from openai import OpenAI
import httpx

# If the client does not run on the same machine as the server, replace 0.0.0.0
# with the server's IP (and 23333 with the port you configured, if different).
client = OpenAI(
    api_key='YOUR_API_KEY',
    base_url="http://0.0.0.0:23333/v1",
    http_client=httpx.Client(verify=False)
)
model_name = client.models.list().data[0].id

# Prompt prefix that forces the model to show its reasoning, starting with "<think>\n\n嗯".
input_format = '''
任何输出都要有思考过程,输出内容必须以 "<think>\n\n嗯" 开头。仔细揣摩用户意图,之后提供逻辑清晰且内容完整的回答,可以使用Markdown格式优化信息呈现。\n\n

{}'''

# Whether to force deep thinking (the prompt forces the model to emit its reasoning).
use_think = True
input_str = '输入'  # placeholder user input
if use_think:
  cur_message = [{"role": "user", "content": input_format.format(input_str)}]
else:
  cur_message = [{"role": "user", "content": input_str}]

# # Streaming output (nicer visual effect)
# response = client.chat.completions.create(
#     model=model_name,
#     messages=cur_message,
#     temperature=0.8,
#     top_p=0.8,
#     stream=True,
#   )
# text = ''
# for chunk in response:
#   # only accumulate non-empty deltas
#   if chunk.choices[0].delta.content:
#       text += chunk.choices[0].delta.content
#       print(chunk.choices[0].delta.content, end='')  # end='' keeps the stream on one line
# # print the full reply
# print('\n', text)

# Non-streaming output
response = client.chat.completions.create(
  model=model_name,
  messages=cur_message,
  temperature=0.6,
  top_p=0.8,
)
# print the reply
print(response.choices[0].message.content)
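
Since the error reportedly shows up when several users hit the API at the same time, a small multi-threaded driver can approximate that load from a single machine. This is a hedged sketch, not from the original report; it reuses the endpoint and sampling parameters above:

# Hedged sketch: issue many concurrent requests to mimic multi-user load.
from concurrent.futures import ThreadPoolExecutor
import httpx
from openai import OpenAI

client = OpenAI(
    api_key='YOUR_API_KEY',
    base_url="http://0.0.0.0:23333/v1",
    http_client=httpx.Client(verify=False)
)
model_name = client.models.list().data[0].id

def one_request(i):
    resp = client.chat.completions.create(
        model=model_name,
        messages=[{"role": "user", "content": f"测试请求 {i}"}],
        temperature=0.6,
        top_p=0.8,
    )
    return resp.choices[0].message.content

# 16 concurrent threads, 100 requests in total.
with ThreadPoolExecutor(max_workers=16) as pool:
    for text in pool.map(one_request, range(100)):
        print(text[:80])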

Environment

/disk2/eliviate/lmdeploy/lmdeploy/cli/entrypoint.py
sys.platform: linux
Python: 3.10.16 | packaged by conda-forge | (main, Dec  5 2024, 14:16:10) [GCC 13.3.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 2147483648
GPU 0,1,2,3,4,5,6,7: NVIDIA H200
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.4, V12.4.131
GCC: gcc (Ubuntu 12.3.0-1ubuntu1~22.04) 12.3.0
PyTorch: 2.5.1+cu124
PyTorch compiling details: PyTorch built with:
  - GCC 9.3
  - C++ Version: 201703
  - Intel(R) oneAPI Math Kernel Library Version 2024.2-Product Build 20240605 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v3.5.3 (Git Hash 66f0cb9eb66affd2da3bf5f8d897376f04aae6af)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX512
  - CUDA Runtime 12.4
  - NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90
  - CuDNN 90.1
  - Magma 2.6.1
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=12.4, CUDNN_VERSION=9.1.0, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DLIBKINETO_NOXPUPTI=ON -DUSE_FBGEMM -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, TORCH_VERSION=2.5.1, USE_CUDA=ON, USE_CUDNN=ON, USE_CUSPARSELT=1, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_GLOO=ON, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, USE_ROCM_KERNEL_ASSERT=OFF, 

TorchVision: 0.20.1+cu124
LMDeploy: 0.6.5+
transformers: 4.46.3
gradio: Not Found
fastapi: 0.115.8
pydantic: 2.10.6
triton: 3.0.0
NVIDIA Topology: 
	GPU0	GPU1	GPU2	GPU3	GPU4	GPU5	GPU6	GPU7	NIC0	NIC1	NIC2	NIC3	NIC4	NIC5	NIC6	NIC7	NIC8	NIC9	CPU Affinity	NUMA Affinity	GPU NUMA ID
GPU0	 X 	NV18	NV18	NV18	NV18	NV18	NV18	NV18	NODE	NODE	PIX	PIX	NODE	NODE	SYS	SYS	SYS	SYS	0-47,96-143	0		N/A
GPU1	NV18	 X 	NV18	NV18	NV18	NV18	NV18	NV18	NODE	NODE	PIX	PIX	NODE	NODE	SYS	SYS	SYS	SYS	0-47,96-143	0		N/A
GPU2	NV18	NV18	 X 	NV18	NV18	NV18	NV18	NV18	NODE	NODE	NODE	NODE	PIX	PIX	SYS	SYS	SYS	SYS	0-47,96-143	0		N/A
GPU3	NV18	NV18	NV18	 X 	NV18	NV18	NV18	NV18	NODE	NODE	NODE	NODE	PIX	PIX	SYS	SYS	SYS	SYS	0-47,96-143	0		N/A
GPU4	NV18	NV18	NV18	NV18	 X 	NV18	NV18	NV18	SYS	SYS	SYS	SYS	SYS	SYS	PIX	PIX	NODE	NODE	48-95,144-191	1		N/A
GPU5	NV18	NV18	NV18	NV18	NV18	 X 	NV18	NV18	SYS	SYS	SYS	SYS	SYS	SYS	PIX	PIX	NODE	NODE	48-95,144-191	1		N/A
GPU6	NV18	NV18	NV18	NV18	NV18	NV18	 X 	NV18	SYS	SYS	SYS	SYS	SYS	SYS	NODE	NODE	PIX	PIX	48-95,144-191	1		N/A
GPU7	NV18	NV18	NV18	NV18	NV18	NV18	NV18	 X 	SYS	SYS	SYS	SYS	SYS	SYS	NODE	NODE	PIX	PIX	48-95,144-191	1		N/A
NIC0	NODE	NODE	NODE	NODE	SYS	SYS	SYS	SYS	 X 	PIX	NODE	NODE	NODE	NODE	SYS	SYS	SYS	SYS				
NIC1	NODE	NODE	NODE	NODE	SYS	SYS	SYS	SYS	PIX	 X 	NODE	NODE	NODE	NODE	SYS	SYS	SYS	SYS				
NIC2	PIX	PIX	NODE	NODE	SYS	SYS	SYS	SYS	NODE	NODE	 X 	PIX	NODE	NODE	SYS	SYS	SYS	SYS				
NIC3	PIX	PIX	NODE	NODE	SYS	SYS	SYS	SYS	NODE	NODE	PIX	 X 	NODE	NODE	SYS	SYS	SYS	SYS				
NIC4	NODE	NODE	PIX	PIX	SYS	SYS	SYS	SYS	NODE	NODE	NODE	NODE	 X 	PIX	SYS	SYS	SYS	SYS				
NIC5	NODE	NODE	PIX	PIX	SYS	SYS	SYS	SYS	NODE	NODE	NODE	NODE	PIX	 X 	SYS	SYS	SYS	SYS				
NIC6	SYS	SYS	SYS	SYS	PIX	PIX	NODE	NODE	SYS	SYS	SYS	SYS	SYS	SYS	 X 	PIX	NODE	NODE				
NIC7	SYS	SYS	SYS	SYS	PIX	PIX	NODE	NODE	SYS	SYS	SYS	SYS	SYS	SYS	PIX	 X 	NODE	NODE				
NIC8	SYS	SYS	SYS	SYS	NODE	NODE	PIX	PIX	SYS	SYS	SYS	SYS	SYS	SYS	NODE	NODE	 X 	PIX				
NIC9	SYS	SYS	SYS	SYS	NODE	NODE	PIX	PIX	SYS	SYS	SYS	SYS	SYS	SYS	NODE	NODE	PIX	 X 				

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1
  NIC2: mlx5_2
  NIC3: mlx5_3
  NIC4: mlx5_4
  NIC5: mlx5_5
  NIC6: mlx5_6
  NIC7: mlx5_7
  NIC8: mlx5_8
  NIC9: mlx5_9

Error traceback

Error output:

Traceback (most recent call last):
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

  File "/disk2/eliviate/lmdeploy/lmdeploy/pytorch/engine/model_agent.py", line 482, in _start_tp_process
    func(rank, *args, **kwargs)
  File "/disk2/eliviate/lmdeploy/lmdeploy/pytorch/engine/model_agent.py", line 439, in _tp_model_loop
    model_forward(
  File "/disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/disk2/eliviate/lmdeploy/lmdeploy/pytorch/engine/model_agent.py", line 156, in model_forward
    output = model(**input_dict)
  File "/disk2/eliviate/lmdeploy/lmdeploy/pytorch/backends/cuda/graph_runner.py", line 149, in __call__
    return self.model(**kwargs)
  File "/disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/disk2/eliviate/lmdeploy/lmdeploy/pytorch/models/deepseek_v2.py", line 702, in forward
    hidden_states = self.model(
  File "/disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/disk2/eliviate/lmdeploy/lmdeploy/pytorch/models/deepseek_v2.py", line 654, in forward
    hidden_states, residual = decoder_layer(
  File "/disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/disk2/eliviate/lmdeploy/lmdeploy/pytorch/models/deepseek_v2.py", line 555, in forward
    hidden_states = self.self_attn(
  File "/disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/disk2/eliviate/lmdeploy/lmdeploy/pytorch/models/deepseek_v2.py", line 256, in forward
    query_states[..., nope_size:] = q_pe
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

[rank6]:[E314 14:33:46.584848437 ProcessGroupNCCL.cpp:1595] [PG ID 0 PG GUID 0(default_pg) Rank 6] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f0f0116c446 in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f0f011166e4 in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f0f01534a18 in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7f0eb7025726 in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x7f0eb702a3f0 in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x7f0eb7031b5a in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f0eb703361d in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0x145c0 (0x7f0f01c4a5c0 in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #8: <unknown function> + 0x94ac3 (0x7f0f02370ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: <unknown function> + 0x126850 (0x7f0f02402850 in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
[rank4]:[E314 14:33:46.585238329 ProcessGroupNCCL.cpp:1595] [PG ID 0 PG GUID 0(default_pg) Rank 4] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f5e2fb6c446 in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f5e2fb166e4 in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f5e2ffd4a18 in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7f5de5a25726 in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x7f5de5a2a3f0 in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x7f5de5a31b5a in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f5de5a3361d in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0x145c0 (0x7f5e306ea5c0 in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #8: <unknown function> + 0x94ac3 (0x7f5e30e10ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: <unknown function> + 0x126850 (0x7f5e30ea2850 in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG ID 0 PG GUID 0(default_pg) Rank 6] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f0f0116c446 in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f0f011166e4 in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f0f01534a18 in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7f0eb7025726 in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x7f0eb702a3f0 in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x7f0eb7031b5a in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f0eb703361d in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0x145c0 (0x7f0f01c4a5c0 in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #8: <unknown function> + 0x94ac3 (0x7f0f02370ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: <unknown function> + 0x126850 (0x7f0f02402850 in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f0f0116c446 in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe4271b (0x7f0eb6ca071b in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0x145c0 (0x7f0f01c4a5c0 in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #3: <unknown function> + 0x94ac3 (0x7f0f02370ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: <unknown function> + 0x126850 (0x7f0f02402850 in /lib/x86_64-linux-gnu/libc.so.6)

  what():  [PG ID 0 PG GUID 0(default_pg) Rank 4] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f5e2fb6c446 in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f5e2fb166e4 in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f5e2ffd4a18 in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7f5de5a25726 in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x7f5de5a2a3f0 in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x7f5de5a31b5a in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f5de5a3361d in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0x145c0 (0x7f5e306ea5c0 in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #8: <unknown function> + 0x94ac3 (0x7f5e30e10ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: <unknown function> + 0x126850 (0x7f5e30ea2850 in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f5e2fb6c446 in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe4271b (0x7f5de56a071b in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0x145c0 (0x7f5e306ea5c0 in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #3: <unknown function> + 0x94ac3 (0x7f5e30e10ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: <unknown function> + 0x126850 (0x7f5e30ea2850 in /lib/x86_64-linux-gnu/libc.so.6)

[rank2]:[E314 14:33:46.588302933 ProcessGroupNCCL.cpp:1595] [PG ID 0 PG GUID 0(default_pg) Rank 2] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f4c560b9446 in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f4c560636e4 in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f4c561a5a18 in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7f4c0c025726 in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x7f4c0c02a3f0 in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x7f4c0c031b5a in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f4c0c03361d in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0x145c0 (0x7f4c56b785c0 in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #8: <unknown function> + 0x94ac3 (0x7f4c5729eac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: <unknown function> + 0x126850 (0x7f4c57330850 in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
[rank5]:[E314 14:33:46.588801693 ProcessGroupNCCL.cpp:1595] [PG ID 0 PG GUID 0(default_pg) Rank 5] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f581a0b9446 in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f581a0636e4 in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f581a1a5a18 in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7f57d0025726 in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x7f57d002a3f0 in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x7f57d0031b5a in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f57d003361d in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0x145c0 (0x7f581abcf5c0 in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #8: <unknown function> + 0x94ac3 (0x7f581b2f5ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: <unknown function> + 0x126850 (0x7f581b387850 in /lib/x86_64-linux-gnu/libc.so.6)

[rank3]:[E314 14:33:46.588877826 ProcessGroupNCCL.cpp:1595] [PG ID 0 PG GUID 0(default_pg) Rank 3] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7ff69c16c446 in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7ff69c1166e4 in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7ff69c544a18 in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7ff652025726 in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x7ff65202a3f0 in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x7ff652031b5a in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7ff65203361d in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0x145c0 (0x7ff69cc5a5c0 in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #8: <unknown function> + 0x94ac3 (0x7ff69d380ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: <unknown function> + 0x126850 (0x7ff69d412850 in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
terminate called after throwing an instance of 'c10::DistBackendError'
[rank1]:[E314 14:33:46.589305831 ProcessGroupNCCL.cpp:1595] [PG ID 0 PG GUID 0(default_pg) Rank 1] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fd56276c446 in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7fd5627166e4 in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7fd562b73a18 in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7fd518625726 in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x7fd51862a3f0 in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x7fd518631b5a in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7fd51863361d in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0x145c0 (0x7fd5632895c0 in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #8: <unknown function> + 0x94ac3 (0x7fd5639afac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: <unknown function> + 0x126850 (0x7fd563a41850 in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
[rank7]:[E314 14:33:46.589658386 ProcessGroupNCCL.cpp:1595] [PG ID 0 PG GUID 0(default_pg) Rank 7] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fa19536c446 in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7fa1953166e4 in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7fa19578ca18 in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7fa14b225726 in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x7fa14b22a3f0 in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x7fa14b231b5a in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7fa14b23361d in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0x145c0 (0x7fa195ea25c0 in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #8: <unknown function> + 0x94ac3 (0x7fa1965c8ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: <unknown function> + 0x126850 (0x7fa19665a850 in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG ID 0 PG GUID 0(default_pg) Rank 2] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f4c560b9446 in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f4c560636e4 in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f4c561a5a18 in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7f4c0c025726 in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x7f4c0c02a3f0 in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x7f4c0c031b5a in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f4c0c03361d in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0x145c0 (0x7f4c56b785c0 in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #8: <unknown function> + 0x94ac3 (0x7f4c5729eac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: <unknown function> + 0x126850 (0x7f4c57330850 in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f4c560b9446 in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe4271b (0x7f4c0bca071b in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0x145c0 (0x7f4c56b785c0 in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #3: <unknown function> + 0x94ac3 (0x7f4c5729eac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: <unknown function> + 0x126850 (0x7f4c57330850 in /lib/x86_64-linux-gnu/libc.so.6)

  what():  [PG ID 0 PG GUID 0(default_pg) Rank 5] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f581a0b9446 in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f581a0636e4 in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f581a1a5a18 in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7f57d0025726 in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x7f57d002a3f0 in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x7f57d0031b5a in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f57d003361d in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0x145c0 (0x7f581abcf5c0 in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #8: <unknown function> + 0x94ac3 (0x7f581b2f5ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: <unknown function> + 0x126850 (0x7f581b387850 in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f581a0b9446 in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe4271b (0x7f57cfca071b in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0x145c0 (0x7f581abcf5c0 in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #3: <unknown function> + 0x94ac3 (0x7f581b2f5ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: <unknown function> + 0x126850 (0x7f581b387850 in /lib/x86_64-linux-gnu/libc.so.6)
  what():  
[PG ID 0 PG GUID 0(default_pg) Rank 3] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7ff69c16c446 in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7ff69c1166e4 in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7ff69c544a18 in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7ff652025726 in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x7ff65202a3f0 in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x7ff652031b5a in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7ff65203361d in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0x145c0 (0x7ff69cc5a5c0 in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #8: <unknown function> + 0x94ac3 (0x7ff69d380ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: <unknown function> + 0x126850 (0x7ff69d412850 in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7ff69c16c446 in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe4271b (0x7ff651ca071b in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0x145c0 (0x7ff69cc5a5c0 in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #3: <unknown function> + 0x94ac3 (0x7ff69d380ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: <unknown function> + 0x126850 (0x7ff69d412850 in /lib/x86_64-linux-gnu/libc.so.6)

  what():  [PG ID 0 PG GUID 0(default_pg) Rank 1] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fd56276c446 in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7fd5627166e4 in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7fd562b73a18 in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7fd518625726 in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x7fd51862a3f0 in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x7fd518631b5a in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7fd51863361d in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0x145c0 (0x7fd5632895c0 in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #8: <unknown function> + 0x94ac3 (0x7fd5639afac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: <unknown function> + 0x126850 (0x7fd563a41850 in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fd56276c446 in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe4271b (0x7fd5182a071b in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0x145c0 (0x7fd5632895c0 in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #3: <unknown function> + 0x94ac3 (0x7fd5639afac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: <unknown function> + 0x126850 (0x7fd563a41850 in /lib/x86_64-linux-gnu/libc.so.6)
  what():  
[PG ID 0 PG GUID 0(default_pg) Rank 7] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fa19536c446 in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7fa1953166e4 in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7fa19578ca18 in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7fa14b225726 in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x7fa14b22a3f0 in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x7fa14b231b5a in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7fa14b23361d in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0x145c0 (0x7fa195ea25c0 in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #8: <unknown function> + 0x94ac3 (0x7fa1965c8ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: <unknown function> + 0x126850 (0x7fa19665a850 in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fa19536c446 in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe4271b (0x7fa14aea071b in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0x145c0 (0x7fa195ea25c0 in /disk2/condaenvs/deepseek/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #3: <unknown function> + 0x94ac3 (0x7fa1965c8ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: <unknown function> + 0x126850 (0x7fa19665a850 in /lib/x86_64-linux-gnu/libc.so.6)

/disk2/condaenvs/deepseek/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 4 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
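
For context on the line that faults: in the DeepSeek-V2 MLA attention, each query head is split into a "nope" part and a rotary ("pe") part, and deepseek_v2.py writes the rotary part back into the tail of the query tensor. Below is a toy sketch of that slice write with illustrative shapes only (not the model's real sizes); as the log itself notes, the asynchronous CUDA error may actually originate in an earlier kernel.

import torch

# Illustrative sizes only; the real model uses its own nope/rope head dims.
nope_size, rope_size = 128, 64
q = torch.zeros(4, 16, nope_size + rope_size, device='cuda')   # (tokens, heads, head_dim)
q_pe = torch.randn(4, 16, rope_size, device='cuda')

# Same in-place tail write as the failing line in deepseek_v2.py; an illegal
# memory access here typically means one of the operands already references
# corrupted or out-of-range device memory produced by an earlier kernel.
q[..., nope_size:] = q_pe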

github-eliviate avatar Mar 14 '25 08:03 github-eliviate

You could try our latest main branch; it contains a fix for a boundary-condition check in a MoE kernel.

grimoire avatar Mar 14 '25 10:03 grimoire

@github-eliviate @grimoire Has this issue been resolved? I installed the latest lmdeploy==0.7.2 today and ran into the same problem when serving Qwen2.5-VL-72B.

zhyxun avatar Mar 20 '25 06:03 zhyxun

@zhyxun Please provide the reproduction steps and your environment information.

grimoire avatar Mar 20 '25 08:03 grimoire

@grimoire

Environment information

NVIDIA-SMI 535.161.08 Driver Version: 535.161.08 CUDA Version: 12.4

Python version: Python 3.10.12

Package Version


accelerate                        0.33.0
addict                            2.4.0
aiofiles                          23.2.1
aiohappyeyeballs                  2.4.0
aiohttp                           3.10.5
aiosignal                         1.3.1
airportsdata                      20250224
annotated-types                   0.7.0
anyio                             4.4.0
argcomplete                       3.6.0
async-timeout                     4.0.3
attrs                             24.2.0
av                                14.2.0
certifi                           2024.7.4
cfgv                              3.4.0
charset-normalizer                3.3.2
click                             8.1.7
cloudpickle                       3.1.1
cmake                             3.30.2
contourpy                         1.2.1
cycler                            0.12.1
datasets                          2.21.0
decord                            0.6.0
dill                              0.3.8
diskcache                         5.6.3
distlib                           0.3.9
distro                            1.9.0
einops                            0.8.0
exceptiongroup                    1.2.2
fastapi                           0.112.2
ffmpy                             0.4.0
filelock                          3.13.1
fire                              0.6.0
flash-attn                        2.6.3
fonttools                         4.53.1
frozenlist                        1.4.1
fsspec                            2024.2.0
genson                            1.3.0
gradio                            4.42.0
gradio_client                     1.3.0
grpcio                            1.66.0
h11                               0.14.0
httpcore                          1.0.5
httpx                             0.27.0
huggingface-hub                   0.29.3
identify                          2.6.9
idna                              3.8
importlib_metadata                8.4.0
importlib_resources               6.4.4
interegular                       0.3.3
iso3166                           2.1.1
Jinja2                            3.1.3
jiter                             0.5.0
jsonschema                        4.23.0
jsonschema-specifications         2024.10.1
kiwisolver                        1.4.5
lark                              1.2.2
lmdeploy                          0.7.2
markdown-it-py                    3.0.0
MarkupSafe                        2.1.5
matplotlib                        3.9.2
mdurl                             0.1.2
mmengine-lite                     0.10.4
mpmath                            1.3.0
msgpack                           1.1.0
multidict                         6.0.5
multiprocess                      0.70.16
nest-asyncio                      1.6.0
networkx                          3.2.1
nodeenv                           1.9.1
numpy                             1.26.3
nvidia-cublas-cu12                12.4.5.8
nvidia-cuda-cupti-cu12            12.4.127
nvidia-cuda-nvrtc-cu12            12.4.127
nvidia-cuda-runtime-cu12          12.4.127
nvidia-cudnn-cu12                 9.1.0.70
nvidia-cufft-cu12                 11.2.1.3
nvidia-curand-cu12                10.3.5.147
nvidia-cusolver-cu12              11.6.1.9
nvidia-cusparse-cu12              12.3.1.170
nvidia-nccl-cu12                  2.21.5
nvidia-nvjitlink-cu12             12.4.127
nvidia-nvtx-cu12                  12.4.127
openai                            1.42.0
orjson                            3.10.7
outlines                          0.2.1
outlines_core                     0.1.26
packaging                         24.1
pandas                            2.2.2
partial-json-parser               0.2.1.1.post5
peft                              0.11.1
pillow                            10.2.0
pip                               25.0.1
pipx                              1.7.1
platformdirs                      4.2.2
pre_commit                        4.2.0
protobuf                          4.25.4
psutil                            6.0.0
pyarrow                           17.0.0
pybind11                          2.13.1
pydantic                          2.8.2
pydantic_core                     2.20.1
pydub                             0.25.1
Pygments                          2.18.0
pynvml                            11.5.3
pyparsing                         3.1.4
python-dateutil                   2.9.0.post0
python-multipart                  0.0.9
python-rapidjson                  1.20
pytz                              2024.1
PyYAML                            6.0.2
qwen-vl-utils                     0.0.8
ray                               2.43.0
referencing                       0.36.2
regex                             2024.7.24
requests                          2.32.3
rich                              13.7.1
rpds-py                           0.23.1
ruff                              0.6.2
safetensors                       0.4.4
semantic-version                  2.10.0
sentencepiece                     0.2.0
setuptools                        69.5.1
shellingham                       1.5.4
shortuuid                         1.0.13
six                               1.16.0
sniffio                           1.3.1
starlette                         0.38.2
sympy                             1.13.1
termcolor                         2.4.0
tiktoken                          0.7.0
timm                              1.0.9
tokenizers                        0.21.1
tomli                             2.0.1
tomlkit                           0.12.0
torch                             2.5.1
torchvision                       0.20.1
tqdm                              4.66.5
transformers                      4.49.0
transformers-stream-generator     0.0.5
triton                            3.1.0
tritonclient                      2.48.0
typer                             0.12.5
typing_extensions                 4.12.2
tzdata                            2024.1
urllib3                           2.2.2
userpath                          1.9.2
uvicorn                           0.30.6
virtualenv                        20.29.3
websockets                        12.0
wheel                             0.44.0
xxhash                            3.5.0
yapf                              0.40.2
yarl                              1.9.4
zipp                              3.20.0

Reproduction

Code:

from torch.utils.data import Dataset
from openai import OpenAI
import base64
import json
from torch.utils.data.dataloader import DataLoader

def caption_collate_fn(file_meta_batch):
    return file_meta_batch


class QA_dataset(Dataset):

    def __init__(
        self,
        base_url: str = None,
    ) -> None:
        
        self.image_paths = ["xxx.jpg"] * 10000  # can be set to any single image
        
        self.prompt = "请描述这张图片"  # i.e. "Please describe this image"
        self.client = OpenAI(api_key='AABBCCDD', base_url=base_url)
        self.server_model_name = self.client.models.list().data[0].id
        print("LLM/VLM Model:", self.server_model_name)

    def load_bytes_from_image(self, image_path):
        with open(image_path, "rb") as image_file:
            encoded_string = base64.b64encode(image_file.read())

        return encoded_string.decode("utf-8")  # convert the bytes to a str

    def __len__(self) -> int:
        """Get length of current self.files."""
        return len(self.image_paths)

    def __getitem__(self, index):
        """Get item."""
        image_path = self.image_paths[index]

        try:
            base64_data = self.load_bytes_from_image(image_path)
        except Exception:
            return None

        try:
            response = self.client.chat.completions.create(
                model=self.server_model_name,
                messages=[{
                    'role': 'user',
                    'content': [
                        {
                            'type': 'text',
                            'text': self.prompt,
                        },
                        {
                            'type': 'image_url',
                            'image_url': {
                                'url':
                                f"data:image/jpeg;base64,{base64_data}",
                            },
                        }
                    ],
                }],
                temperature=0.8,
                top_p=0.95,
            )

            # parse the response
            response_txt = response.choices[0].message.content
            print(response_txt)
            message = {
                "conversations": response_txt,
                "image_path": image_path,
            }
        except Exception:
            return None

        return message



def main():
    ## load dataset
    dataset = QA_dataset(
                    base_url="http://172.24.208.140:23333/v1",
    )

    dataloader = DataLoader(
            dataset,
            batch_size=64,
            shuffle=False,
            num_workers=64,
            collate_fn=caption_collate_fn,
            prefetch_factor=64,
            drop_last=False,
        )

    with open("./output.jsonl", "w") as fout:
        for batch_id, meta_data in enumerate(dataloader):
            for sample in meta_data:
                if sample is not None:
                    fout.write(json.dumps(sample, ensure_ascii=False) + "\n")

if __name__ == "__main__":
    main()

Launch command: lmdeploy serve api_server ./pretrained_vlm/Qwen2.5-VL-72B-Instruct --server-port 23333 --tp 4 --cache-max-entry-count 0.4

zhyxun avatar Mar 21 '25 05:03 zhyxun
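
Since the original report says the crash tends to show up when several users call the interface at the same time, a smaller probe that drives the same endpoint with an explicit, adjustable level of concurrency can help check whether the failure correlates with the number of in-flight requests. This is only a sketch, not part of the original reproduction: BASE_URL, IMAGE_PATH, CONCURRENCY and TOTAL_REQUESTS are placeholders to adjust, and it assumes an OpenAI-compatible lmdeploy api_server is already running.

import base64
from concurrent.futures import ThreadPoolExecutor, as_completed

from openai import OpenAI

BASE_URL = "http://172.24.208.140:23333/v1"  # same endpoint as the script above
IMAGE_PATH = "xxx.jpg"                       # any image that triggers the issue
CONCURRENCY = 8                              # raise step by step (8 -> 16 -> 32 ...)
TOTAL_REQUESTS = 200

client = OpenAI(api_key="AABBCCDD", base_url=BASE_URL)
model_name = client.models.list().data[0].id

with open(IMAGE_PATH, "rb") as f:
    b64 = base64.b64encode(f.read()).decode("utf-8")


def one_request(_: int) -> int:
    # One chat completion with the same image, mirroring the dataset code above.
    response = client.chat.completions.create(
        model=model_name,
        messages=[{
            'role': 'user',
            'content': [
                {'type': 'text', 'text': "请描述这张图片"},
                {'type': 'image_url',
                 'image_url': {'url': f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
        temperature=0.8,
        top_p=0.95,
    )
    return len(response.choices[0].message.content or "")


with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    futures = [pool.submit(one_request, i) for i in range(TOTAL_REQUESTS)]
    for future in as_completed(futures):
        print("response length:", future.result())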

@zhyxun I can't reproduce it here. Could you provide an image that is confirmed to trigger the issue? Also, roughly how much data do you need to run before the error occurs?

grimoire avatar Mar 21 '25 07:03 grimoire

@grimoire How about an image like this one? It usually errors out after running for about five or six minutes, sometimes even sooner, within two or three minutes. [image attachment] Does it run fine on your side?

zhyxun avatar Mar 21 '25 09:03 zhyxun

@zhyxun Please give https://github.com/InternLM/lmdeploy/pull/3307 a try.

grimoire avatar Mar 23 '25 07:03 grimoire

@github-eliviate @grimoire Has this issue been resolved? I installed the latest lmdeploy==0.7.2 today and ran into the same problem when calling Qwen2.5-VL-72B.

Since installing it, that error has not appeared again, but when our developers call the API, the server reports a different error:

lmdeploy - ERROR - async_engine.py:592 - [safe_run] exception caught: GeneratorExit

However, this error does not cause the server to exit.

github-eliviate avatar Mar 24 '25 07:03 github-eliviate

lmdeploy - ERROR - async_engine.py:592 - [safe_run] exception caught: GeneratorExit

This is usually a request/connection-related error and has little to do with the engine. @AllentDan could you help take a look?

grimoire avatar Mar 24 '25 07:03 grimoire

lmdeploy - ERROR - async_engine.py:592 - [safe_run] exception caught: GeneratorExit

This is usually produced when the client interrupts the request; it does not affect the service.

AllentDan avatar Mar 24 '25 08:03 AllentDan
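
For context on how such an interruption can happen without anyone cancelling a request by hand: the OpenAI Python client enforces a request timeout, and if it fires while the server is still generating, the connection is dropped from the client side, which is the kind of interruption described above. A hedged illustration, with a deliberately short timeout so the effect is easy to trigger; the endpoint and prompt are placeholders:

from openai import OpenAI, APITimeoutError

client = OpenAI(
    api_key="AABBCCDD",
    base_url="http://172.24.208.140:23333/v1",
    timeout=5.0,  # seconds; a long generation will be cut off on the client side
)

try:
    response = client.chat.completions.create(
        model=client.models.list().data[0].id,
        messages=[{"role": "user", "content": "Write a very long article about model serving."}],
    )
    print(response.choices[0].message.content)
except APITimeoutError:
    # The client gave up and closed the connection; the server itself keeps running.
    print("client-side timeout, request aborted")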

@zhyxun Please give #3307 a try.

Thank you so much, this approach solved my problem.

zhyxun avatar Mar 25 '25 12:03 zhyxun

lmdeploy - ERROR - async_engine.py:592 - [safe_run] exception caught: GeneratorExit

This is usually produced when the client interrupts the request; it does not affect the service.

It indeed does not affect the service. Our developers report that with the same code, a response sometimes comes back and sometimes does not; the only difference is that the submitted content differs, and there is no case where a request was actively interrupted.

github-eliviate avatar Mar 28 '25 02:03 github-eliviate

I ran into this problem as well, when using Qwen2.5-VL 32B.

natsunoshion avatar Apr 09 '25 14:04 natsunoshion

@github-eliviate @natsunoshion Are you using the proxy feature? If the connection times out, the request can be interrupted.

the submitted content differs

If the content differs, could you paste an example?

AllentDan avatar Apr 14 '25 02:04 AllentDan
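
If a proxy or simply a long multimodal generation is causing the client to give up first, one client-side mitigation to try is raising the request timeout, optionally with a separate, shorter connect timeout. A minimal sketch, assuming the standard OpenAI Python SDK; the values are placeholders:

import httpx
from openai import OpenAI

client = OpenAI(
    api_key="AABBCCDD",
    base_url="http://172.24.208.140:23333/v1",
    # 600 s overall budget for slow generations, but fail fast if the server
    # (or an intermediate proxy) cannot even be reached within 10 s.
    timeout=httpx.Timeout(600.0, connect=10.0),
)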