BUG: NCCL error:
Deployed with the v0.12.0 Docker image; the start command is:

sudo docker run -d -v /home/tskj/MOD/:/home/MOD/ -e XINFERENCE_HOME=/home/MOD -p 9997:9997 --gpus all xprobe/xinference:v0.12.0 xinference-local -H 0.0.0.0 --log-level debug

I have 8 GPUs; after selecting 8, the model fails with the following error (some log lines were truncated when pasting):

2024-06-12 05:08:11,767 xinference.core.worker 95 DEBUG Enter launch_builtin_model, args: (<xinference.core.worker.WorkerActor object at 0x7f013c4aa700>,), kwargs: {'model_uid': 'gpt-3.5-turbo-1-0', 'model_name': 'Qwen1.5-110B-Chat', 'model_size_in_billions': 110, 'model_format': 'pytorch', 'quantization': 'none', 'model_engine': 'vLLM', 'model_type': 'LLM', 'n_gpu': 8, 'request_limits': None, 'peft_model_config': None, 'gpu_idx': None, 'gpu_memory_utilization': 0.9, 'max_model_len': 32768}
2024-06-12 05:08:11,767 xinference.core.worker 95 DEBUG GPU selected: [0, 1, 2, 3, 4, 5, 6, 7] for model gpt-3.5-turbo-1-0
2024-06-12 05:08:15,436 xinference.model.llm.core 95 DEBUG Launching gpt-3.5-turbo-1-0 with VLLMChatModel
2024-06-12 05:08:15,437 xinference.model.llm.llm_family 95 INFO Caching from URI: /home/MOD/Qwen/Qwen1.5-110B-Chat
2024-06-12 05:08:15,437 xinference.model.llm.llm_family 95 INFO Cache /home/MOD/Qwen/Qwen1.5-110B-Chat exists
2024-06-12 05:08:15,461 xinference.model.llm.vllm.core 210 INFO Loading gpt-3.5-turbo with following model config: {'gpu_memory_utilization': 0.9, 'max_model_len': 32768, 'tokenizer_mode': 'auto', 'trust_remote_code': True, 'tensor_parallel_size': 8, 'block_size': 16, 'swap_space': 4, 'max_num_seqs': 256, 'quantization': None}Enable lora: False. Lora count: 0.
2024-06-12 05:08:17,542 WARNING services.py:2009 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 67067904 bytes available. This will harm performance! You may be able to free up space by deleting files in /dev/shm. If you are inside a Docker container, you can increase /dev/shm size by passing '--shm-size=10.24gb' to 'docker run' (or add it to the run_options list in a Ray cluster config). Make sure to set this to more than 30% of available RAM.
2024-06-12 05:08:18,666 INFO worker.py:1753 -- Started a local Ray instance.
INFO 06-12 05:08:20 llm_engine.py:161] Initializing an LLM engine (v0.4.3) with config: model='/home/MOD/Qwen/Qwen1.5-110B-Chat', speculative_config=None, tokenizer='/home/MOD/Qwen/Qwen1.5-110B-Chat', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=8, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=/home/MOD/Qwen/Qwen1.5-110B-Chat)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 06-12 05:08:45 utils.py:618] Found nccl from library libnccl.so.2
INFO 06-12 05:08:45 pynccl.py:65] vLLM is using nccl==2.20.5
(RayWorkerWrapper pid=6318) INFO 06-12 05:08:45 utils.py:618] Found nccl from library libnccl.so.2
(RayWorkerWrapper pid=6318) INFO 06-12 05:08:45 pynccl.py:65] vLLM is using nccl==2.20.5
ERROR 06-12 05:08:46 worker_base.py:148] Error executing method init_device. This might cause deadlock in distributed execution.
ERROR 06-12 05:08:46 worker_base.py:148] Traceback (most recent call last):
ERROR 06-12 05:08:46 worker_base.py:148]   File "/opt/conda/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 140, in execute_method
ERROR 06-12 05:08:46 worker_base.py:148]     return executor(*args, **kwargs)
ERROR 06-12 05:08:46 worker_base.py:148]   File "/opt/conda/lib/python3.10/site-packages/vllm/worker/worker.py", line 114, in init_device
ERROR 06-12 05:08:46 worker_base.py:148]     init_worker_distributed_environment(self.parallel_config, self.rank,
ERROR 06-12 05:08:46 worker_base.py:148]   File "/opt/conda/lib/python3.10/site-packages/vllm/worker/worker.py", line 349, in init_worker_distributed_environment
ERROR 06-12 05:08:46 worker_base.py:148]     ensure_model_parallel_initialized(parallel_config.tensor_parallel_size,
ERROR 06-12 05:08:46 worker_base.py:148]   File "/opt/conda/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 239, in ensure_model_parallel_initialized
ERROR 06-12 05:08:46 worker_base.py:148]     initialize_model_parallel(tensor_model_parallel_size,
ERROR 06-12 05:08:46 worker_base.py:148]   File "/opt/conda/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 191, in initialize_model_parallel
ERROR 06-12 05:08:46 worker_base.py:148]     _TP_PYNCCL_COMMUNICATOR = PyNcclCommunicator(
ERROR 06-12 05:08:46 worker_base.py:148]   File "/opt/conda/lib/python3.10/site-packages/vllm/distributed/device_communicators/pynccl.py", line 94, in __init__
ERROR 06-12 05:08:46 worker_base.py:148]     self.comm: ncclComm_t = self.nccl.ncclCommInitRank(
ERROR 06-12 05:08:46 worker_base.py:148]   File "/opt/conda/lib/python3.10/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 244, in ncclCommInitRank
ERROR 06-12 05:08:46 worker_base.py:148]     self.NCCL_CHECK(self._funcs["ncclCommInitRank"](ctypes.byref(comm),
ERROR 06-12 05:08:46 worker_base.py:148]   File "/opt/conda/lib/python3.10/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 223, in NCCL_CHECK
ERROR 06-12 05:08:46 worker_base.py:148]     raise RuntimeError(f"NCCL error: {error_str}")
ERROR 06-12 05:08:46 worker_base.py:148] RuntimeError: NCCL error: unhandled system error (run with NCCL_DEBUG=INFO for details)
(RayWorkerWrapper pid=6318) ERROR 06-12 05:08:46 worker_base.py:148] Error executing method init_device. This might cause deadlock in distributed execution
(RayWorkerWrapper pid=6318) ERROR 06-12 05:08:46 worker_base.py:148] Traceback (most recent call last):
(RayWorkerWrapper pid=6318) ERROR 06-12 05:08:46 worker_base.py:148]   File "/opt/conda/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 140, in execute_method
(RayWorkerWrapper pid=6318) ERROR 06-12 05:08:46 worker_base.py:148]     return executor(*args, **kwargs)
(RayWorkerWrapper pid=6318) ERROR 06-12 05:08:46 worker_base.py:148]   File "/opt/conda/lib/python3.10/site-packages/vllm/worker/worker.py", line 114, in init_device
(RayWorkerWrapper pid=6318) ERROR 06-12 05:08:46 worker_base.py:148]     init_worker_distributed_environment(self.parallel_config, self.rank,
(RayWorkerWrapper pid=6318) ERROR 06-12 05:08:46 worker_base.py:148]   File "/opt/conda/lib/python3.10/site-packages/vllm/worker/worker.py", line 349, in init_worker_distributed_environment
(RayWorkerWrapper pid=6318) ERROR 06-12 05:08:46 worker_base.py:148]     ensure_model_parallel_initialized(parallel_config.tensor_parallel_size,
(RayWorkerWrapper pid=6318) ERROR 06-12 05:08:46 worker_base.py:148]   File "/opt/conda/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 239, in ensure_model_parallel_initialized
(RayWorkerWrapper pid=6318) ERROR 06-12 05:08:46 worker_base.py:148]     initialize_model_parallel(tensor_model_parallel_size,
(RayWorkerWrapper pid=6318) ERROR 06-12 05:08:46 worker_base.py:148]   File "/opt/conda/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 191, in initialize_model_parallel
(RayWorkerWrapper pid=6318) ERROR 06-12 05:08:46 worker_base.py:148]     _TP_PYNCCL_COMMUNICATOR = PyNcclCommunicator(
(RayWorkerWrapper pid=6318) ERROR 06-12 05:08:46 worker_base.py:148]   File "/opt/conda/lib/python3.10/site-packages/vllm/distributed/device_communicators/pynccl.py", line 94, in __init__
(RayWorkerWrapper pid=6318) ERROR 06-12 05:08:46 worker_base.py:148]     self.comm: ncclComm_t = self.nccl.ncclCommInitRank(
(RayWorkerWrapper pid=6318) ERROR 06-12 05:08:46 worker_base.py:148]   File "/opt/conda/lib/python3.10/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 244, in ncclCommInitRank
(RayWorkerWrapper pid=6318) ERROR 06-12 05:08:46 worker_base.py:148]     self.NCCL_CHECK(self._funcs["ncclCommInitRank"](ctypes.byref(comm),
(RayWorkerWrapper pid=6318) ERROR 06-12 05:08:46 worker_base.py:148]   File "/opt/conda/lib/python3.10/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 223, in NCCL_CHECK
(RayWorkerWrapper pid=6318) ERROR 06-12 05:08:46 worker_base.py:148]     raise RuntimeError(f"NCCL error: {error_str}")
(RayWorkerWrapper pid=6318) ERROR 06-12 05:08:46 worker_base.py:148] RuntimeError: NCCL error: unhandled system error (run with NCCL_DEBUG=INFO for details)
NCCL on the host server is as follows:

(chatchat) yqga@dnb:~$ dpkg -l | grep nccl
ii  libnccl-dev                                  2.21.5-1+cuda12.5   amd64   NVIDIA Collective Communication Library (NCCL) Development Files
ii  libnccl2                                     2.21.5-1+cuda12.5   amd64   NVIDIA Collective Communication Library (NCCL) Runtime
ii  nccl-local-repo-ubuntu2204-2.21.5-cuda12.5   1.0-1               amd64   nccl-local repository configuration files
The NVIDIA driver/GPU configuration is shown in the attached screenshot.
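The error above ends with a pointer to rerun with NCCL_DEBUG=INFO. A minimal, purely illustrative sketch of how that could be added to the same start command (it only reuses the image, paths, and options already shown above, plus the standard NCCL_DEBUG environment variable):

# Same deployment, with NCCL debug logging enabled so the underlying
# system error (e.g. a failed /dev/shm allocation) is printed in full.
sudo docker run -d \
  -v /home/tskj/MOD/:/home/MOD/ \
  -e XINFERENCE_HOME=/home/MOD \
  -e NCCL_DEBUG=INFO \
  -p 9997:9997 \
  --gpus all \
  xprobe/xinference:v0.12.0 \
  xinference-local -H 0.0.0.0 --log-level debug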
+1
This is really urgent. The product is great and I want to use it. Please reply!!!
+1. Earlier versions could deploy with vLLM, but the current ones can't; I also get an NCCL error. vLLM works on a single GPU, but as soon as I select multiple GPUs the NCCL error appears.
The Transformers engine deploys fine (both single-GPU and multi-GPU).
# Environment info:
root@41be57132056:/workspace# pip list | grep nccl
nvidia-nccl-cu12          2.20.5
root@41be57132056:/workspace# pip list | grep torch
torch                     2.3.0
torchaudio                2.3.0
torchelastic              0.2.2
torchvision               0.18.0
vector-quantize-pytorch   1.14.24
root@41be57132056:/workspace#
(base) root@bm-2203ajt:~# dpkg -l | grep nccl
ii  libnccl-dev   2.19.3-1+cuda12.3   amd64   NVIDIA Collective Communication Library (NCCL) Development Files
ii  libnccl2      2.19.3-1+cuda12.3   amd64   NVIDIA Collective Communication Library (NCCL) Runtime
(base) root@bm-2203ajt:~#
# Error:
2024-06-19 01:57:01,903 xinference.api.restful_api 1 ERROR [address=0.0.0.0:41855, pid=35109] NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1970, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.20.5
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. Last error:
Error while creating shared memory segment /dev/shm/nccl-jaXv5Z (size 5767520)
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/xinference/api/restful_api.py", line 770, in launch_model
    model_uid = await (await self._get_supervisor_ref()).launch_builtin_model(
  File "/opt/conda/lib/python3.10/site-packages/xoscar/backends/context.py", line 227, in send
    return self._process_result_message(result)
  File "/opt/conda/lib/python3.10/site-packages/xoscar/backends/context.py", line 102, in _process_result_message
    raise message.as_instanceof_cause()
  File "/opt/conda/lib/python3.10/site-packages/xoscar/backends/pool.py", line 659, in send
    result = await self._run_coro(message.message_id, coro)
  File "/opt/conda/lib/python3.10/site-packages/xoscar/backends/pool.py", line 370, in _run_coro
    return await coro
  File "/opt/conda/lib/python3.10/site-packages/xoscar/api.py", line 384, in on_receive
    return await super().on_receive(message)  # type: ignore
  File "xoscar/core.pyx", line 558, in on_receive
    raise ex
  File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.on_receive
    async with self._lock:
  File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.on_receive
    with debug_async_timeout('actor_lock_timeout',
  File "xoscar/core.pyx", line 526, in xoscar.core._BaseActor.on_receive
    result = await result
  File "/opt/conda/lib/python3.10/site-packages/xinference/core/supervisor.py", line 837, in launch_builtin_model
    await _launch_model()
  File "/opt/conda/lib/python3.10/site-packages/xinference/core/supervisor.py", line 801, in _launch_model
    await _launch_one_model(rep_model_uid)
  File "/opt/conda/lib/python3.10/site-packages/xinference/core/supervisor.py", line 782, in _launch_one_model
    await worker_ref.launch_builtin_model(
  File "xoscar/core.pyx", line 284, in __pyx_actor_method_wrapper
    async with lock:
  File "xoscar/core.pyx", line 287, in xoscar.core.__pyx_actor_method_wrapper
    result = await result
  File "/opt/conda/lib/python3.10/site-packages/xinference/core/utils.py", line 45, in wrapped
    ret = await func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/xinference/core/worker.py", line 665, in launch_builtin_model
    await model_ref.load()
  File "/opt/conda/lib/python3.10/site-packages/xoscar/backends/context.py", line 227, in send
    return self._process_result_message(result)
  File "/opt/conda/lib/python3.10/site-packages/xoscar/backends/context.py", line 102, in _process_result_message
    raise message.as_instanceof_cause()
  File "/opt/conda/lib/python3.10/site-packages/xoscar/backends/pool.py", line 659, in send
    result = await self._run_coro(message.message_id, coro)
  File "/opt/conda/lib/python3.10/site-packages/xoscar/backends/pool.py", line 370, in _run_coro
    return await coro
  File "/opt/conda/lib/python3.10/site-packages/xoscar/api.py", line 384, in on_receive
    return await super().on_receive(message)  # type: ignore
  File "xoscar/core.pyx", line 558, in on_receive
    raise ex
  File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.on_receive
    async with self._lock:
  File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.on_receive
    with debug_async_timeout('actor_lock_timeout',
  File "xoscar/core.pyx", line 526, in xoscar.core._BaseActor.on_receive
    result = await result
  File "/opt/conda/lib/python3.10/site-packages/xinference/core/model.py", line 277, in load
    self._model.load()
  File "/opt/conda/lib/python3.10/site-packages/xinference/model/llm/vllm/core.py", line 230, in load
    self._engine = AsyncLLMEngine.from_engine_args(engine_args)
  File "/opt/conda/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 386, in from_engine_args
    engine = cls(
  File "/opt/conda/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 340, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 462, in _init_engine
    return engine_class(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 222, in __init__
    self.model_executor = executor_class(
  File "/opt/conda/lib/python3.10/site-packages/vllm/executor/ray_gpu_executor.py", line 317, in __init__
    super().__init__(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/vllm/executor/distributed_gpu_executor.py", line 25, in __init__
    super().__init__(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 41, in __init__
    self._init_executor()
  File "/opt/conda/lib/python3.10/site-packages/vllm/executor/ray_gpu_executor.py", line 40, in _init_executor
    self._init_workers_ray(placement_group)
  File "/opt/conda/lib/python3.10/site-packages/vllm/executor/ray_gpu_executor.py", line 171, in _init_workers_ray
    self._run_workers("init_device")
  File "/opt/conda/lib/python3.10/site-packages/vllm/executor/ray_gpu_executor.py", line 246, in _run_workers
    driver_worker_output = self.driver_worker.execute_method(
  File "/opt/conda/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 149, in execute_method
    raise e
  File "/opt/conda/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 140, in execute_method
    return executor(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/vllm/worker/worker.py", line 114, in init_device
    init_worker_distributed_environment(self.parallel_config, self.rank,
  File "/opt/conda/lib/python3.10/site-packages/vllm/worker/worker.py", line 346, in init_worker_distributed_environment
    init_distributed_environment(parallel_config.world_size, rank,
  File "/opt/conda/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 122, in init_distributed_environment
    torch.distributed.all_reduce(data)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2219, in all_reduce
    work = group.allreduce([tensor], opts)
torch.distributed.DistBackendError: [address=0.0.0.0:41855, pid=35109] NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1970, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.20.5
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. Last error:
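The "Error while creating shared memory segment /dev/shm/nccl-..." line points at the container's /dev/shm size rather than the NCCL install itself. A quick, illustrative way to check it from the host (the container name here is a placeholder):

# How much shared memory does the running container actually have?
# Docker's default /dev/shm is 64MB, which is far too small for the
# per-rank shared-memory segments NCCL creates under tensor parallelism.
docker exec -it xinference df -h /dev/shm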
Same here. I want to move to the new version because it added function calling (fncall).
I think I found the cause: add the --shm-size 20g parameter when starting Docker, like this:
docker run -d \
  -e XINFERENCE_MODEL_SRC=modelscope \
  -v /data/xinference_llm/.xinference:/root/.xinference \
  -v /data/xinference_llm/.cache/huggingface:/root/.cache/huggingface \
  -v /data/xinference_llm/.cache/modelscope:/root/.cache/modelscope \
  -p 9997:9997 \
  --gpus all \
  --name xinference \
  --shm-size 20g \
  xprobe/xinference:v0.12.0 \
  xinference-local -H 0.0.0.0 --log-level debug
Environment:
root@0cf3d74a2d6f:/workspace# pip list | grep nccl
nvidia-nccl-cu12          2.20.5
root@0cf3d74a2d6f:/workspace# pip list | grep torch
torch                     2.3.0
torchaudio                2.3.0
torchelastic              0.2.2
torchvision               0.18.0
vector-quantize-pytorch   1.14.24
root@0cf3d74a2d6f:/workspace#
Startup log:
2024-06-19 06:18:56,050 xinference.core.worker 95 DEBUG Enter launch_builtin_model, args: (<xinference.core.worker.WorkerActor object at 0x7f0f0d780680>,), kwargs: {'model_uid': 'qwen2-chat-72b-1-0', 'model_name': 'qwen2-instruct', 'model_size_in_billions': 72, 'model_format': 'pytorch', 'quantization': 'none', 'model_engine': 'vLLM', 'model_type': 'LLM', 'n_gpu': 8, 'request_limits': None, 'peft_model_config': None, 'gpu_idx': None}
2024-06-19 06:18:56,050 xinference.core.worker 95 DEBUG GPU selected: [0, 1, 2, 3, 4, 5, 6, 7] for model qwen2-chat-72b-1-0
2024-06-19 06:19:07,229 xinference.model.llm.core 95 DEBUG Launching qwen2-chat-72b-1-0 with VLLMChatModel
2024-06-19 06:19:07,229 xinference.model.llm.llm_family 95 INFO Caching from Modelscope: qwen/Qwen2-72B-Instruct
2024-06-19 06:19:07,230 xinference.model.llm.llm_family 95 INFO Cache /root/.xinference/cache/qwen2-instruct-pytorch-72b exists
2024-06-19 06:19:07,251 xinference.model.llm.vllm.core 9209 INFO Loading qwen2-chat-72b with following model config: {'tokenizer_mode': 'auto', 'trust_remote_code': True, 'tensor_parallel_size': 8, 'block_size': 16, 'swap_space': 4, 'gpu_memory_utilization': 0.9, 'max_num_seqs': 256, 'quantization': None, 'max_model_len': 4096}Enable lora: False. Lora count: 0.
2024-06-19 06:19:10,272 INFO worker.py:1753 -- Started a local Ray instance.
INFO 06-19 06:19:11 llm_engine.py:161] Initializing an LLM engine (v0.4.3) with config: model='/root/.xinference/cache/qwen2-instruct-pytorch-72b', speculative_config=None, tokenizer='/root/.xinference/cache/qwen2-instruct-pytorch-72b', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=8, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=/root/.xinference/cache/qwen2-instruct-pytorch-72b)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 06-19 06:19:29 utils.py:618] Found nccl from library libnccl.so.2
(RayWorkerWrapper pid=16720) INFO 06-19 06:19:29 utils.py:618] Found nccl from library libnccl.so.2
(RayWorkerWrapper pid=16720) INFO 06-19 06:19:29 pynccl.py:65] vLLM is using nccl==2.20.5
INFO 06-19 06:19:29 pynccl.py:65] vLLM is using nccl==2.20.5
WARNING 06-19 06:19:29 custom_all_reduce.py:158] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(RayWorkerWrapper pid=16627) WARNING 06-19 06:19:29 custom_all_reduce.py:158] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
INFO 06-19 06:19:34 model_runner.py:146] Loading model weights took 16.9987 GB
(RayWorkerWrapper pid=16627) INFO 06-19 06:19:38 model_runner.py:146] Loading model weights took 16.9987 GB
(RayWorkerWrapper pid=16937) INFO 06-19 06:19:29 utils.py:618] Found nccl from library libnccl.so.2 [repeated 6x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
(RayWorkerWrapper pid=16937) INFO 06-19 06:19:29 pynccl.py:65] vLLM is using nccl==2.20.5 [repeated 6x across cluster]
(RayWorkerWrapper pid=17026) WARNING 06-19 06:19:29 custom_all_reduce.py:158] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly. [repeated 6x across cluster]
INFO 06-19 06:19:44 distributed_gpu_executor.py:56] # GPU blocks: 4316, # CPU blocks: 6553
(RayWorkerWrapper pid=17026) INFO 06-19 06:19:47 model_runner.py:854] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(RayWorkerWrapper pid=17026) INFO 06-19 06:19:47 model_runner.py:858] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing gpu_memory_utilization or enforcing eager mode. You can also reduce the max_num_seqs as needed to decrease memory usage.
(RayWorkerWrapper pid=16937) INFO 06-19 06:19:38 model_runner.py:146] Loading model weights took 16.9987 GB [repeated 6x across cluster]
INFO 06-19 06:19:47 model_runner.py:854] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 06-19 06:19:47 model_runner.py:858] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing gpu_memory_utilization or enforcing eager mode. You can also reduce the max_num_seqs as needed to decrease memory usage.
So it was a Docker shared-memory issue? I even asked GPT about it and thought the problem was the NCCL versions differing between the host and the container.
Thanks, it worked after adding this setting. Looks like Docker was the culprit.
Has anyone tried a distributed docker-compose deployment of xinference? How do you change the shm-size there?
Has anyone tried a multi-worker distributed deployment of xinference with docker-compose? How do you change the shm-size?
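In docker-compose, the per-service shm_size key plays the same role as docker run --shm-size. A minimal sketch, assuming the image, port, and command used earlier in the thread (the GPU reservation block requires a recent Compose; this is not a verified deployment file):

# docker-compose.yml (sketch)
services:
  xinference:
    image: xprobe/xinference:v0.12.0
    command: xinference-local -H 0.0.0.0 --log-level debug
    shm_size: "20gb"        # compose equivalent of docker run --shm-size 20g
    ports:
      - "9997:9997"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]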
Setting shm-size on the Docker container didn't help in my case. Launching across two GPUs still fails with the same kind of error:
docker run --name xinference -d --restart always -p 9997:9997 -e XINFERENCE_HOME=/data -v /D/docker/xinference/:/data --shm-size 30g --gpus all xprobe/xinference:v0.12.3 xinference-local -H 0.0.0.0
2024-06-28 21:57:38 2024-06-28 13:57:38,789 xinference.model.llm.vllm.core 65 INFO Loading Qwen1.5-14B-Chat-GPTQ-int4 with following model config: {'tokenizer_mode': 'auto', 'trust_remote_code': True, 'tensor_parallel_size': 2, 'block_size': 16, 'swap_space': 4, 'gpu_memory_utilization': 0.9, 'max_num_seqs': 256, 'quantization': None, 'max_model_len': 4096}Enable lora: False. Lora count: 0.
2024-06-28 21:57:41 2024-06-28 13:57:41,801 INFO worker.py:1771 -- Started a local Ray instance.
2024-06-28 21:57:42 Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
2024-06-28 21:58:17 Error in sys.excepthook:
2024-06-28 21:58:17
2024-06-28 21:58:17 Original exception was:
2024-06-28 21:58:18 Error in sys.excepthook:
2024-06-28 21:58:18
2024-06-28 21:58:18 Original exception was:
2024-06-28 21:58:19 [rank0]:[W CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
2024-06-28 21:58:34 Error in sys.excepthook:
2024-06-28 21:58:34
2024-06-28 21:58:34 Original exception was:
2024-06-28 21:58:34 Error in sys.excepthook:
2024-06-28 21:58:34
2024-06-28 21:58:34 Original exception was:
2024-06-28 21:58:35 [rank0]:[W CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
2024-06-28 22:08:23 2024-06-28 14:08:23,718 xinference.core.supervisor 45 INFO Xinference supervisor 0.0.0.0:16953 started
2024-06-28 22:08:24 2024-06-28 14:08:24,062 xinference.core.worker 45 INFO Starting metrics export server at 0.0.0.0:None
2024-06-28 22:08:24 2024-06-28 14:08:24,065 xinference.core.worker 45 INFO Checking metrics export server...
2024-06-28 22:00:48 (RayWorkerWrapper pid=854) ERROR 06-28 14:00:48 worker_base.py:148] Error executing method initialize_cache. This might cause deadlock in distributed execution.
2024-06-28 22:00:48 (RayWorkerWrapper pid=854) ERROR 06-28 14:00:48 worker_base.py:148] Traceback (most recent call last):
2024-06-28 22:00:48 (RayWorkerWrapper pid=854) ERROR 06-28 14:00:48 worker_base.py:148] File "/opt/conda/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 140, in execute_method
2024-06-28 22:00:48 (RayWorkerWrapper pid=854) ERROR 06-28 14:00:48 worker_base.py:148] return executor(*args, **kwargs)
2024-06-28 22:00:48 (RayWorkerWrapper pid=854) ERROR 06-28 14:00:48 worker_base.py:148] File "/opt/conda/lib/python3.10/site-packages/vllm/worker/worker.py", line 195, in initialize_cache
2024-06-28 22:00:48 (RayWorkerWrapper pid=854) ERROR 06-28 14:00:48 worker_base.py:148] self._warm_up_model()
2024-06-28 22:00:48 (RayWorkerWrapper pid=854) ERROR 06-28 14:00:48 worker_base.py:148] File "/opt/conda/lib/python3.10/site-packages/vllm/worker/worker.py", line 205, in _warm_up_model
2024-06-28 22:00:48 (RayWorkerWrapper pid=854) ERROR 06-28 14:00:48 worker_base.py:148] self.model_runner.capture_model(self.gpu_cache)
2024-06-28 22:00:48 (RayWorkerWrapper pid=854) ERROR 06-28 14:00:48 worker_base.py:148] File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
2024-06-28 22:00:48 (RayWorkerWrapper pid=854) ERROR 06-28 14:00:48 worker_base.py:148] return func(*args, **kwargs)
2024-06-28 22:00:48 (RayWorkerWrapper pid=854) ERROR 06-28 14:00:48 worker_base.py:148] File "/opt/conda/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 910, in capture_model
2024-06-28 22:00:48 (RayWorkerWrapper pid=854) ERROR 06-28 14:00:48 worker_base.py:148] graph_runner.capture(
2024-06-28 22:00:48 (RayWorkerWrapper pid=854) ERROR 06-28 14:00:48 worker_base.py:148] File "/opt/conda/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 959, in capture
2024-06-28 22:00:48 (RayWorkerWrapper pid=854) ERROR 06-28 14:00:48 worker_base.py:148] self.model(
2024-06-28 22:00:48 (RayWorkerWrapper pid=854) ERROR 06-28 14:00:48 worker_base.py:148] File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
2024-06-28 22:00:48 (RayWorkerWrapper pid=854) ERROR 06-28 14:00:48 worker_base.py:148] return self._call_impl(*args, **kwargs)
2024-06-28 22:00:48 (RayWorkerWrapper pid=854) ERROR 06-28 14:00:48 worker_base.py:148] File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
2024-06-28 22:00:48 (RayWorkerWrapper pid=854) ERROR 06-28 14:00:48 worker_base.py:148] return forward_call(*args, **kwargs)
2024-06-28 22:00:48 (RayWorkerWrapper pid=854) ERROR 06-28 14:00:48 worker_base.py:148] File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/models/qwen2.py", line 330, in forward
2024-06-28 22:00:48 (RayWorkerWrapper pid=854) ERROR 06-28 14:00:48 worker_base.py:148] hidden_states = self.model(input_ids, positions, kv_caches,
2024-06-28 22:00:48 (RayWorkerWrapper pid=854) ERROR 06-28 14:00:48 worker_base.py:148] File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
2024-06-28 22:00:48 (RayWorkerWrapper pid=854) ERROR 06-28 14:00:48 worker_base.py:148] return self._call_impl(*args, **kwargs)
2024-06-28 22:00:48 (RayWorkerWrapper pid=854) ERROR 06-28 14:00:48 worker_base.py:148] File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
2024-06-28 22:00:48 (RayWorkerWrapper pid=854) ERROR 06-28 14:00:48 worker_base.py:148] return forward_call(*args, **kwargs)
2024-06-28 22:00:48 (RayWorkerWrapper pid=854) ERROR 06-28 14:00:48 worker_base.py:148] File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/models/qwen2.py", line 254, in forward
2024-06-28 22:00:48 (RayWorkerWrapper pid=854) ERROR 06-28 14:00:48 worker_base.py:148] hidden_states, residual = layer(
2024-06-28 22:00:48 (RayWorkerWrapper pid=854) ERROR 06-28 14:00:48 worker_base.py:148] File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
2024-06-28 22:00:48 (RayWorkerWrapper pid=854) ERROR 06-28 14:00:48 worker_base.py:148] return self._call_impl(*args, **kwargs)
2024-06-28 22:00:48 (RayWorkerWrapper pid=854) ERROR 06-28 14:00:48 worker_base.py:148] File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
2024-06-28 22:00:48 (RayWorkerWrapper pid=854) ERROR 06-28 14:00:48 worker_base.py:148] return forward_call(*args, **kwargs)
2024-06-28 22:00:48 (RayWorkerWrapper pid=854) ERROR 06-28 14:00:48 worker_base.py:148] File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/models/qwen2.py", line 216, in forward
2024-06-28 22:00:48 (RayWorkerWrapper pid=854) ERROR 06-28 14:00:48 worker_base.py:148] hidden_states = self.mlp(hidden_states)
2024-06-28 22:00:48 (RayWorkerWrapper pid=854) ERROR 06-28 14:00:48 worker_base.py:148] File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
2024-06-28 22:00:48 (RayWorkerWrapper pid=854) ERROR 06-28 14:00:48 worker_base.py:148] return self._call_impl(*args, **kwargs)
2024-06-28 22:00:48 (RayWorkerWrapper pid=854) ERROR 06-28 14:00:48 worker_base.py:148] File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
2024-06-28 22:00:48 (RayWorkerWrapper pid=854) ERROR 06-28 14:00:48 worker_base.py:148] return forward_call(*args, **kwargs)
2024-06-28 22:00:48 (RayWorkerWrapper pid=854) ERROR 06-28 14:00:48 worker_base.py:148] File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/models/qwen2.py", line 75, in forward
2024-06-28 22:00:48 (RayWorkerWrapper pid=854) ERROR 06-28 14:00:48 worker_base.py:148] gate_up, _ = self.gate_up_proj(x)
2024-06-28 22:00:48 (RayWorkerWrapper pid=854) ERROR 06-28 14:00:48 worker_base.py:148] File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
2024-06-28 22:00:48 (RayWorkerWrapper pid=854) ERROR 06-28 14:00:48 worker_base.py:148] return self._call_impl(*args, **kwargs)
2024-06-28 22:00:48 (RayWorkerWrapper pid=854) ERROR 06-28 14:00:48 worker_base.py:148] File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
2024-06-28 22:00:48 (RayWorkerWrapper pid=854) ERROR 06-28 14:00:48 worker_base.py:148] return forward_call(*args, **kwargs)
2024-06-28 22:00:48 (RayWorkerWrapper pid=854) ERROR 06-28 14:00:48 worker_base.py:148] File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/layers/linear.py", line 283, in forward
2024-06-28 22:00:48 (RayWorkerWrapper pid=854) ERROR 06-28 14:00:48 worker_base.py:148] output_parallel = self.quant_method.apply(self, input, bias)
2024-06-28 22:00:48 (RayWorkerWrapper pid=854) ERROR 06-28 14:00:48 worker_base.py:148] File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/layers/quantization/gptq.py", line 218, in apply
2024-06-28 22:00:48 (RayWorkerWrapper pid=854) ERROR 06-28 14:00:48 worker_base.py:148] output = ops.gptq_gemm(reshaped_x, layer.qweight, layer.qzeros,
2024-06-28 22:00:48 (RayWorkerWrapper pid=854) ERROR 06-28 14:00:48 worker_base.py:148] File "/opt/conda/lib/python3.10/site-packages/vllm/_custom_ops.py", line 145, in gptq_gemm
2024-06-28 22:00:48 (RayWorkerWrapper pid=854) ERROR 06-28 14:00:48 worker_base.py:148] return vllm_ops.gptq_gemm(a, b_q_weight, b_gptq_qzeros, b_gptq_scales,
2024-06-28 22:00:48 (RayWorkerWrapper pid=854) ERROR 06-28 14:00:48 worker_base.py:148] RuntimeError: CUDA out of memory. Tried to allocate 140.00 MiB. GPU has a total capacity of 22.00 GiB of which 2.21 GiB is free. Process 65 has 17179869184.00 GiB memory in use. Including non-PyTorch memory, this process has 17179869184.00 GiB memory in use. Of the allocated memory 18.13 GiB is allocated by PyTorch, and 116.27 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2024-06-28 22:00:48 (RayWorkerWrapper pid=854) ERROR 06-28 14:00:48 worker_base.py:148] Exception raised from malloc at ../c10/cuda/CUDACachingAllocator.cpp:1143 (most recent call first):
2024-06-28 22:00:48 (RayWorkerWrapper pid=854) ERROR 06-28 14:00:48 worker_base.py:148] frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f166c1f7897 in /opt/conda/lib/python3.10/site-packages/torch/lib/libc10.so)
2024-06-28 22:00:48 (RayWorkerWrapper pid=854) ERROR 06-28 14:00:48 worker_base.py:148] frame #1:
This issue is stale because it has been open for 7 days with no activity.
I deploy on a Kubernetes cluster with the helm chart. By default a Pod is only allowed up to 64MB of shared memory, so my temporary workaround is to give the worker Pod shared memory directly via an emptyDir, which is equivalent to Docker's --shm-size parameter.
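For reference, a minimal sketch of that workaround (the container name and the 20Gi size are placeholders; sizeLimit plays the role of --shm-size):

# Worker Pod spec fragment (sketch): back /dev/shm with a memory-backed emptyDir
spec:
  containers:
    - name: xinference-worker        # placeholder container name
      image: xprobe/xinference:v0.12.0
      volumeMounts:
        - name: dshm
          mountPath: /dev/shm
  volumes:
    - name: dshm
      emptyDir:
        medium: Memory               # tmpfs-backed, i.e. shared memory
        sizeLimit: 20Gi              # Kubernetes counterpart of --shm-size 20g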
same issue