vLLM 0.5.4 fails to start in TP+PP mode on 8 Intel Arc GPUs
The vLLM Docker image is:
intelanalytics/ipex-llm-serving-xpu-vllm-0.5.4-experimental:2.2.0b1
The vLLM start command is:
model="/llm/models/Qwen2-72B-Instruct/"
served_model_name="Qwen2-72B-Instruct"
source /opt/intel/1ccl-wks/setvars.sh
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=2
python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server \
  --served-model-name $served_model_name \
  --port 8000 \
  --model $model \
  --trust-remote-code \
  --gpu-memory-utilization 0.85 \
  --device xpu \
  --dtype float16 \
  --enforce-eager \
  --load-in-low-bit fp8 \
  --max-model-len 2048 \
  --max-num-batched-tokens 2048 \
  --max-num-seqs 24 \
  -tp 4 -pp 2 --disable-log-requests
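For reference, once the server does come up, the OpenAI-compatible endpoint can be smoke-tested with a plain HTTP request. This is a minimal client-side sketch, assuming the server is reachable at localhost:8000 from the same host (the prompt and max_tokens values are arbitrary, not part of the original report):
import requests  # simple client-side check against the OpenAI-compatible API

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "Qwen2-72B-Instruct",
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 32,
    },
    timeout=60,
)
print(resp.status_code, resp.json())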
The error information is
(WrapperWithLoadBit pid=35347) 2024:09:13-11:21:50:(35347) |CCL_ERROR| exchange_utils.cpp:202 sendmsg_fd: condition !check_msg_retval("sendmsg", send_bytes, iov, msg, sizeof(u.cntr_buf), sock, fd) failed
(WrapperWithLoadBit pid=35347) errno: Broken pipe
2024:09:13-11:21:50:(31157) |CCL_ERROR| exchange_utils.cpp:202 sendmsg_fd: condition !check_msg_retval("sendmsg", send_bytes, iov, msg, sizeof(u.cntr_buf), sock, fd) failed
errno: Broken pipe
(WrapperWithLoadBit pid=35347) ERROR 09-13 11:21:50 worker_base.py:386] Error executing method init_device. This might cause deadlock in distributed execution.
(WrapperWithLoadBit pid=35347) ERROR 09-13 11:21:50 worker_base.py:386] Traceback (most recent call last):
(WrapperWithLoadBit pid=35347) ERROR 09-13 11:21:50 worker_base.py:386] File "/usr/local/lib/python3.11/dist-packages/vllm-0.5.4+xpu-py3.11-linux-x86_64.egg/vllm/worker/worker_base.py", line 378, in execute_method
(WrapperWithLoadBit pid=35347) ERROR 09-13 11:21:50 worker_base.py:386] return executor(*args, **kwargs)
(WrapperWithLoadBit pid=35347) ERROR 09-13 11:21:50 worker_base.py:386] ^^^^^^^^^^^^^^^^^^^^^^^^^
(WrapperWithLoadBit pid=35347) ERROR 09-13 11:21:50 worker_base.py:386] File "/usr/local/lib/python3.11/dist-packages/vllm-0.5.4+xpu-py3.11-linux-x86_64.egg/vllm/worker/xpu_worker.py", line 105, in init_device
(WrapperWithLoadBit pid=35347) ERROR 09-13 11:21:50 worker_base.py:386] self.init_worker_distributed_environment()
(WrapperWithLoadBit pid=35347) ERROR 09-13 11:21:50 worker_base.py:386] File "/usr/local/lib/python3.11/dist-packages/vllm-0.5.4+xpu-py3.11-linux-x86_64.egg/vllm/worker/xpu_worker.py", line 205, in init_worker_distributed_environment
(WrapperWithLoadBit pid=35347) ERROR 09-13 11:21:50 worker_base.py:386] get_pp_group().all_reduce(torch.zeros(1).xpu())
(WrapperWithLoadBit pid=35347) ERROR 09-13 11:21:50 worker_base.py:386] File "/usr/local/lib/python3.11/dist-packages/vllm-0.5.4+xpu-py3.11-linux-x86_64.egg/vllm/distributed/parallel_state.py", line 293, in all_reduce
(WrapperWithLoadBit pid=35347) ERROR 09-13 11:21:50 worker_base.py:386] torch.distributed.all_reduce(input_, group=self.device_group)
(WrapperWithLoadBit pid=35347) ERROR 09-13 11:21:50 worker_base.py:386] File "/usr/local/lib/python3.11/dist-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
(WrapperWithLoadBit pid=35347) ERROR 09-13 11:21:50 worker_base.py:386] return func(*args, **kwargs)
(WrapperWithLoadBit pid=35347) ERROR 09-13 11:21:50 worker_base.py:386] ^^^^^^^^^^^^^^^^^^^^^
(WrapperWithLoadBit pid=35347) ERROR 09-13 11:21:50 worker_base.py:386] File "/usr/local/lib/python3.11/dist-packages/torch/distributed/distributed_c10d.py", line 2055, in all_reduce
(WrapperWithLoadBit pid=35347) ERROR 09-13 11:21:50 worker_base.py:386] work.wait()
(WrapperWithLoadBit pid=35347) ERROR 09-13 11:21:50 worker_base.py:386] RuntimeError: oneCCL: exchange_utils.cpp:202 sendmsg_fd: EXCEPTION: errno: Broken pipe
ERROR 09-13 11:21:51 worker_base.py:386] Error executing method init_device. This might cause deadlock in distributed execution.
ERROR 09-13 11:21:51 worker_base.py:386] Traceback (most recent call last):
ERROR 09-13 11:21:51 worker_base.py:386] File "/usr/local/lib/python3.11/dist-packages/vllm-0.5.4+xpu-py3.11-linux-x86_64.egg/vllm/worker/worker_base.py", line 378, in execute_method
ERROR 09-13 11:21:51 worker_base.py:386] return executor(*args, **kwargs)
ERROR 09-13 11:21:51 worker_base.py:386] ^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-13 11:21:51 worker_base.py:386] File "/usr/local/lib/python3.11/dist-packages/vllm-0.5.4+xpu-py3.11-linux-x86_64.egg/vllm/worker/xpu_worker.py", line 105, in init_device
ERROR 09-13 11:21:51 worker_base.py:386] self.init_worker_distributed_environment()
ERROR 09-13 11:21:51 worker_base.py:386] File "/usr/local/lib/python3.11/dist-packages/vllm-0.5.4+xpu-py3.11-linux-x86_64.egg/vllm/worker/xpu_worker.py", line 205, in init_worker_distributed_environment
ERROR 09-13 11:21:51 worker_base.py:386] get_pp_group().all_reduce(torch.zeros(1).xpu())
ERROR 09-13 11:21:51 worker_base.py:386] File "/usr/local/lib/python3.11/dist-packages/vllm-0.5.4+xpu-py3.11-linux-x86_64.egg/vllm/distributed/parallel_state.py", line 293, in all_reduce
ERROR 09-13 11:21:51 worker_base.py:386] torch.distributed.all_reduce(input_, group=self.device_group)
ERROR 09-13 11:21:51 worker_base.py:386] File "/usr/local/lib/python3.11/dist-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
ERROR 09-13 11:21:51 worker_base.py:386] return func(*args, **kwargs)
ERROR 09-13 11:21:51 worker_base.py:386] ^^^^^^^^^^^^^^^^^^^^^
ERROR 09-13 11:21:51 worker_base.py:386] File "/usr/local/lib/python3.11/dist-packages/torch/distributed/distributed_c10d.py", line 2055, in all_reduce
ERROR 09-13 11:21:51 worker_base.py:386] work.wait()
ERROR 09-13 11:21:51 worker_base.py:386] RuntimeError: oneCCL: exchange_utils.cpp:202 sendmsg_fd: EXCEPTION: errno: Broken pipe
Process Process-65:
Traceback (most recent call last):
File "/usr/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/lib/python3.11/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.11/dist-packages/ipex_llm/vllm/xpu/entrypoints/openai/rpc/server.py", line 220, in run_rpc_server
server = AsyncEngineRPCServer(async_engine_args, usage_context, port, load_in_low_bit)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/ipex_llm/vllm/xpu/entrypoints/openai/rpc/server.py", line 27, in __init__
self.engine = AsyncLLMEngine.from_engine_args(async_engine_args,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/ipex_llm/vllm/xpu/engine/engine.py", line 43, in from_engine_args
return super().from_engine_args(engine_args, start_engine_loop, usage_context, stat_loggers)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/vllm-0.5.4+xpu-py3.11-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 476, in from_engine_args
engine = cls(
^^^^
File "/usr/local/lib/python3.11/dist-packages/ipex_llm/vllm/xpu/engine/engine.py", line 29, in __init__
super().__init__(*args, **kwargs)
File "/usr/local/lib/python3.11/dist-packages/vllm-0.5.4+xpu-py3.11-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 381, in __init__
self.engine = self._init_engine(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/vllm-0.5.4+xpu-py3.11-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 557, in _init_engine
return engine_class(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/vllm-0.5.4+xpu-py3.11-linux-x86_64.egg/vllm/engine/llm_engine.py", line 255, in __init__
self.model_executor = executor_class(
^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/vllm-0.5.4+xpu-py3.11-linux-x86_64.egg/vllm/executor/ray_xpu_executor.py", line 35, in __init__
super().__init__(*args, **kwargs)
File "/usr/local/lib/python3.11/dist-packages/vllm-0.5.4+xpu-py3.11-linux-x86_64.egg/vllm/executor/ray_gpu_executor.py", line 555, in __init__
super().__init__(*args, **kwargs)
File "/usr/local/lib/python3.11/dist-packages/vllm-0.5.4+xpu-py3.11-linux-x86_64.egg/vllm/executor/distributed_gpu_executor.py", line 25, in __init__
super().__init__(*args, **kwargs)
File "/usr/local/lib/python3.11/dist-packages/vllm-0.5.4+xpu-py3.11-linux-x86_64.egg/vllm/executor/xpu_executor.py", line 53, in __init__
self._init_executor()
File "/usr/local/lib/python3.11/dist-packages/vllm-0.5.4+xpu-py3.11-linux-x86_64.egg/vllm/executor/ray_gpu_executor.py", line 61, in _init_executor
self._init_workers_ray(placement_group)
File "/usr/local/lib/python3.11/dist-packages/vllm-0.5.4+xpu-py3.11-linux-x86_64.egg/vllm/executor/ray_gpu_executor.py", line 230, in _init_workers_ray
self._run_workers("init_device")
File "/usr/local/lib/python3.11/dist-packages/vllm-0.5.4+xpu-py3.11-linux-x86_64.egg/vllm/executor/ray_gpu_executor.py", line 468, in _run_workers
self.driver_worker.execute_method(method, *driver_args,
File "/usr/local/lib/python3.11/dist-packages/vllm-0.5.4+xpu-py3.11-linux-x86_64.egg/vllm/worker/worker_base.py", line 387, in execute_method
raise e
File "/usr/local/lib/python3.11/dist-packages/vllm-0.5.4+xpu-py3.11-linux-x86_64.egg/vllm/worker/worker_base.py", line 378, in execute_method
return executor(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/vllm-0.5.4+xpu-py3.11-linux-x86_64.egg/vllm/worker/xpu_worker.py", line 105, in init_device
self.init_worker_distributed_environment()
File "/usr/local/lib/python3.11/dist-packages/vllm-0.5.4+xpu-py3.11-linux-x86_64.egg/vllm/worker/xpu_worker.py", line 205, in init_worker_distributed_environment
get_pp_group().all_reduce(torch.zeros(1).xpu())
File "/usr/local/lib/python3.11/dist-packages/vllm-0.5.4+xpu-py3.11-linux-x86_64.egg/vllm/distributed/parallel_state.py", line 293, in all_reduce
torch.distributed.all_reduce(input_, group=self.device_group)
File "/usr/local/lib/python3.11/dist-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/torch/distributed/distributed_c10d.py", line 2055, in all_reduce
work.wait()
The workaround is to edit the file:
vi /usr/local/lib/python3.11/dist-packages/vllm-0.5.4+xpu-py3.11-linux-x86_64.egg/vllm/worker/xpu_worker.py
and replace the all_reduce warm-up call with all_gather:
get_pp_group().all_gather(torch.zeros(1).xpu())
# get_pp_group().all_reduce(torch.zeros(1).xpu())
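For clarity, this is roughly what the patched warm-up call looks like in context. It is a sketch reconstructed from the traceback above (around line 205 of xpu_worker.py), not the verbatim vLLM 0.5.4 source, and the import path is an assumption:
import torch
from vllm.distributed import get_pp_group  # assumed import; the helper appears in the traceback above

# Inside XPUWorker.init_worker_distributed_environment() in xpu_worker.py (~line 205):
# original warm-up collective on the pipeline-parallel group, which triggers the
# oneCCL "sendmsg_fd ... Broken pipe" failure on this 8x Arc setup:
# get_pp_group().all_reduce(torch.zeros(1).xpu())
# workaround: exercise the pipeline-parallel group with all_gather instead:
get_pp_group().all_gather(torch.zeros(1).xpu())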
After modifying /usr/local/lib/python3.11/dist-packages/vllm-0.5.4+xpu-py3.11-linux-x86_64.egg/vllm/worker/xpu_worker.py to use get_pp_group().all_gather(torch.zeros(1).xpu()), vLLM starts but then fails with the following error:
2024:09:14-11:02:35:( 241) |CCL_WARN| topology recognition shows PCIe connection between devices. If this is not correct, you can disable topology recognition, with CCL_TOPO_FABRIC_VERTEX_CONNECTION_CHECK=0. This will assume XeLinks across devices
2024:09:14-11:02:35:( 241) |CCL_WARN| topology recognition shows PCIe connection between devices. If this is not correct, you can disable topology recognition, with CCL_TOPO_FABRIC_VERTEX_CONNECTION_CHECK=0. This will assume XeLinks across devices
-----> current rank: 0, world size: 4, byte_count: 33554432
(WrapperWithLoadBit pid=3548) -----> current rank: 1, world size: 4, byte_count: 33554432
(WrapperWithLoadBit pid=4874) INFO 09-14 11:01:33 selector.py:127] Cannot use _Backend.FLASH_ATTN backend on XPU. [repeated 13x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
(WrapperWithLoadBit pid=4874) INFO 09-14 11:01:33 selector.py:76] Using IPEX attention backend. [repeated 13x across cluster]
(WrapperWithLoadBit pid=4874) 2024:09:14-11:01:32:( 4874) |CCL_WARN| did not find MPI-launcher specific variables, switch to ATL/OFI, to force enable ATL/MPI set CCL_ATL_TRANSPORT=mpi [repeated 6x across cluster]
(WrapperWithLoadBit pid=4874) 2024:09:14-11:01:33:( 5291) |CCL_WARN| no membind support for NUMA node 1, skip thread membind [repeated 6x across cluster]
(WrapperWithLoadBit pid=3548) 2024:09:14-11:02:35:( 3548) |CCL_WARN| topology recognition shows PCIe connection between devices. If this is not correct, you can disable topology recognition, with CCL_TOPO_FABRIC_VERTEX_CONNECTION_CHECK=0. This will assume XeLinks across devices [repeated 392x across cluster]
(WrapperWithLoadBit pid=4211) -----> current rank: 0, world size: 4, byte_count: 33554432 [repeated 3x across cluster]
(WrapperWithLoadBit pid=4211) 2024:09:14-11:02:42:( 4211) |CCL_WARN| topology recognition shows PCIe connection between devices. If this is not correct, you can disable topology recognition, with CCL_TOPO_FABRIC_VERTEX_CONNECTION_CHECK=0. This will assume XeLinks across devices [repeated 168x across cluster]
(WrapperWithLoadBit pid=4211) GPU-Xeon4410Y-ARC770:rank4: Assertion failure at psm3/ptl_am/ptl.c:196: nbytes == req->req_data.recv_msglen
(WrapperWithLoadBit pid=4211) 2024-09-14 11:02:34,843 - INFO - Loading model weights took 9.7260 GB [repeated 6x across cluster]
(WrapperWithLoadBit pid=4211) [1726282973.117141429] GPU-Xeon4410Y-ARC770:rank4.perWithLoadBit.execute_method: Reading from remote process' memory failed. Disabling CMA support
(WrapperWithLoadBit pid=4874) -----> current rank: 3, world size: 4, byte_count: 33554432 [repeated 3x across cluster]
(WrapperWithLoadBit pid=4874) 2024:09:14-11:02:42:( 4874) |CCL_WARN| topology recognition shows PCIe connection between devices. If this is not correct, you can disable topology recognition, with CCL_TOPO_FABRIC_VERTEX_CONNECTION_CHECK=0. This will assume XeLinks across devices [repeated 168x across cluster]
(WrapperWithLoadBit pid=4211) /usr/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
(WrapperWithLoadBit pid=4211) warnings.warn('resource_tracker: There appear to be %d '
(raylet) A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffffb4b97d156cb581f7dbb82f2701000000 Worker ID: 3604d35e3933d71bde300919a86052b60242e5e8b0941e93374b84dc Node ID: 5c6f23e2660bea6d8af3fd5fd8ab94aa9233ae8d1bbfa29dfe86f788 Worker IP address: 10.240.108.91 Worker port: 38205 Worker PID: 4211 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
Process Process-65:
Traceback (most recent call last):
File "/usr/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/lib/python3.11/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.11/dist-packages/ipex_llm/vllm/xpu/entrypoints/openai/rpc/server.py", line 220, in run_rpc_server
server = AsyncEngineRPCServer(async_engine_args, usage_context, port, load_in_low_bit)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/ipex_llm/vllm/xpu/entrypoints/openai/rpc/server.py", line 27, in init
self.engine = AsyncLLMEngine.from_engine_args(async_engine_args,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/ipex_llm/vllm/xpu/engine/engine.py", line 43, in from_engine_args
return super().from_engine_args(engine_args, start_engine_loop, usage_context, stat_loggers)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/vllm-0.5.4+xpu-py3.11-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 476, in from_engine_args
engine = cls(
^^^^
File "/usr/local/lib/python3.11/dist-packages/ipex_llm/vllm/xpu/engine/engine.py", line 29, in init
super().init(*args, **kwargs)
File "/usr/local/lib/python3.11/dist-packages/vllm-0.5.4+xpu-py3.11-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 381, in init
self.engine = self._init_engine(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/vllm-0.5.4+xpu-py3.11-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 557, in _init_engine
return engine_class(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/vllm-0.5.4+xpu-py3.11-linux-x86_64.egg/vllm/engine/llm_engine.py", line 270, in init
self._initialize_kv_caches()
File "/usr/local/lib/python3.11/dist-packages/vllm-0.5.4+xpu-py3.11-linux-x86_64.egg/vllm/engine/llm_engine.py", line 369, in _initialize_kv_caches
self.model_executor.determine_num_available_blocks())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/vllm-0.5.4+xpu-py3.11-linux-x86_64.egg/vllm/executor/distributed_gpu_executor.py", line 38, in determine_num_available_blocks
num_blocks = self._run_workers("determine_num_available_blocks", )
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/vllm-0.5.4+xpu-py3.11-linux-x86_64.egg/vllm/executor/ray_gpu_executor.py", line 481, in _run_workers
ray_worker_outputs = ray.get(ray_worker_outputs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/ray/_private/worker.py", line 2661, in get
values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/ray/_private/worker.py", line 873, in get_objects
raise value
ray.exceptions.ActorDiedError: The actor died unexpectedly before finishing this task.
class_name: get_ipex_llm_wrapper.
The first issue (the oneCCL sendmsg_fd "Broken pipe" error raised from init_device) can be solved by this modification:
vi /usr/local/lib/python3.11/dist-packages/vllm-0.5.4+xpu-py3.11-linux-x86_64.egg/vllm/worker/xpu_worker.py
get_pp_group().all_gather(torch.zeros(1).xpu())
# get_pp_group().all_reduce(torch.zeros(1).xpu())
We were unable to reproduce the second issue (the PSM3 assertion failure and the subsequent Ray worker crash) in our environment. It may be related to settings in the container startup script.
fixed