Fatal Python error: Segmentation fault
Describe the bug
This error causes vLLM to stop serving requests, although the container stays up. Running on 4 Arc GPUs, 45 of 50 tests fail with this error. Running on a single Arc GPU, all 50 tests pass with no errors.
How to reproduce
Steps to reproduce the error (vLLM input parameters):
export SHM_SIZE="32g"
export DTYPE="float16"
export QUANTIZATION="fp8"
export MAX_MODEL_LEN="8192"
export MAX_NUM_BATCHED_TOKENS="8192"
export MAX_NUM_SEQS="256"
export LLM_MODEL_ID="Qwen/Qwen2.5-Coder-7B-Instruct"
export LLM_MODEL_LOCAL_PATH="/data/Qwen/Qwen2.5-Coder-7B-Instruct"
export GPU_AFFINITY="1,2,3,4"
export TENSOR_PARALLEL_SIZE=4
export TAG=1.2
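Note: docker compose substitutes the ${...} references in the compose file below from the shell environment, so the exports above have to be set in the same shell that later runs docker compose. As a sketch of an equivalent setup, the same values could instead be placed in a .env file next to the compose file, which docker compose reads automatically:

# .env (same key=value pairs as the exports above, without the "export" keyword)
SHM_SIZE=32g
DTYPE=float16
QUANTIZATION=fp8
MAX_MODEL_LEN=8192
MAX_NUM_BATCHED_TOKENS=8192
MAX_NUM_SEQS=256
LLM_MODEL_ID=Qwen/Qwen2.5-Coder-7B-Instruct
LLM_MODEL_LOCAL_PATH=/data/Qwen/Qwen2.5-Coder-7B-Instruct
GPU_AFFINITY=1,2,3,4
TENSOR_PARALLEL_SIZE=4
TAG=1.2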
services:
  vllm-service:
    image: intelanalytics/ipex-llm-serving-xpu:2.2.0-b14
    container_name: vllm-service
    ports:
      - "${LLM_ENDPOINT_PORT:-8008}:80"
    privileged: true
    ipc: host
    devices:
      - "/dev/dri:/dev/dri"
    volumes:
      - "${MODEL_CACHE:-./data}:/data"
    shm_size: ${SHM_SIZE:-8g}
    environment:
      no_proxy: ${no_proxy}
      http_proxy: ${http_proxy}
      https_proxy: ${https_proxy}
      HUGGINGFACE_HUB_CACHE: "/data"
      LLM_MODEL_ID: ${LLM_MODEL_ID}
      LLM_MODEL_LOCAL_PATH: ${LLM_MODEL_LOCAL_PATH}
      VLLM_TORCH_PROFILER_DIR: "/mnt"
      DTYPE: ${DTYPE:-float16}
      QUANTIZATION: ${QUANTIZATION:-fp8}
      MAX_MODEL_LEN: ${MAX_MODEL_LEN:-2048}
      MAX_NUM_BATCHED_TOKENS: ${MAX_NUM_BATCHED_TOKENS:-4000}
      MAX_NUM_SEQS: ${MAX_NUM_SEQS:-256}
      TENSOR_PARALLEL_SIZE: ${TENSOR_PARALLEL_SIZE:-1}
    healthcheck:
      test: ["CMD-SHELL", "curl -f http://vllm-service:80/health || exit 1"]
      interval: 10s
      timeout: 10s
      retries: 100
    entrypoint: /bin/bash -c "export CCL_WORKER_COUNT=2 &&
      export SYCL_CACHE_PERSISTENT=1 &&
      export FI_PROVIDER=shm &&
      export CCL_ATL_TRANSPORT=ofi &&
      export CCL_ZE_IPC_EXCHANGE=sockets &&
      export CCL_ATL_SHM=1 &&
      export USE_XETLA=OFF &&
      export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=2 &&
      export TORCH_LLM_ALLREDUCE=0 &&
      export CCL_SAME_STREAM=1 &&
      export CCL_BLOCKING_WAIT=0 &&
      export ZE_AFFINITY_MASK=$GPU_AFFINITY &&
      python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server \
        --served-model-name $LLM_MODEL_ID \
        --model $LLM_MODEL_LOCAL_PATH \
        --port 80 \
        --trust-remote-code \
        --block-size 8 \
        --gpu-memory-utilization 0.95 \
        --device xpu \
        --dtype $DTYPE \
        --enforce-eager \
        --load-in-low-bit $QUANTIZATION \
        --max-model-len $MAX_MODEL_LEN \
        --max-num-batched-tokens $MAX_NUM_BATCHED_TOKENS \
        --max-num-seqs $MAX_NUM_SEQS \
        --tensor-parallel-size $TENSOR_PARALLEL_SIZE \
        --disable-async-output-proc \
        --distributed-executor-backend ray"
networks:
  default:
    driver: bridge
model: Qwen/Qwen2.5-Coder-7B-Instruct
modelscope download --model Qwen/Qwen2.5-Coder-7B-Instruct --local_dir ./data/Qwen/Qwen2.5-Coder-7B-Instruct
docker compose up -d
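The 50 tests mentioned above are not attached; as a rough sketch of the kind of load that triggers the failure (assuming the host port mapping 8008:80 and the served model name from the compose file above; the actual test harness may differ), a simple request loop against the OpenAI-compatible endpoint looks like this:

# Sketch: send 50 chat-completions requests to the running service.
# In the reported runs, 45 of 50 such requests fail with 4 GPUs,
# while all 50 succeed with a single GPU.
for i in $(seq 1 50); do
  curl -s http://localhost:8008/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "Qwen/Qwen2.5-Coder-7B-Instruct",
         "messages": [{"role": "user", "content": "Write hello world in Python."}],
         "max_tokens": 64}' > /dev/null || echo "request $i failed"
done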
Environment information
Output of the environment check script, obtained using:
- https://github.com/intel/ipex-llm/blob/main/python/llm/scripts/env-check.bat, or
- https://github.com/intel/ipex-llm/blob/main/python/llm/scripts/env-check.sh
-----------------------------------------------------------------
PYTHON_VERSION=3.10.12
-----------------------------------------------------------------
Transformers is not installed.
-----------------------------------------------------------------
PyTorch is not installed.
-----------------------------------------------------------------
ipex-llm WARNING: Package(s) not found: ipex-llm
-----------------------------------------------------------------
IPEX is not installed.
-----------------------------------------------------------------
CPU Information:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 52 bits physical, 57 bits virtual
Byte Order: Little Endian
CPU(s): 48
On-line CPU(s) list: 0-47
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) Silver 4410Y
CPU family: 6
Model: 143
Thread(s) per core: 2
Core(s) per socket: 12
Socket(s): 2
Stepping: 8
CPU max MHz: 3900.0000
CPU min MHz: 800.0000
BogoMIPS: 4000.00
-----------------------------------------------------------------
Total CPU Memory: 503.547 GB
Memory Type:
-----------------------------------------------------------------
Operating System:
Ubuntu 22.04.1 LTS \n \l
-----------------------------------------------------------------
Linux arc003 6.8.0-49-generic #49~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Wed Nov 6 17:42:15 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
-----------------------------------------------------------------
CLI:
Version: 1.2.33.20250307
Build ID: 00000000
Service:
Version: 1.2.33.20250307
Build ID: 00000000
Level Zero Version: 1.14.0
-----------------------------------------------------------------
Driver UUID 32332e34-332e-3237-3634-322e36390000
Driver Version 23.43.27642.69
Driver UUID 32332e34-332e-3237-3634-322e36390000
Driver Version 23.43.27642.69
Driver UUID 32332e34-332e-3237-3634-322e36390000
Driver Version 23.43.27642.69
Driver UUID 32332e34-332e-3237-3634-322e36390000
Driver Version 23.43.27642.69
Driver UUID 32332e34-332e-3237-3634-322e36390000
Driver Version 23.43.27642.69
Driver UUID 32332e34-332e-3237-3634-322e36390000
Driver Version 23.43.27642.69
Driver UUID 32332e34-332e-3237-3634-322e36390000
Driver Version 23.43.27642.69
Driver UUID 32332e34-332e-3237-3634-322e36390000
Driver Version 23.43.27642.69
-----------------------------------------------------------------
Driver related package version:
ii intel-fw-gpu 2025.13.2-398~22.04 all Firmware package for Intel integrated and discrete GPUs
ii intel-i915-dkms 1.23.10.92.231129.101+i141-1 all Out of tree i915 driver.
ii intel-level-zero-gpu 1.3.27642.69-803.145~22.04 amd64 Intel(R) Graphics Compute Runtime for oneAPI Level Zero.
ii level-zero-dev 1.14.0-803.123~22.04 amd64 Intel(R) Graphics Compute Runtime for oneAPI Level Zero.
-----------------------------------------------------------------
./env-check.sh: line 167: sycl-ls: command not found
igpu not detected
-----------------------------------------------------------------
xpu-smi is properly installed.
-----------------------------------------------------------------
+-----------+--------------------------------------------------------------------------------------+
| Device ID | Device Information |
+-----------+--------------------------------------------------------------------------------------+
| 0 | Device Name: Intel(R) Arc(TM) A770 Graphics |
| | Vendor Name: Intel(R) Corporation |
| | SOC UUID: 00000000-0000-0019-0000-000856a08086 |
| | PCI BDF Address: 0000:19:00.0 |
| | DRM Device: /dev/dri/card0 |
| | Function Type: physical |
+-----------+--------------------------------------------------------------------------------------+
| 1 | Device Name: Intel(R) Arc(TM) A770 Graphics |
| | Vendor Name: Intel(R) Corporation |
| | SOC UUID: 00000000-0000-002c-0000-000856a08086 |
| | PCI BDF Address: 0000:2c:00.0 |
| | DRM Device: /dev/dri/card2 |
| | Function Type: physical |
+-----------+--------------------------------------------------------------------------------------+
| 2 | Device Name: Intel(R) Arc(TM) A770 Graphics |
| | Vendor Name: Intel(R) Corporation |
| | SOC UUID: 00000000-0000-0052-0000-000856a08086 |
| | PCI BDF Address: 0000:52:00.0 |
| | DRM Device: /dev/dri/card3 |
| | Function Type: physical |
+-----------+--------------------------------------------------------------------------------------+
| 3 | Device Name: Intel(R) Arc(TM) A770 Graphics |
| | Vendor Name: Intel(R) Corporation |
| | SOC UUID: 00000000-0000-0065-0000-000856a08086 |
| | PCI BDF Address: 0000:65:00.0 |
| | DRM Device: /dev/dri/card4 |
| | Function Type: physical |
+-----------+--------------------------------------------------------------------------------------+
| 4 | Device Name: Intel(R) Arc(TM) A770 Graphics |
| | Vendor Name: Intel(R) Corporation |
| | SOC UUID: 00000000-0000-009b-0000-000856a08086 |
| | PCI BDF Address: 0000:9b:00.0 |
| | DRM Device: /dev/dri/card5 |
| | Function Type: physical |
+-----------+--------------------------------------------------------------------------------------+
| 5 | Device Name: Intel(R) Arc(TM) A770 Graphics |
| | Vendor Name: Intel(R) Corporation |
| | SOC UUID: 00000000-0000-00ad-0000-000856a08086 |
| | PCI BDF Address: 0000:ad:00.0 |
| | DRM Device: /dev/dri/card6 |
| | Function Type: physical |
+-----------+--------------------------------------------------------------------------------------+
| 6 | Device Name: Intel(R) Arc(TM) A770 Graphics |
| | Vendor Name: Intel(R) Corporation |
| | SOC UUID: 00000000-0000-00d1-0000-000856a08086 |
| | PCI BDF Address: 0000:d1:00.0 |
| | DRM Device: /dev/dri/card7 |
| | Function Type: physical |
+-----------+--------------------------------------------------------------------------------------+
| 7 | Device Name: Intel(R) Arc(TM) A770 Graphics |
| | Vendor Name: Intel(R) Corporation |
| | SOC UUID: 00000000-0000-00e3-0000-000856a08086 |
| | PCI BDF Address: 0000:e3:00.0 |
| | DRM Device: /dev/dri/card8 |
| | Function Type: physical |
+-----------+--------------------------------------------------------------------------------------+
GPU0 Memory size=16M
GPU1 Memory size=16G
GPU2 Memory size=16G
GPU3 Memory size=16G
GPU4 Memory size=16G
GPU5 Memory size=16G
GPU6 Memory size=16G
GPU7 Memory size=16G
GPU8 Memory size=16G
-----------------------------------------------------------------
03:00.0 VGA compatible controller: ASPEED Technology, Inc. ASPEED Graphics Family (rev 52) (prog-if 00 [VGA controller])
DeviceName: Onboard VGA
Subsystem: ASPEED Technology, Inc. ASPEED Graphics Family
Flags: medium devsel, IRQ 16, NUMA node 0
Memory at 94000000 (32-bit, non-prefetchable) [size=16M]
Memory at 95000000 (32-bit, non-prefetchable) [size=256K]
I/O ports at 2000 [size=128]
Capabilities: [40] Power Management version 3
Capabilities: [50] MSI: Enable- Count=1/4 Maskable- 64bit+
Kernel driver in use: ast
--
19:00.0 VGA compatible controller: Intel Corporation Device 56a0 (rev 08) (prog-if 00 [VGA controller])
Subsystem: Device 1ef7:1334
Flags: bus master, fast devsel, latency 0, IRQ 228, NUMA node 0
Memory at 9e000000 (64-bit, non-prefetchable) [size=16M]
Memory at 5f800000000 (64-bit, prefetchable) [size=16G]
Expansion ROM at 9f000000 [disabled] [size=2M]
Capabilities: [40] Vendor Specific Information: Len=0c <?>
Capabilities: [70] Express Endpoint, MSI 00
Capabilities: [ac] MSI: Enable+ Count=1/1 Maskable+ 64bit+
--
2c:00.0 VGA compatible controller: Intel Corporation Device 56a0 (rev 08) (prog-if 00 [VGA controller])
Subsystem: Device 1ef7:1334
Flags: bus master, fast devsel, latency 0, IRQ 231, NUMA node 0
Memory at a8000000 (64-bit, non-prefetchable) [size=16M]
Memory at 6f800000000 (64-bit, prefetchable) [size=16G]
Expansion ROM at a9000000 [disabled] [size=2M]
Capabilities: [40] Vendor Specific Information: Len=0c <?>
Capabilities: [70] Express Endpoint, MSI 00
Capabilities: [ac] MSI: Enable+ Count=1/1 Maskable+ 64bit+
--
52:00.0 VGA compatible controller: Intel Corporation Device 56a0 (rev 08) (prog-if 00 [VGA controller])
Subsystem: Device 1ef7:1334
Flags: bus master, fast devsel, latency 0, IRQ 235, NUMA node 0
Memory at bc000000 (64-bit, non-prefetchable) [size=16M]
Memory at 8f800000000 (64-bit, prefetchable) [size=16G]
Expansion ROM at bd000000 [disabled] [size=2M]
Capabilities: [40] Vendor Specific Information: Len=0c <?>
Capabilities: [70] Express Endpoint, MSI 00
Capabilities: [ac] MSI: Enable+ Count=1/1 Maskable+ 64bit+
--
65:00.0 VGA compatible controller: Intel Corporation Device 56a0 (rev 08) (prog-if 00 [VGA controller])
Subsystem: Device 1ef7:1334
Flags: bus master, fast devsel, latency 0, IRQ 239, NUMA node 0
Memory at c6000000 (64-bit, non-prefetchable) [size=16M]
Memory at 9f800000000 (64-bit, prefetchable) [size=16G]
Expansion ROM at c7000000 [disabled] [size=2M]
Capabilities: [40] Vendor Specific Information: Len=0c <?>
Capabilities: [70] Express Endpoint, MSI 00
Capabilities: [ac] MSI: Enable+ Count=1/1 Maskable+ 64bit+
--
9b:00.0 VGA compatible controller: Intel Corporation Device 56a0 (rev 08) (prog-if 00 [VGA controller])
Subsystem: Device 1ef7:1334
Flags: bus master, fast devsel, latency 0, IRQ 243, NUMA node 1
Memory at d8000000 (64-bit, non-prefetchable) [size=16M]
Memory at cf800000000 (64-bit, prefetchable) [size=16G]
Expansion ROM at d9000000 [disabled] [size=2M]
Capabilities: [40] Vendor Specific Information: Len=0c <?>
Capabilities: [70] Express Endpoint, MSI 00
Capabilities: [ac] MSI: Enable+ Count=1/1 Maskable+ 64bit+
--
ad:00.0 VGA compatible controller: Intel Corporation Device 56a0 (rev 08) (prog-if 00 [VGA controller])
Subsystem: Device 1ef7:1334
Flags: bus master, fast devsel, latency 0, IRQ 247, NUMA node 1
Memory at e0000000 (64-bit, non-prefetchable) [size=16M]
Memory at df800000000 (64-bit, prefetchable) [size=16G]
Expansion ROM at e1000000 [disabled] [size=2M]
Capabilities: [40] Vendor Specific Information: Len=0c <?>
Capabilities: [70] Express Endpoint, MSI 00
Capabilities: [ac] MSI: Enable+ Count=1/1 Maskable+ 64bit+
--
d1:00.0 VGA compatible controller: Intel Corporation Device 56a0 (rev 08) (prog-if 00 [VGA controller])
Subsystem: Device 1ef7:1334
Flags: bus master, fast devsel, latency 0, IRQ 251, NUMA node 1
Memory at f1000000 (64-bit, non-prefetchable) [size=16M]
Memory at ff800000000 (64-bit, prefetchable) [size=16G]
Expansion ROM at f2000000 [disabled] [size=2M]
Capabilities: [40] Vendor Specific Information: Len=0c <?>
Capabilities: [70] Express Endpoint, MSI 00
Capabilities: [ac] MSI: Enable+ Count=1/1 Maskable+ 64bit+
--
e3:00.0 VGA compatible controller: Intel Corporation Device 56a0 (rev 08) (prog-if 00 [VGA controller])
Subsystem: Device 1ef7:1334
Flags: bus master, fast devsel, latency 0, IRQ 255, NUMA node 1
Memory at f9000000 (64-bit, non-prefetchable) [size=16M]
Memory at 10f800000000 (64-bit, prefetchable) [size=16G]
Expansion ROM at fa000000 [disabled] [size=2M]
Capabilities: [40] Vendor Specific Information: Len=0c <?>
Capabilities: [70] Express Endpoint, MSI 00
Capabilities: [ac] MSI: Enable+ Count=1/1 Maskable+ 64bit+
-----------------------------------------------------------------
Additional context
INFO 05-14 02:49:05 __init__.py:180] Automatically detected platform xpu.
WARNING 05-14 02:49:06 api_server.py:538] Torch Profiler is enabled in the API server. This should ONLY be used for local development!
WARNING 05-14 02:49:06 api_server.py:893] Warning: Please use `ipex_llm.vllm.xpu.entrypoints.openai.api_server` instead of `vllm.entrypoints.openai.api_server` to start the API server
INFO 05-14 02:49:06 api_server.py:837] vLLM API server version 0.6.6+ipexllm
INFO 05-14 02:49:06 api_server.py:838] args: Namespace(host=None, port=80, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='/data/Qwen/Qwen2.5-Coder-7B-Instruct', task='auto', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='float16', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=8192, guided_decoding_backend='xgrammar', logits_processor_pattern=None, distributed_executor_backend='ray', worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=4, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=8, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.95, num_gpu_blocks_override=None, max_num_batched_tokens=8192, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=True, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='xpu', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=['Qwen/Qwen2.5-Coder-7B-Instruct'], qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=True, scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', generation_config=None, low_bit_model_path=None, low_bit_save_path=None, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, load_in_low_bit='fp8')
INFO 05-14 02:49:06 api_server.py:197] Started engine process with PID 110
WARNING 05-14 02:49:06 config.py:2289] Casting torch.bfloat16 to torch.float16.
INFO 05-14 02:49:09 __init__.py:180] Automatically detected platform xpu.
WARNING 05-14 02:49:11 api_server.py:538] Torch Profiler is enabled in the API server. This should ONLY be used for local development!
WARNING 05-14 02:49:11 config.py:2289] Casting torch.bfloat16 to torch.float16.
INFO 05-14 02:49:11 config.py:521] This model supports multiple tasks: {'embed', 'generate', 'reward', 'classify', 'score'}. Defaulting to 'generate'.
INFO 05-14 02:49:15 config.py:521] This model supports multiple tasks: {'generate', 'classify', 'reward', 'score', 'embed'}. Defaulting to 'generate'.
WARNING 05-14 02:49:15 ray_utils.py:239] No existing RAY instance detected. A new instance will be launched with current node resources.
2025-05-14 02:49:17,922 INFO worker.py:1841 -- Started a local Ray instance.
INFO 05-14 02:49:19 llm_engine.py:234] Initializing an LLM engine (v0.6.6+ipexllm) with config: model='/data/Qwen/Qwen2.5-Coder-7B-Instruct', speculative_config=None, tokenizer='/data/Qwen/Qwen2.5-Coder-7B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=4, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=xpu, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=Qwen/Qwen2.5-Coder-7B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=False, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"candidate_compile_sizes":[],"compile_sizes":[],"capture_sizes":[],"max_capture_size":0}, use_cached_outputs=True,
INFO 05-14 02:49:19 ray_gpu_executor.py:123] use_ray_spmd_worker: False
(WrapperWithLoadBit pid=652) INFO 05-14 02:49:22 __init__.py:180] Automatically detected platform xpu.
INFO 05-14 02:49:24 xpu.py:27] Cannot use _Backend.FLASH_ATTN backend on XPU.
INFO 05-14 02:49:24 selector.py:155] Using IPEX attention backend.
WARNING 05-14 02:49:24 _ipex_ops.py:12] Import error msg: No module named 'intel_extension_for_pytorch'
INFO 05-14 02:49:24 importing.py:14] Triton not installed or not compatible; certain GPU-related functions will not be available.
(WrapperWithLoadBit pid=665) INFO 05-14 02:49:24 xpu.py:27] Cannot use _Backend.FLASH_ATTN backend on XPU.
(WrapperWithLoadBit pid=665) INFO 05-14 02:49:24 selector.py:155] Using IPEX attention backend.
(WrapperWithLoadBit pid=665) WARNING 05-14 02:49:24 _ipex_ops.py:12] Import error msg: No module named 'intel_extension_for_pytorch'
(WrapperWithLoadBit pid=665) INFO 05-14 02:49:24 importing.py:14] Triton not installed or not compatible; certain GPU-related functions will not be available.
INFO 05-14 02:49:24 shm_broadcast.py:255] vLLM message queue communication handle: Handle(connect_ip='127.0.0.1', local_reader_ranks=[1, 2, 3], buffer_handle=(3, 4194304, 6, 'psm_2fc713d0'), local_subscribe_port=43547, remote_subscribe_port=None)
Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 25% Completed | 1/4 [00:00<00:00, 9.06it/s]
Loading safetensors checkpoint shards: 50% Completed | 2/4 [00:00<00:00, 7.28it/s]
Loading safetensors checkpoint shards: 75% Completed | 3/4 [00:00<00:00, 7.10it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:00<00:00, 8.41it/s]
2025-05-14 02:49:25,171 - INFO - Converting the current model to fp8_e5m2 format......
2025-05-14 02:49:25,172 - INFO - Only HuggingFace Transformers models are currently supported for further optimizations
2025-05-14 02:49:26,786 - INFO - Only HuggingFace Transformers models are currently supported for further optimizations
2025-05-14 02:49:27,667 - INFO - Loading model weights took 2.0819 GB
(WrapperWithLoadBit pid=647) 2025-05-14 02:49:29,204 - INFO - Converting the current model to fp8_e5m2 format......
(WrapperWithLoadBit pid=647) 2025-05-14 02:49:29,204 - INFO - Only HuggingFace Transformers models are currently supported for further optimizations
(WrapperWithLoadBit pid=652) 2025-05-14 02:49:29,819 - INFO - Converting the current model to fp8_e5m2 format...... [repeated 2x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
(WrapperWithLoadBit pid=647) 2025-05-14 02:49:34,226 - INFO - Only HuggingFace Transformers models are currently supported for further optimizations [repeated 3x across cluster]
(WrapperWithLoadBit pid=647) 2025-05-14 02:49:35,710 - INFO - Loading model weights took 2.0819 GB
2025:05:14-02:49:37:( 110) |CCL_WARN| value of CCL_WORKER_COUNT changed to be 2 (default:1)
2025:05:14-02:49:37:( 110) |CCL_WARN| value of CCL_ATL_TRANSPORT changed to be ofi (default:mpi)
2025:05:14-02:49:37:( 110) |CCL_WARN| value of CCL_ATL_SHM changed to be 1 (default:0)
2025:05:14-02:49:37:( 110) |CCL_WARN| value of CCL_LOCAL_RANK changed to be 0 (default:-1)
2025:05:14-02:49:37:( 110) |CCL_WARN| value of CCL_LOCAL_SIZE changed to be 4 (default:-1)
2025:05:14-02:49:37:( 110) |CCL_WARN| value of CCL_PROCESS_LAUNCHER changed to be none (default:hydra)
2025:05:14-02:49:37:( 110) |CCL_WARN| value of CCL_ZE_IPC_EXCHANGE changed to be sockets (default:pidfd)
(WrapperWithLoadBit pid=652) *** SIGSEGV received at time=1747162178 on cpu 18 ***
(WrapperWithLoadBit pid=652) PC: @ 0x7c17c042205e (unknown) smr_map_to_endpoint
(WrapperWithLoadBit pid=652) @ 0x7c3db68a2733 (unknown) (unknown)
(WrapperWithLoadBit pid=652) [2025-05-14 02:49:38,328 E 652 652] logging.cc:484: *** SIGSEGV received at time=1747162178 on cpu 18 ***
(WrapperWithLoadBit pid=652) [2025-05-14 02:49:38,328 E 652 652] logging.cc:484: PC: @ 0x7c17c042205e (unknown) smr_map_to_endpoint
(WrapperWithLoadBit pid=652) [2025-05-14 02:49:38,328 E 652 652] logging.cc:484: @ 0x7c3db68a2733 (unknown) (unknown)
(WrapperWithLoadBit pid=652) Fatal Python error: Segmentation fault
(WrapperWithLoadBit pid=652)
(WrapperWithLoadBit pid=652) Stack (most recent call first):
(WrapperWithLoadBit pid=652) File "/usr/local/lib/python3.11/dist-packages/torch/distributed/distributed_c10d.py", line 2806 in all_reduce
(WrapperWithLoadBit pid=652) File "/usr/local/lib/python3.11/dist-packages/torch/distributed/c10d_logger.py", line 81 in wrapper
(WrapperWithLoadBit pid=652) File "/usr/local/lib/python3.11/dist-packages/vllm/distributed/device_communicators/xpu_communicator.py", line 19 in all_reduce
(WrapperWithLoadBit pid=652) File "/usr/local/lib/python3.11/dist-packages/vllm/distributed/parallel_state.py", line 345 in all_reduce
(WrapperWithLoadBit pid=652) File "/usr/local/lib/python3.11/dist-packages/vllm/distributed/communication_op.py", line 11 in tensor_model_parallel_all_reduce
(WrapperWithLoadBit pid=652) File "/usr/local/lib/python3.11/dist-packages/vllm/model_executor/layers/vocab_parallel_embedding.py", line 419 in forward
(WrapperWithLoadBit pid=652) File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1750 in _call_impl
(WrapperWithLoadBit pid=652) File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1739 in _wrapped_call_impl
(WrapperWithLoadBit pid=652) File "/usr/local/lib/python3.11/dist-packages/vllm/model_executor/models/qwen2.py", line 317 in get_input_embeddings
(WrapperWithLoadBit pid=652) File "/usr/local/lib/python3.11/dist-packages/vllm/model_executor/models/qwen2.py", line 332 in forward
(WrapperWithLoadBit pid=652) File "/usr/local/lib/python3.11/dist-packages/vllm/compilation/decorators.py", line 168 in __call__
(WrapperWithLoadBit pid=652) File "/usr/local/lib/python3.11/dist-packages/vllm/model_executor/models/qwen2.py", line 477 in forward
(WrapperWithLoadBit pid=652) File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1750 in _call_impl
(WrapperWithLoadBit pid=652) File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1739 in _wrapped_call_impl
(WrapperWithLoadBit pid=652) File "/usr/local/lib/python3.11/dist-packages/vllm/worker/xpu_model_runner.py", line 949 in execute_model
(WrapperWithLoadBit pid=652) File "/usr/local/lib/python3.11/dist-packages/torch/utils/_contextlib.py", line 116 in decorate_context
(WrapperWithLoadBit pid=652) File "/usr/local/lib/python3.11/dist-packages/vllm/worker/xpu_model_runner.py", line 839 in profile_run
(WrapperWithLoadBit pid=652) File "/usr/local/lib/python3.11/dist-packages/torch/utils/_contextlib.py", line 116 in decorate_context
(WrapperWithLoadBit pid=652) File "/usr/local/lib/python3.11/dist-packages/vllm/worker/xpu_worker.py", line 106 in determine_num_available_blocks
(WrapperWithLoadBit pid=652) File "/usr/local/lib/python3.11/dist-packages/torch/utils/_contextlib.py", line 116 in decorate_context
(WrapperWithLoadBit pid=652) File "/usr/local/lib/python3.11/dist-packages/vllm/worker/worker_base.py", line 461 in execute_method
(WrapperWithLoadBit pid=652) File "/usr/local/lib/python3.11/dist-packages/ray/util/tracing/tracing_helper.py", line 463 in _resume_span
(WrapperWithLoadBit pid=652) File "/usr/local/lib/python3.11/dist-packages/ray/_private/function_manager.py", line 696 in actor_method_executor
(WrapperWithLoadBit pid=652) File "/usr/local/lib/python3.11/dist-packages/ray/_private/worker.py", line 935 in main_loop
(WrapperWithLoadBit pid=652) File "/usr/local/lib/python3.11/dist-packages/ray/_private/workers/default_worker.py", line 297 in <module>
(WrapperWithLoadBit pid=652)
(WrapperWithLoadBit pid=652) Extension modules: msgpack._cmsgpack, google._upb._message, psutil._psutil_linux, psutil._psutil_posix, setproctitle, yaml._yaml, charset_normalizer.md, requests.packages.charset_normalizer.md, requests.packages.chardet.md, uvloop.loop, ray._raylet, markupsafe._speedups, numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, PIL._imaging, torch._C, torch._C._dynamo.autograd_compiler, torch._C._dynamo.eval_frame, torch._C._dynamo.guards, torch._C._dynamo.utils, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, PIL._imagingft, msgspec._core, sentencepiece._sentencepiece, regex._regex, zmq.backend.cython._zmq, multidict._multidict, yarl._quoting_c, propcache._helpers_c, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket.mask, aiohttp._websocket.reader_c, frozenlist._frozenlist, pyarrow.lib, pyarrow._json (total: 52)
(raylet) A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffff3c5176f98134daaa8c2456a501000000 Worker ID: b9be28cf125d015da221d7adadcb6a524901043d393dddb1ac1a3f38 Node ID: 13d4cdadeb61a0587c202b347d0135753701a7f7020c43436bbf524e Worker IP address: 172.19.0.2 Worker port: 39569 Worker PID: 652 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
(WrapperWithLoadBit pid=649) INFO 05-14 02:49:23 __init__.py:180] Automatically detected platform xpu. [repeated 3x across cluster]
(WrapperWithLoadBit pid=647) INFO 05-14 02:49:24 xpu.py:27] Cannot use _Backend.FLASH_ATTN backend on XPU. [repeated 2x across cluster]
(WrapperWithLoadBit pid=647) INFO 05-14 02:49:24 selector.py:155] Using IPEX attention backend. [repeated 2x across cluster]
(WrapperWithLoadBit pid=647) WARNING 05-14 02:49:24 _ipex_ops.py:12] Import error msg: No module named 'intel_extension_for_pytorch' [repeated 2x across cluster]
(WrapperWithLoadBit pid=647) INFO 05-14 02:49:24 importing.py:14] Triton not installed or not compatible; certain GPU-related functions will not be available. [repeated 2x across cluster]
Running with a single GPU works fine, but multi-GPU runs hit issues. This could be related to oneCCL. From the script you provided, it looks like you haven't sourced the oneCCL environment script (setvars.sh). Please refer to the documentation here:
https://github.com/intel/ipex-llm/blob/main/docs/mddocs/DockerGuides/vllm_docker_quickstart.md#start-the-vllm-service
You can try adding the following line to your ENTRYPOINT to source the required environment:
source /opt/intel/1ccl-wks/setvars.sh
Here’s an example entrypoint that includes it:
entrypoint: /bin/bash -c "source /opt/intel/1ccl-wks/setvars.sh &&
export CCL_WORKER_COUNT=2 &&
export SYCL_CACHE_PERSISTENT=1 &&
export FI_PROVIDER=shm &&
export CCL_ATL_TRANSPORT=ofi &&
export CCL_ZE_IPC_EXCHANGE=sockets &&
export CCL_ATL_SHM=1 &&
export USE_XETLA=OFF &&
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=2 &&
export TORCH_LLM_ALLREDUCE=0 &&
export CCL_SAME_STREAM=1 &&
export CCL_BLOCKING_WAIT=0 &&
export ZE_AFFINITY_MASK=$GPU_AFFINITY &&
source /opt/intel/1ccl-wks/setvars.sh &&
python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server \
--served-model-name $LLM_MODEL_ID \
--model $LLM_MODEL_LOCAL_PATH \
--port 80 \
--trust-remote-code \
--block-size 8 \
--gpu-memory-utilization 0.95 \
--device xpu \
--dtype $DTYPE \
--enforce-eager \
--load-in-low-bit $QUANTIZATION \
--max-model-len $MAX_MODEL_LEN \
--max-num-batched-tokens $MAX_NUM_BATCHED_TOKENS \
--max-num-seqs $MAX_NUM_SEQS \
--tensor-parallel-size $TENSOR_PARALLEL_SIZE \
--disable-async-output-proc \
--distributed-executor-backend ray"
Let me know if this helps resolve the issue or if further assistance is needed.
Hi @liu-shaojun
I sourced the oneCCL environment script (source /opt/intel/1ccl-wks/setvars.sh) as suggested, but the error still exists.
Hi,
Could you please let us know whether your CPU is an Intel® Xeon® or Intel® Core™ processor?
In the meantime, we recommend trying our latest Docker image by running the following command:
docker pull intelanalytics/ipex-llm-serving-xpu:0.8.3-b19
For detailed steps and the latest configuration guidance, please refer to our documentation here: https://github.com/intel/ipex-llm/blob/main/docs/mddocs/DockerGuides/vllm_docker_quickstart.md
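A minimal sketch of switching the compose setup shown earlier in this issue to that image:

# Pull the newer image, point the compose service at it, then recreate the container.
docker pull intelanalytics/ipex-llm-serving-xpu:0.8.3-b19
# In the compose file, change the image line to:
#   image: intelanalytics/ipex-llm-serving-xpu:0.8.3-b19
docker compose down
docker compose up -d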
Here is the CPU information. I am also testing the image intelanalytics/ipex-llm-serving-xpu:0.8.3-b19.
ssp@arc003:/mnt/home/ssp/opea-installer$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 52 bits physical, 57 bits virtual
Byte Order: Little Endian
CPU(s): 48
On-line CPU(s) list: 0-47
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) Silver 4410Y
CPU family: 6
Model: 143
Thread(s) per core: 2
Core(s) per socket: 12
Socket(s): 2
Stepping: 8
CPU max MHz: 3900.0000
CPU min MHz: 800.0000
BogoMIPS: 4000.00
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cat_l2 cdp_l3 intel_ppin cdp_l2 ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect user_shstk avx_vnni avx512_bf16 wbnoinvd dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req hfi vnmi avx512vbmi umip pku ospke waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq la57 rdpid bus_lock_detect cldemote movdiri movdir64b enqcmd fsrm md_clear serialize tsxldtrk pconfig arch_lbr ibt amx_bf16 avx512_fp16 amx_tile amx_int8 flush_l1d arch_capabilities
Virtualization features:
Virtualization: VT-x
Caches (sum of all):
L1d: 1.1 MiB (24 instances)
L1i: 768 KiB (24 instances)
L2: 48 MiB (24 instances)
L3: 60 MiB (2 instances)
NUMA:
NUMA node(s): 2
NUMA node0 CPU(s): 0-11,24-35
NUMA node1 CPU(s): 12-23,36-47
Vulnerabilities:
Gather data sampling: Not affected
Itlb multihit: Not affected
L1tf: Not affected
Mds: Not affected
Meltdown: Not affected
Mmio stale data: Not affected
Reg file data sampling: Not affected
Retbleed: Not affected
Spec rstack overflow: Not affected
Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Spectre v2: Mitigation; Enhanced / Automatic IBRS; IBPB conditional; RSB filling; PBRSB-eIBRS SW sequence; BHI BHI_DIS_S
Srbds: Not affected
Tsx async abort: Not affected
Hi @liu-shaojun, I am now using the image intelanalytics/ipex-llm-serving-xpu:0.8.3-b19, but the error still exists.
[W516 10:39:51.099061857 OperatorEntry.cpp:154] Warning: Warning only once for all operators, other operators may also be overridden.
Overriding a previously registered kernel for the same operator and the same dispatch key
operator: aten::_validate_compressed_sparse_indices(bool is_crow, Tensor compressed_idx, Tensor plain_idx, int cdim, int dim, int nnz) -> ()
registered at /pytorch/build/aten/src/ATen/RegisterSchema.cpp:6
dispatch key: XPU
previous kernel: registered at /pytorch/build/aten/src/ATen/RegisterCPU.cpp:30477
new kernel: registered at /build/intel-pytorch-extension/build/Release/csrc/gpu/csrc/aten/generated/ATen/RegisterXPU.cpp:468 (function operator())
INFO 05-16 10:39:54 [__init__.py:239] Automatically detected platform xpu.
[W516 10:39:54.366273295 OperatorEntry.cpp:154] Warning: Warning only once for all operators, other operators may also be overridden.
Overriding a previously registered kernel for the same operator and the same dispatch key
operator: aten::_validate_compressed_sparse_indices(bool is_crow, Tensor compressed_idx, Tensor plain_idx, int cdim, int dim, int nnz) -> ()
registered at /pytorch/build/aten/src/ATen/RegisterSchema.cpp:6
dispatch key: XPU
previous kernel: registered at /pytorch/build/aten/src/ATen/RegisterCPU.cpp:30477
new kernel: registered at /build/intel-pytorch-extension/build/Release/csrc/gpu/csrc/aten/generated/ATen/RegisterXPU.cpp:468 (function operator())
WARNING 05-16 10:39:55 [_logger.py:68] Torch Profiler is enabled in the API server. This should ONLY be used for local development!
WARNING 05-16 10:39:55 [_logger.py:68] Warning: Please use `ipex_llm.vllm.xpu.entrypoints.openai.api_server` instead of `vllm.entrypoints.openai.api_server` to start the API server
INFO 05-16 10:39:55 [api_server.py:1080] vLLM API server version 0.8.3+ipexllm
INFO 05-16 10:39:55 [api_server.py:1081] args: Namespace(host=None, port=80, uvicorn_log_level='info', disable_uvicorn_access_log=False, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, enable_ssl_refresh=False, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='/data/Qwen/Qwen2.5-Coder-7B-Instruct', task='auto', tokenizer=None, hf_config_path=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='float16', kv_cache_dtype='auto', max_model_len=8192, guided_decoding_backend='xgrammar', logits_processor_pattern=None, model_impl='auto', distributed_executor_backend='ray', pipeline_parallel_size=1, tensor_parallel_size=4, data_parallel_size=1, enable_expert_parallel=False, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=8, enable_prefix_caching=None, prefix_caching_hash_algo='builtin', disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=None, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.95, num_gpu_blocks_override=None, max_num_batched_tokens=8192, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=True, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='xpu', num_scheduler_steps=1, use_tqdm_on_load=True, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_config=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=['Qwen/Qwen2.5-Coder-7B-Instruct'], qlora_adapter_name_or_path=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=True, scheduling_policy='fcfs', scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', worker_extension_cls='', generation_config='auto', override_generation_config=None, enable_sleep_mode=False, calculate_kv_scales=False, additional_config=None, enable_reasoning=False, reasoning_parser=None, low_bit_model_path=None, low_bit_save_path=None, disable_cascade_attn=False, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, enable_server_load_tracking=False, load_in_low_bit='fp8')
WARNING 05-16 10:39:55 [_logger.py:68] Casting torch.bfloat16 to torch.float16.
INFO 05-16 10:40:00 [config.py:604] This model supports multiple tasks: {'generate', 'embed', 'score', 'classify', 'reward'}. Defaulting to 'generate'.
WARNING 05-16 10:40:00 [_logger.py:68] --disable-async-output-proc is not supported by the V1 Engine. Falling back to V0. We recommend to remove --disable-async-output-proc from your config in favor of the V1 Engine.
INFO 05-16 10:40:00 [config.py:1639] Disabled the custom all-reduce kernel because it is not supported on current platform.
INFO 05-16 10:40:00 [api_server.py:249] Started engine process with PID 301
[W516 10:40:03.027198210 OperatorEntry.cpp:154] Warning: Warning only once for all operators, other operators may also be overridden.
Overriding a previously registered kernel for the same operator and the same dispatch key
operator: aten::_validate_compressed_sparse_indices(bool is_crow, Tensor compressed_idx, Tensor plain_idx, int cdim, int dim, int nnz) -> ()
registered at /pytorch/build/aten/src/ATen/RegisterSchema.cpp:6
dispatch key: XPU
previous kernel: registered at /pytorch/build/aten/src/ATen/RegisterCPU.cpp:30477
new kernel: registered at /build/intel-pytorch-extension/build/Release/csrc/gpu/csrc/aten/generated/ATen/RegisterXPU.cpp:468 (function operator())
INFO 05-16 10:40:05 [__init__.py:239] Automatically detected platform xpu.
WARNING 05-16 10:40:06 [_logger.py:68] Torch Profiler is enabled in the API server. This should ONLY be used for local development!
INFO 05-16 10:40:06 [importing.py:16] Triton not installed or not compatible; certain GPU-related functions will not be available.
INFO 05-16 10:40:06 [llm_engine.py:242] Initializing a V0 LLM engine (v0.8.3+ipexllm) with config: model='/data/Qwen/Qwen2.5-Coder-7B-Instruct', speculative_config=None, tokenizer='/data/Qwen/Qwen2.5-Coder-7B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=4, pipeline_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=xpu, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=Qwen/Qwen2.5-Coder-7B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=False, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[],"max_capture_size":0}, use_cached_outputs=True,
WARNING 05-16 10:40:06 [_logger.py:68] No existing RAY instance detected. A new instance will be launched with current node resources.
2025-05-16 10:40:07,652 INFO worker.py:1888 -- Started a local Ray instance.
INFO 05-16 10:40:08 [ray_utils.py:339] No current placement group found. Creating a new placement group.
INFO 05-16 10:40:08 [ray_distributed_executor.py:178] use_ray_spmd_worker: False
(pid=695) [W516 10:40:10.462430700 OperatorEntry.cpp:154] Warning: Warning only once for all operators, other operators may also be overridden.
(pid=695) Overriding a previously registered kernel for the same operator and the same dispatch key
(pid=695) operator: aten::_validate_compressed_sparse_indices(bool is_crow, Tensor compressed_idx, Tensor plain_idx, int cdim, int dim, int nnz) -> ()
(pid=695) registered at /pytorch/build/aten/src/ATen/RegisterSchema.cpp:6
(pid=695) dispatch key: XPU
(pid=695) previous kernel: registered at /pytorch/build/aten/src/ATen/RegisterCPU.cpp:30477
(pid=695) new kernel: registered at /build/intel-pytorch-extension/build/Release/csrc/gpu/csrc/aten/generated/ATen/RegisterXPU.cpp:468 (function operator())
(pid=699) INFO 05-16 10:40:13 [__init__.py:239] Automatically detected platform xpu.
(WrapperWithLoadBit pid=699) INFO 05-16 10:40:14 [importing.py:16] Triton not installed or not compatible; certain GPU-related functions will not be available.
INFO 05-16 10:40:14 [ray_distributed_executor.py:354] non_carry_over_env_vars from config: set()
INFO 05-16 10:40:14 [ray_distributed_executor.py:356] Copying the following environment variables to workers: ['LD_LIBRARY_PATH', 'VLLM_WORKER_MULTIPROC_METHOD', 'VLLM_RPC_TIMEOUT', 'VLLM_TORCH_PROFILER_DIR', 'VLLM_USE_V1']
INFO 05-16 10:40:14 [ray_distributed_executor.py:359] If certain env vars should NOT be copied to workers, add them to /root/.config/vllm/ray_non_carry_over_env_vars.json file
INFO 05-16 10:40:15 [xpu.py:39] Cannot use None backend on XPU.
INFO 05-16 10:40:15 [xpu.py:45] Using IPEX attention backend.
(WrapperWithLoadBit pid=697) INFO 05-16 10:40:15 [xpu.py:39] Cannot use None backend on XPU.
(WrapperWithLoadBit pid=697) INFO 05-16 10:40:15 [xpu.py:45] Using IPEX attention backend.
INFO 05-16 10:40:15 [shm_broadcast.py:264] vLLM message queue communication handle: Handle(local_reader_ranks=[1, 2, 3], buffer_handle=(3, 4194304, 6, 'psm_f6fee00a'), local_subscribe_addr='ipc:///tmp/9635189e-b733-445c-abbf-ecb6a056cceb', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 05-16 10:40:15 [parallel_state.py:957] rank 0 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 05-16 10:40:15 [config.py:3339] cudagraph sizes specified by model runner [] is overridden by config []
Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00<?, ?it/s]
(WrapperWithLoadBit pid=697) INFO 05-16 10:40:15 [parallel_state.py:957] rank 2 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 2
(WrapperWithLoadBit pid=697) INFO 05-16 10:40:15 [config.py:3339] cudagraph sizes specified by model runner [] is overridden by config []
Loading safetensors checkpoint shards: 25% Completed | 1/4 [00:00<00:00, 9.14it/s]
Loading safetensors checkpoint shards: 50% Completed | 2/4 [00:00<00:00, 7.18it/s]
Loading safetensors checkpoint shards: 75% Completed | 3/4 [00:00<00:00, 7.32it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:00<00:00, 8.81it/s]
INFO 05-16 10:40:15 [loader.py:447] Loading weights took 0.46 seconds
2025-05-16 10:40:15,946 - ipex_llm.transformers.utils - INFO - Converting the current model to fp8_e5m2 format......
2025-05-16 10:40:15,946 - ipex_llm.transformers.utils - INFO - Only HuggingFace Transformers models are currently supported for further optimizations
2025-05-16 10:40:17,866 - ipex_llm.transformers.utils - INFO - Only HuggingFace Transformers models are currently supported for further optimizations
2025-05-16 10:40:18,603 - ipex_llm.vllm.xpu.model_convert - INFO - Loading model weights took 2.0819 GB
(WrapperWithLoadBit pid=695) INFO 05-16 10:40:18 [loader.py:447] Loading weights took 3.62 seconds
(pid=693) INFO 05-16 10:40:13 [__init__.py:239] Automatically detected platform xpu. [repeated 3x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
(WrapperWithLoadBit pid=697) 2025-05-16 10:40:20,484 - ipex_llm.transformers.utils - INFO - Converting the current model to fp8_e5m2 format......
(WrapperWithLoadBit pid=697) 2025-05-16 10:40:20,485 - ipex_llm.transformers.utils - INFO - Only HuggingFace Transformers models are currently supported for further optimizations
(pid=693) [W516 10:40:10.718158193 OperatorEntry.cpp:154] Warning: Warning only once for all operators, other operators may also be overridden. [repeated 3x across cluster]
(pid=693) Overriding a previously registered kernel for the same operator and the same dispatch key [repeated 3x across cluster]
(pid=693) operator: aten::_validate_compressed_sparse_indices(bool is_crow, Tensor compressed_idx, Tensor plain_idx, int cdim, int dim, int nnz) -> () [repeated 3x across cluster]
(pid=693) registered at /pytorch/build/aten/src/ATen/RegisterSchema.cpp:6 [repeated 3x across cluster]
(pid=693) dispatch key: XPU [repeated 3x across cluster]
(pid=693) previous kernel: registered at /pytorch/build/aten/src/ATen/RegisterCPU.cpp:30477 [repeated 3x across cluster]
(pid=693) new kernel: registered at /build/intel-pytorch-extension/build/Release/csrc/gpu/csrc/aten/generated/ATen/RegisterXPU.cpp:468 (function operator()) [repeated 3x across cluster]
(WrapperWithLoadBit pid=699) 2025-05-16 10:40:20,615 - ipex_llm.transformers.utils - INFO - Converting the current model to fp8_e5m2 format...... [repeated 2x across cluster]
(WrapperWithLoadBit pid=697) 2025-05-16 10:40:25,538 - ipex_llm.transformers.utils - INFO - Only HuggingFace Transformers models are currently supported for further optimizations [repeated 3x across cluster]
(WrapperWithLoadBit pid=697) 2025-05-16 10:40:26,686 - ipex_llm.vllm.xpu.model_convert - INFO - Loading model weights took 2.0819 GB
2025:05:16-10:40:27:( 301) |CCL_WARN| value of CCL_WORKER_COUNT changed to be 2 (default:1)
2025:05:16-10:40:27:( 301) |CCL_WARN| value of CCL_ATL_TRANSPORT changed to be ofi (default:mpi)
2025:05:16-10:40:27:( 301) |CCL_WARN| value of CCL_ATL_SHM changed to be 1 (default:0)
2025:05:16-10:40:27:( 301) |CCL_WARN| value of CCL_DG2_ALLREDUCE changed to be 1 (default:0)
2025:05:16-10:40:27:( 301) |CCL_WARN| value of CCL_LOCAL_RANK changed to be 0 (default:-1)
2025:05:16-10:40:27:( 301) |CCL_WARN| value of CCL_LOCAL_SIZE changed to be 4 (default:-1)
2025:05:16-10:40:27:( 301) |CCL_WARN| value of CCL_PROCESS_LAUNCHER changed to be none (default:hydra)
2025:05:16-10:40:27:( 301) |CCL_WARN| value of CCL_ZE_IPC_EXCHANGE changed to be sockets (default:pidfd)
*** SIGSEGV received at time=1747363228 on cpu 17 ***
PC: @ 0x7c6193a2205e (unknown) smr_map_to_endpoint
@ 0x7c62c5c75520 (unknown) (unknown)
[2025-05-16 10:40:28,988 E 301 301] logging.cc:496: *** SIGSEGV received at time=1747363228 on cpu 17 ***
[2025-05-16 10:40:28,989 E 301 301] logging.cc:496: PC: @ 0x7c6193a2205e (unknown) smr_map_to_endpoint
[2025-05-16 10:40:28,989 E 301 301] logging.cc:496: @ 0x7c62c5c75520 (unknown) (unknown)
Fatal Python error: Segmentation fault
Stack (most recent call first):
File "/usr/local/lib/python3.11/dist-packages/torch/distributed/distributed_c10d.py", line 2806 in all_reduce
File "/usr/local/lib/python3.11/dist-packages/torch/distributed/c10d_logger.py", line 81 in wrapper
File "/usr/local/lib/python3.11/dist-packages/vllm-0.8.3+ipexllm.xpu-py3.11-linux-x86_64.egg/vllm/distributed/device_communicators/xpu_communicator.py", line 22 in all_reduce
File "/usr/local/lib/python3.11/dist-packages/vllm-0.8.3+ipexllm.xpu-py3.11-linux-x86_64.egg/vllm/distributed/parallel_state.py", line 316 in _all_reduce_out_place
File "/usr/local/lib/python3.11/dist-packages/vllm-0.8.3+ipexllm.xpu-py3.11-linux-x86_64.egg/vllm/distributed/parallel_state.py", line 313 in all_reduce
File "/usr/local/lib/python3.11/dist-packages/vllm-0.8.3+ipexllm.xpu-py3.11-linux-x86_64.egg/vllm/distributed/communication_op.py", line 13 in tensor_model_parallel_all_reduce
File "/usr/local/lib/python3.11/dist-packages/vllm-0.8.3+ipexllm.xpu-py3.11-linux-x86_64.egg/vllm/model_executor/layers/vocab_parallel_embedding.py", line 421 in forward
File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1750 in _call_impl
File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1739 in _wrapped_call_impl
File "/usr/local/lib/python3.11/dist-packages/vllm-0.8.3+ipexllm.xpu-py3.11-linux-x86_64.egg/vllm/model_executor/models/qwen2.py", line 324 in get_input_embeddings
File "/usr/local/lib/python3.11/dist-packages/vllm-0.8.3+ipexllm.xpu-py3.11-linux-x86_64.egg/vllm/model_executor/models/qwen2.py", line 337 in forward
File "/usr/local/lib/python3.11/dist-packages/vllm-0.8.3+ipexllm.xpu-py3.11-linux-x86_64.egg/vllm/compilation/decorators.py", line 172 in __call__
File "/usr/local/lib/python3.11/dist-packages/vllm-0.8.3+ipexllm.xpu-py3.11-linux-x86_64.egg/vllm/model_executor/models/qwen2.py", line 468 in forward
File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1750 in _call_impl
File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1739 in _wrapped_call_impl
File "/usr/local/lib/python3.11/dist-packages/vllm-0.8.3+ipexllm.xpu-py3.11-linux-x86_64.egg/vllm/worker/xpu_model_runner.py", line 963 in execute_model
File "/usr/local/lib/python3.11/dist-packages/torch/utils/_contextlib.py", line 116 in decorate_context
File "/usr/local/lib/python3.11/dist-packages/vllm-0.8.3+ipexllm.xpu-py3.11-linux-x86_64.egg/vllm/worker/xpu_model_runner.py", line 851 in profile_run
File "/usr/local/lib/python3.11/dist-packages/torch/utils/_contextlib.py", line 116 in decorate_context
File "/usr/local/lib/python3.11/dist-packages/vllm-0.8.3+ipexllm.xpu-py3.11-linux-x86_64.egg/vllm/worker/xpu_worker.py", line 113 in determine_num_available_blocks
File "/usr/local/lib/python3.11/dist-packages/torch/utils/_contextlib.py", line 116 in decorate_context
File "/usr/local/lib/python3.11/dist-packages/vllm-0.8.3+ipexllm.xpu-py3.11-linux-x86_64.egg/vllm/utils.py", line 2349 in run_method
File "/usr/local/lib/python3.11/dist-packages/vllm-0.8.3+ipexllm.xpu-py3.11-linux-x86_64.egg/vllm/worker/worker_base.py", line 612 in execute_method
File "/usr/local/lib/python3.11/dist-packages/vllm-0.8.3+ipexllm.xpu-py3.11-linux-x86_64.egg/vllm/executor/ray_distributed_executor.py", line 519 in _run_workers
File "/usr/local/lib/python3.11/dist-packages/vllm-0.8.3+ipexllm.xpu-py3.11-linux-x86_64.egg/vllm/executor/executor_base.py", line 331 in collective_rpc
File "/usr/local/lib/python3.11/dist-packages/vllm-0.8.3+ipexllm.xpu-py3.11-linux-x86_64.egg/vllm/executor/executor_base.py", line 103 in determine_num_available_blocks
File "/usr/local/lib/python3.11/dist-packages/vllm-0.8.3+ipexllm.xpu-py3.11-linux-x86_64.egg/vllm/engine/llm_engine.py", line 433 in _initialize_kv_caches
File "/usr/local/lib/python3.11/dist-packages/vllm-0.8.3+ipexllm.xpu-py3.11-linux-x86_64.egg/vllm/engine/llm_engine.py", line 284 in __init__
File "/usr/local/lib/python3.11/dist-packages/vllm-0.8.3+ipexllm.xpu-py3.11-linux-x86_64.egg/vllm/engine/multiprocessing/engine.py", line 82 in __init__
File "/usr/local/lib/python3.11/dist-packages/ipex_llm/vllm/xpu/engine/engine.py", line 329 in __init__
File "/usr/local/lib/python3.11/dist-packages/vllm-0.8.3+ipexllm.xpu-py3.11-linux-x86_64.egg/vllm/engine/multiprocessing/engine.py", line 128 in from_vllm_config
File "/usr/local/lib/python3.11/dist-packages/ipex_llm/vllm/xpu/engine/engine.py", line 348 in from_vllm_config
File "/usr/local/lib/python3.11/dist-packages/ipex_llm/vllm/xpu/engine/engine.py", line 371 in run_mp_engine
File "/usr/lib/python3.11/multiprocessing/process.py", line 108 in run
File "/usr/lib/python3.11/multiprocessing/process.py", line 314 in _bootstrap
File "/usr/lib/python3.11/multiprocessing/spawn.py", line 135 in _main
File "/usr/lib/python3.11/multiprocessing/spawn.py", line 122 in spawn_main
File "<string>", line 1 in <module>
Extension modules: charset_normalizer.md, requests.packages.charset_normalizer.md, requests.packages.chardet.md, yaml._yaml, markupsafe._speedups, PIL._imaging, numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._dynamo.autograd_compiler, torch._C._dynamo.eval_frame, torch._C._dynamo.guards, torch._C._dynamo.utils, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, torch.xpu.errno, torch.xpu.sys, ruamel.yaml.clib._ruamel_yaml, _ruamel_yaml, PIL._imagingft, scipy._lib._ccallback_c, scipy.linalg._fblas, scipy.linalg._flapack, scipy.linalg.cython_lapack, scipy.linalg._cythonized_array_utils, scipy.linalg._solve_toeplitz, scipy.linalg._decomp_lu_cython, scipy.linalg._matfuncs_sqrtm_triu, scipy.linalg._matfuncs_expm, scipy.linalg._linalg_pythran, scipy.linalg.cython_blas, scipy.linalg._decomp_update, scipy.sparse._sparsetools, _csparsetools, scipy.sparse._csparsetools, scipy.sparse.linalg._dsolve._superlu, scipy.sparse.linalg._eigen.arpack._arpack, scipy.sparse.linalg._propack._spropack, scipy.sparse.linalg._propack._dpropack, scipy.sparse.linalg._propack._cpropack, scipy.sparse.linalg._propack._zpropack, scipy.sparse.csgraph._tools, scipy.sparse.csgraph._shortest_path, scipy.sparse.csgraph._traversal, scipy.sparse.csgraph._min_spanning_tree, scipy.sparse.csgraph._flow, scipy.sparse.csgraph._matching, scipy.sparse.csgraph._reordering, scipy.optimize._group_columns, scipy._lib.messagestream, scipy.optimize._trlib._trlib, scipy.optimize._lbfgsb, _moduleTNC, scipy.optimize._moduleTNC, scipy.optimize._cobyla, scipy.optimize._slsqp, scipy.optimize._minpack, scipy.optimize._lsq.givens_elimination, scipy.optimize._zeros, scipy.optimize._cython_nnls, scipy._lib._uarray._uarray, scipy.special._ufuncs_cxx, scipy.special._ufuncs, scipy.special._specfun, scipy.special._comb, scipy.special._ellip_harm_2, scipy.linalg._decomp_interpolative, scipy.optimize._bglu_dense, scipy.optimize._lsap, scipy.spatial._ckdtree, scipy.spatial._qhull, scipy.spatial._voronoi, scipy.spatial._distance_wrap, scipy.spatial._hausdorff, scipy.spatial.transform._rotation, scipy.optimize._direct, uvloop.loop, msgspec._core, psutil._psutil_linux, psutil._psutil_posix, zmq.backend.cython._zmq, multidict._multidict, yarl._quoting_c, propcache._helpers_c, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket.mask, aiohttp._websocket.reader_c, msgpack._cmsgpack, google._upb._message, setproctitle, ray._raylet, sentencepiece._sentencepiece, numba.core.typeconv._typeconv, numba._helperlib, numba._dynfunc, numba._dispatcher, numba.core.typing.builtins.itertools, numba.cpython.builtins.math, numba.core.runtime._nrt_python, numba.np.ufunc._internal, numba.experimental.jitclass._box, regex._regex, pyarrow.lib, pyarrow._json (total: 120)
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/usr/local/lib/python3.11/dist-packages/ipex_llm/vllm/xpu/entrypoints/openai/api_server.py", line 1174, in <module>
uvloop.run(run_server(args))
File "/usr/local/lib/python3.11/dist-packages/uvloop/__init__.py", line 105, in run
return runner.run(wrapper())
^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
File "/usr/local/lib/python3.11/dist-packages/uvloop/__init__.py", line 61, in wrapper
return await main
^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/ipex_llm/vllm/xpu/entrypoints/openai/api_server.py", line 1115, in run_server
async with build_async_engine_client(args) as engine_client:
File "/usr/lib/python3.11/contextlib.py", line 210, in __aenter__
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/ipex_llm/vllm/xpu/entrypoints/openai/api_server.py", line 146, in build_async_engine_client
async with build_async_engine_client_from_engine_args(
File "/usr/lib/python3.11/contextlib.py", line 210, in __aenter__
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/ipex_llm/vllm/xpu/entrypoints/openai/api_server.py", line 272, in build_async_engine_client_from_engine_args
raise RuntimeError(
RuntimeError: Engine process failed to start. See stack trace for the root cause.
Hi,
Could you please try setting the following environment variable before running the container?
export CCL_DG2_USM=1
On some systems, P2P (peer-to-peer) transfers between GPUs might not be supported, and enabling this option switches oneCCL to USM (Unified Shared Memory) copies, which can help in such cases. (Note: this is typically needed on Core platforms, where P2P is often unavailable; Xeon platforms usually support P2P and don't require it, but it's worth a try if you're encountering issues.)
Let us know if this makes a difference.
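For reference, one way to wire this in with the compose file from the reproduction steps is to prepend "export CCL_DG2_USM=1 &&" to the existing entrypoint chain; alternatively, it could be added to the service's environment section so it is already set when the entrypoint runs. A minimal sketch of the latter (illustrative only, not verified on this setup):
environment:
  CCL_DG2_USM: "1"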
Hi @liu-shaojun, after setting export CCL_DG2_USM=1, an OOM error is always triggered.
[W516 12:13:34.894788221 OperatorEntry.cpp:154] Warning: Warning only once for all operators, other operators may also be overridden.
Overriding a previously registered kernel for the same operator and the same dispatch key
operator: aten::_validate_compressed_sparse_indices(bool is_crow, Tensor compressed_idx, Tensor plain_idx, int cdim, int dim, int nnz) -> ()
registered at /pytorch/build/aten/src/ATen/RegisterSchema.cpp:6
dispatch key: XPU
previous kernel: registered at /pytorch/build/aten/src/ATen/RegisterCPU.cpp:30477
new kernel: registered at /build/intel-pytorch-extension/build/Release/csrc/gpu/csrc/aten/generated/ATen/RegisterXPU.cpp:468 (function operator())
INFO 05-16 12:13:37 [__init__.py:239] Automatically detected platform xpu.
[W516 12:13:38.045409361 OperatorEntry.cpp:154] Warning: Warning only once for all operators, other operators may also be overridden.
Overriding a previously registered kernel for the same operator and the same dispatch key
operator: aten::_validate_compressed_sparse_indices(bool is_crow, Tensor compressed_idx, Tensor plain_idx, int cdim, int dim, int nnz) -> ()
registered at /pytorch/build/aten/src/ATen/RegisterSchema.cpp:6
dispatch key: XPU
previous kernel: registered at /pytorch/build/aten/src/ATen/RegisterCPU.cpp:30477
new kernel: registered at /build/intel-pytorch-extension/build/Release/csrc/gpu/csrc/aten/generated/ATen/RegisterXPU.cpp:468 (function operator())
WARNING 05-16 12:13:38 [_logger.py:68] Torch Profiler is enabled in the API server. This should ONLY be used for local development!
WARNING 05-16 12:13:38 [_logger.py:68] Warning: Please use `ipex_llm.vllm.xpu.entrypoints.openai.api_server` instead of `vllm.entrypoints.openai.api_server` to start the API server
INFO 05-16 12:13:38 [api_server.py:1080] vLLM API server version 0.8.3+ipexllm
INFO 05-16 12:13:38 [api_server.py:1081] args: Namespace(host=None, port=80, uvicorn_log_level='info', disable_uvicorn_access_log=False, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, enable_ssl_refresh=False, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='/data/Qwen/Qwen2.5-Coder-7B-Instruct', task='auto', tokenizer=None, hf_config_path=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='float16', kv_cache_dtype='auto', max_model_len=8192, guided_decoding_backend='xgrammar', logits_processor_pattern=None, model_impl='auto', distributed_executor_backend='ray', pipeline_parallel_size=1, tensor_parallel_size=4, data_parallel_size=1, enable_expert_parallel=False, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=8, enable_prefix_caching=None, prefix_caching_hash_algo='builtin', disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=None, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.95, num_gpu_blocks_override=None, max_num_batched_tokens=8192, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=True, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='xpu', num_scheduler_steps=1, use_tqdm_on_load=True, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_config=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=['Qwen/Qwen2.5-Coder-7B-Instruct'], qlora_adapter_name_or_path=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=True, scheduling_policy='fcfs', scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', worker_extension_cls='', generation_config='auto', override_generation_config=None, enable_sleep_mode=False, calculate_kv_scales=False, additional_config=None, enable_reasoning=False, reasoning_parser=None, low_bit_model_path=None, low_bit_save_path=None, disable_cascade_attn=False, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, enable_server_load_tracking=False, load_in_low_bit='fp8')
WARNING 05-16 12:13:38 [_logger.py:68] Casting torch.bfloat16 to torch.float16.
INFO 05-16 12:13:43 [config.py:604] This model supports multiple tasks: {'classify', 'embed', 'generate', 'reward', 'score'}. Defaulting to 'generate'.
WARNING 05-16 12:13:43 [_logger.py:68] --disable-async-output-proc is not supported by the V1 Engine. Falling back to V0. We recommend to remove --disable-async-output-proc from your config in favor of the V1 Engine.
INFO 05-16 12:13:43 [config.py:1639] Disabled the custom all-reduce kernel because it is not supported on current platform.
INFO 05-16 12:13:43 [api_server.py:249] Started engine process with PID 301
[W516 12:13:45.312811962 OperatorEntry.cpp:154] Warning: Warning only once for all operators, other operators may also be overridden.
Overriding a previously registered kernel for the same operator and the same dispatch key
operator: aten::_validate_compressed_sparse_indices(bool is_crow, Tensor compressed_idx, Tensor plain_idx, int cdim, int dim, int nnz) -> ()
registered at /pytorch/build/aten/src/ATen/RegisterSchema.cpp:6
dispatch key: XPU
previous kernel: registered at /pytorch/build/aten/src/ATen/RegisterCPU.cpp:30477
new kernel: registered at /build/intel-pytorch-extension/build/Release/csrc/gpu/csrc/aten/generated/ATen/RegisterXPU.cpp:468 (function operator())
INFO 05-16 12:13:47 [__init__.py:239] Automatically detected platform xpu.
WARNING 05-16 12:13:48 [_logger.py:68] Torch Profiler is enabled in the API server. This should ONLY be used for local development!
INFO 05-16 12:13:48 [importing.py:16] Triton not installed or not compatible; certain GPU-related functions will not be available.
INFO 05-16 12:13:48 [llm_engine.py:242] Initializing a V0 LLM engine (v0.8.3+ipexllm) with config: model='/data/Qwen/Qwen2.5-Coder-7B-Instruct', speculative_config=None, tokenizer='/data/Qwen/Qwen2.5-Coder-7B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=4, pipeline_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=xpu, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=Qwen/Qwen2.5-Coder-7B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=False, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[],"max_capture_size":0}, use_cached_outputs=True,
WARNING 05-16 12:13:48 [_logger.py:68] No existing RAY instance detected. A new instance will be launched with current node resources.
2025-05-16 12:13:49,817 INFO worker.py:1888 -- Started a local Ray instance.
INFO 05-16 12:13:50 [ray_utils.py:339] No current placement group found. Creating a new placement group.
INFO 05-16 12:13:50 [ray_distributed_executor.py:178] use_ray_spmd_worker: False
(pid=689) [W516 12:13:52.600319436 OperatorEntry.cpp:154] Warning: Warning only once for all operators, other operators may also be overridden.
(pid=689) Overriding a previously registered kernel for the same operator and the same dispatch key
(pid=689) operator: aten::_validate_compressed_sparse_indices(bool is_crow, Tensor compressed_idx, Tensor plain_idx, int cdim, int dim, int nnz) -> ()
(pid=689) registered at /pytorch/build/aten/src/ATen/RegisterSchema.cpp:6
(pid=689) dispatch key: XPU
(pid=689) previous kernel: registered at /pytorch/build/aten/src/ATen/RegisterCPU.cpp:30477
(pid=689) new kernel: registered at /build/intel-pytorch-extension/build/Release/csrc/gpu/csrc/aten/generated/ATen/RegisterXPU.cpp:468 (function operator())
(pid=693) INFO 05-16 12:13:55 [__init__.py:239] Automatically detected platform xpu.
(WrapperWithLoadBit pid=687) INFO 05-16 12:13:56 [importing.py:16] Triton not installed or not compatible; certain GPU-related functions will not be available.
INFO 05-16 12:13:57 [ray_distributed_executor.py:354] non_carry_over_env_vars from config: set()
INFO 05-16 12:13:57 [ray_distributed_executor.py:356] Copying the following environment variables to workers: ['LD_LIBRARY_PATH', 'VLLM_WORKER_MULTIPROC_METHOD', 'VLLM_RPC_TIMEOUT', 'VLLM_TORCH_PROFILER_DIR', 'VLLM_USE_V1']
INFO 05-16 12:13:57 [ray_distributed_executor.py:359] If certain env vars should NOT be copied to workers, add them to /root/.config/vllm/ray_non_carry_over_env_vars.json file
INFO 05-16 12:13:57 [xpu.py:39] Cannot use None backend on XPU.
INFO 05-16 12:13:57 [xpu.py:45] Using IPEX attention backend.
(WrapperWithLoadBit pid=685) INFO 05-16 12:13:57 [xpu.py:39] Cannot use None backend on XPU.
(WrapperWithLoadBit pid=685) INFO 05-16 12:13:57 [xpu.py:45] Using IPEX attention backend.
INFO 05-16 12:13:57 [shm_broadcast.py:264] vLLM message queue communication handle: Handle(local_reader_ranks=[1, 2, 3], buffer_handle=(3, 4194304, 6, 'psm_4d4434b3'), local_subscribe_addr='ipc:///tmp/19ab4468-1b3d-49fd-926e-5e5ced9ba70b', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 05-16 12:13:57 [parallel_state.py:957] rank 0 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 05-16 12:13:57 [config.py:3339] cudagraph sizes specified by model runner [] is overridden by config []
Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00<?, ?it/s]
(WrapperWithLoadBit pid=685) INFO 05-16 12:13:57 [parallel_state.py:957] rank 1 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 1
(WrapperWithLoadBit pid=685) INFO 05-16 12:13:57 [config.py:3339] cudagraph sizes specified by model runner [] is overridden by config []
Loading safetensors checkpoint shards: 25% Completed | 1/4 [00:00<00:00, 9.21it/s]
Loading safetensors checkpoint shards: 50% Completed | 2/4 [00:00<00:00, 7.64it/s]
Loading safetensors checkpoint shards: 75% Completed | 3/4 [00:00<00:00, 7.69it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:00<00:00, 9.22it/s]
INFO 05-16 12:13:58 [loader.py:447] Loading weights took 0.44 seconds
2025-05-16 12:13:58,699 - ipex_llm.transformers.utils - INFO - Converting the current model to fp8_e5m2 format......
2025-05-16 12:13:58,700 - ipex_llm.transformers.utils - INFO - Only HuggingFace Transformers models are currently supported for further optimizations
2025-05-16 12:14:00,277 - ipex_llm.transformers.utils - INFO - Only HuggingFace Transformers models are currently supported for further optimizations
2025-05-16 12:14:01,023 - ipex_llm.vllm.xpu.model_convert - INFO - Loading model weights took 2.0819 GB
(WrapperWithLoadBit pid=687) INFO 05-16 12:14:01 [loader.py:447] Loading weights took 3.29 seconds
(pid=685) INFO 05-16 12:13:55 [__init__.py:239] Automatically detected platform xpu. [repeated 3x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
(WrapperWithLoadBit pid=685) INFO 05-16 12:13:56 [importing.py:16] Triton not installed or not compatible; certain GPU-related functions will not be available. [repeated 3x across cluster]
(WrapperWithLoadBit pid=687) 2025-05-16 12:14:02,733 - ipex_llm.transformers.utils - INFO - Converting the current model to fp8_e5m2 format......
(WrapperWithLoadBit pid=687) 2025-05-16 12:14:02,734 - ipex_llm.transformers.utils - INFO - Only HuggingFace Transformers models are currently supported for further optimizations
(pid=685) [W516 12:13:53.009836931 OperatorEntry.cpp:154] Warning: Warning only once for all operators, other operators may also be overridden. [repeated 3x across cluster]
(pid=685) Overriding a previously registered kernel for the same operator and the same dispatch key [repeated 3x across cluster]
(pid=685) operator: aten::_validate_compressed_sparse_indices(bool is_crow, Tensor compressed_idx, Tensor plain_idx, int cdim, int dim, int nnz) -> () [repeated 3x across cluster]
(pid=685) registered at /pytorch/build/aten/src/ATen/RegisterSchema.cpp:6 [repeated 3x across cluster]
(pid=685) dispatch key: XPU [repeated 3x across cluster]
(pid=685) previous kernel: registered at /pytorch/build/aten/src/ATen/RegisterCPU.cpp:30477 [repeated 3x across cluster]
(pid=685) new kernel: registered at /build/intel-pytorch-extension/build/Release/csrc/gpu/csrc/aten/generated/ATen/RegisterXPU.cpp:468 (function operator()) [repeated 3x across cluster]
(WrapperWithLoadBit pid=685) 2025-05-16 12:14:03,665 - ipex_llm.transformers.utils - INFO - Converting the current model to fp8_e5m2 format...... [repeated 2x across cluster]
(WrapperWithLoadBit pid=687) 2025-05-16 12:14:07,773 - ipex_llm.transformers.utils - INFO - Only HuggingFace Transformers models are currently supported for further optimizations [repeated 3x across cluster]
(WrapperWithLoadBit pid=687) 2025-05-16 12:14:09,187 - ipex_llm.vllm.xpu.model_convert - INFO - Loading model weights took 2.0819 GB
2025:05:16-12:14:11:( 301) |CCL_WARN| value of CCL_WORKER_COUNT changed to be 2 (default:1)
2025:05:16-12:14:11:( 301) |CCL_WARN| value of CCL_ATL_TRANSPORT changed to be ofi (default:mpi)
2025:05:16-12:14:11:( 301) |CCL_WARN| value of CCL_ATL_SHM changed to be 1 (default:0)
2025:05:16-12:14:11:( 301) |CCL_WARN| value of CCL_DG2_ALLREDUCE changed to be 1 (default:0)
2025:05:16-12:14:11:( 301) |CCL_WARN| value of CCL_LOCAL_RANK changed to be 0 (default:-1)
2025:05:16-12:14:11:( 301) |CCL_WARN| value of CCL_LOCAL_SIZE changed to be 4 (default:-1)
2025:05:16-12:14:11:( 301) |CCL_WARN| value of CCL_PROCESS_LAUNCHER changed to be none (default:hydra)
2025:05:16-12:14:11:( 301) |CCL_WARN| value of CCL_ZE_IPC_EXCHANGE changed to be sockets (default:pidfd)
(WrapperWithLoadBit pid=685) *** SIGSEGV received at time=1747368852 on cpu 16 ***
(WrapperWithLoadBit pid=685) PC: @ 0x71fe06a2205e (unknown) smr_map_to_endpoint
(WrapperWithLoadBit pid=685) @ 0x7225f16a2733 (unknown) (unknown)
(WrapperWithLoadBit pid=685) [2025-05-16 12:14:12,344 E 685 685] logging.cc:496: *** SIGSEGV received at time=1747368852 on cpu 16 ***
(WrapperWithLoadBit pid=685) [2025-05-16 12:14:12,345 E 685 685] logging.cc:496: PC: @ 0x71fe06a2205e (unknown) smr_map_to_endpoint
(WrapperWithLoadBit pid=685) [2025-05-16 12:14:12,345 E 685 685] logging.cc:496: @ 0x7225f16a2733 (unknown) (unknown)
(WrapperWithLoadBit pid=685) Fatal Python error: Segmentation fault
(WrapperWithLoadBit pid=685)
(WrapperWithLoadBit pid=685) Stack (most recent call first):
(WrapperWithLoadBit pid=685) File "/usr/local/lib/python3.11/dist-packages/torch/distributed/distributed_c10d.py", line 2806 in all_reduce
(WrapperWithLoadBit pid=685) File "/usr/local/lib/python3.11/dist-packages/torch/distributed/c10d_logger.py", line 81 in wrapper
(WrapperWithLoadBit pid=685) File "/usr/local/lib/python3.11/dist-packages/vllm-0.8.3+ipexllm.xpu-py3.11-linux-x86_64.egg/vllm/distributed/device_communicators/xpu_communicator.py", line 22 in all_reduce
(WrapperWithLoadBit pid=685) File "/usr/local/lib/python3.11/dist-packages/vllm-0.8.3+ipexllm.xpu-py3.11-linux-x86_64.egg/vllm/distributed/parallel_state.py", line 316 in _all_reduce_out_place
(WrapperWithLoadBit pid=685) File "/usr/local/lib/python3.11/dist-packages/vllm-0.8.3+ipexllm.xpu-py3.11-linux-x86_64.egg/vllm/distributed/parallel_state.py", line 313 in all_reduce
(WrapperWithLoadBit pid=685) File "/usr/local/lib/python3.11/dist-packages/vllm-0.8.3+ipexllm.xpu-py3.11-linux-x86_64.egg/vllm/distributed/communication_op.py", line 13 in tensor_model_parallel_all_reduce
(WrapperWithLoadBit pid=685) File "/usr/local/lib/python3.11/dist-packages/vllm-0.8.3+ipexllm.xpu-py3.11-linux-x86_64.egg/vllm/model_executor/layers/vocab_parallel_embedding.py", line 421 in forward
(WrapperWithLoadBit pid=685) File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1750 in _call_impl
(WrapperWithLoadBit pid=685) File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1739 in _wrapped_call_impl
(WrapperWithLoadBit pid=685) File "/usr/local/lib/python3.11/dist-packages/vllm-0.8.3+ipexllm.xpu-py3.11-linux-x86_64.egg/vllm/model_executor/models/qwen2.py", line 324 in get_input_embeddings
(WrapperWithLoadBit pid=685) File "/usr/local/lib/python3.11/dist-packages/vllm-0.8.3+ipexllm.xpu-py3.11-linux-x86_64.egg/vllm/model_executor/models/qwen2.py", line 337 in forward
(WrapperWithLoadBit pid=685) File "/usr/local/lib/python3.11/dist-packages/vllm-0.8.3+ipexllm.xpu-py3.11-linux-x86_64.egg/vllm/compilation/decorators.py", line 172 in __call__
(WrapperWithLoadBit pid=685) File "/usr/local/lib/python3.11/dist-packages/vllm-0.8.3+ipexllm.xpu-py3.11-linux-x86_64.egg/vllm/model_executor/models/qwen2.py", line 468 in forward
(WrapperWithLoadBit pid=685) File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1750 in _call_impl
(WrapperWithLoadBit pid=685) File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1739 in _wrapped_call_impl
(WrapperWithLoadBit pid=685) File "/usr/local/lib/python3.11/dist-packages/vllm-0.8.3+ipexllm.xpu-py3.11-linux-x86_64.egg/vllm/worker/xpu_model_runner.py", line 963 in execute_model
(WrapperWithLoadBit pid=685) File "/usr/local/lib/python3.11/dist-packages/torch/utils/_contextlib.py", line 116 in decorate_context
(WrapperWithLoadBit pid=685) File "/usr/local/lib/python3.11/dist-packages/vllm-0.8.3+ipexllm.xpu-py3.11-linux-x86_64.egg/vllm/worker/xpu_model_runner.py", line 851 in profile_run
(WrapperWithLoadBit pid=685) File "/usr/local/lib/python3.11/dist-packages/torch/utils/_contextlib.py", line 116 in decorate_context
(WrapperWithLoadBit pid=685) File "/usr/local/lib/python3.11/dist-packages/vllm-0.8.3+ipexllm.xpu-py3.11-linux-x86_64.egg/vllm/worker/xpu_worker.py", line 113 in determine_num_available_blocks
(WrapperWithLoadBit pid=685) File "/usr/local/lib/python3.11/dist-packages/torch/utils/_contextlib.py", line 116 in decorate_context
(WrapperWithLoadBit pid=685) File "/usr/local/lib/python3.11/dist-packages/vllm-0.8.3+ipexllm.xpu-py3.11-linux-x86_64.egg/vllm/utils.py", line 2349 in run_method
(WrapperWithLoadBit pid=685) File "/usr/local/lib/python3.11/dist-packages/vllm-0.8.3+ipexllm.xpu-py3.11-linux-x86_64.egg/vllm/worker/worker_base.py", line 612 in execute_method
(WrapperWithLoadBit pid=685) File "/usr/local/lib/python3.11/dist-packages/ray/util/tracing/tracing_helper.py", line 463 in _resume_span
(WrapperWithLoadBit pid=685) File "/usr/local/lib/python3.11/dist-packages/ray/_private/function_manager.py", line 689 in actor_method_executor
(WrapperWithLoadBit pid=685) File "/usr/local/lib/python3.11/dist-packages/ray/_private/worker.py", line 946 in main_loop
(WrapperWithLoadBit pid=685) File "/usr/local/lib/python3.11/dist-packages/ray/_private/workers/default_worker.py", line 330 in <module>
(WrapperWithLoadBit pid=685)
(WrapperWithLoadBit pid=685) Extension modules: msgpack._cmsgpack, google._upb._message, psutil._psutil_linux, psutil._psutil_posix, setproctitle, yaml._yaml, charset_normalizer.md, requests.packages.charset_normalizer.md, requests.packages.chardet.md, uvloop.loop, ray._raylet, markupsafe._speedups, PIL._imaging, numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._dynamo.autograd_compiler, torch._C._dynamo.eval_frame, torch._C._dynamo.guards, torch._C._dynamo.utils, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, torch.xpu.errno, torch.xpu.sys, ruamel.yaml.clib._ruamel_yaml, _ruamel_yaml, PIL._imagingft, scipy._lib._ccallback_c, scipy.linalg._fblas, scipy.linalg._flapack, scipy.linalg.cython_lapack, scipy.linalg._cythonized_array_utils, scipy.linalg._solve_toeplitz, scipy.linalg._decomp_lu_cython, scipy.linalg._matfuncs_sqrtm_triu, scipy.linalg._matfuncs_expm, scipy.linalg._linalg_pythran, scipy.linalg.cython_blas, scipy.linalg._decomp_update, scipy.sparse._sparsetools, _csparsetools, scipy.sparse._csparsetools, scipy.sparse.linalg._dsolve._superlu, scipy.sparse.linalg._eigen.arpack._arpack, scipy.sparse.linalg._propack._spropack, scipy.sparse.linalg._propack._dpropack, scipy.sparse.linalg._propack._cpropack, scipy.sparse.linalg._propack._zpropack, scipy.sparse.csgraph._tools, scipy.sparse.csgraph._shortest_path, scipy.sparse.csgraph._traversal, scipy.sparse.csgraph._min_spanning_tree, scipy.sparse.csgraph._flow, scipy.sparse.csgraph._matching, scipy.sparse.csgraph._reordering, scipy.optimize._group_columns, scipy._lib.messagestream, scipy.optimize._trlib._trlib, scipy.optimize._lbfgsb, _moduleTNC, scipy.optimize._moduleTNC, scipy.optimize._cobyla, scipy.optimize._slsqp, scipy.optimize._minpack, scipy.optimize._lsq.givens_elimination, scipy.optimize._zeros, scipy.optimize._cython_nnls, scipy._lib._uarray._uarray, scipy.special._ufuncs_cxx, scipy.special._ufuncs, scipy.special._specfun, scipy.special._comb, scipy.special._ellip_harm_2, scipy.linalg._decomp_interpolative, scipy.optimize._bglu_dense, scipy.optimize._lsap, scipy.spatial._ckdtree, scipy.spatial._qhull, scipy.spatial._voronoi, scipy.spatial._distance_wrap, scipy.spatial._hausdorff, scipy.spatial.transform._rotation, scipy.optimize._direct, msgspec._core, zmq.backend.cython._zmq, multidict._multidict, yarl._quoting_c, propcache._helpers_c, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket.mask, aiohttp._websocket.reader_c, pyarrow.lib, pyarrow._json, sentencepiece._sentencepiece, numba.core.typeconv._typeconv, numba._helperlib, numba._dynfunc, numba._dispatcher, numba.core.typing.builtins.itertools, numba.cpython.builtins.math, numba.core.runtime._nrt_python, numba.np.ufunc._internal, numba.experimental.jitclass._box (total: 119)
(raylet) A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffff32cd6ffa69a173ea1eecb8f501000000 Worker ID: ce3485038454363b22a7c7dbd56fb9a471b2293c3aacf1470a8fd34e Node ID: 12ade23682c25a7f7e904f35e25157b7dd5dedf0e8d75ebb62b43999 Worker IP address: 172.19.0.2 Worker port: 33983 Worker PID: 685 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
(WrapperWithLoadBit pid=687) INFO 05-16 12:13:57 [xpu.py:39] Cannot use None backend on XPU. [repeated 2x across cluster]
(WrapperWithLoadBit pid=687) INFO 05-16 12:13:57 [xpu.py:45] Using IPEX attention backend. [repeated 2x across cluster]
(WrapperWithLoadBit pid=687) INFO 05-16 12:13:57 [parallel_state.py:957] rank 3 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 3 [repeated 2x across cluster]
(WrapperWithLoadBit pid=687) INFO 05-16 12:13:57 [config.py:3339] cudagraph sizes specified by model runner [] is overridden by config [] [repeated 2x across cluster]
(WrapperWithLoadBit pid=685) INFO 05-16 12:14:01 [loader.py:447] Loading weights took 3.99 seconds [repeated 2x across cluster]
@hualongfeng reached out to me via Teams. After syncing, we found that the segmentation fault he encountered when using multiple GPUs is intermittent. His team had written a code-generation script that starts and stops the vLLM container via Docker Compose, and when he ran this script repeatedly to test stability, the segmentation fault would occasionally occur at random.
I also consulted with @xiangyuT, but we currently don't have any concrete ideas. I have advised Hualong to avoid this kind of repeated start/stop cycle for now, and we will follow up with suggestions once we have better insights.
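For context, a rough sketch of the kind of repeated start/stop loop described above (hypothetical harness; the real reproduction was a code-generation script, and the port, model name, and endpoint here are assumptions based on the compose file earlier in this issue):
#!/bin/bash
# Hypothetical stress loop: bring the stack up, wait for the health check, send one request, tear down, repeat.
for i in $(seq 1 50); do
  docker compose up -d
  # Wait for the server to report healthy (host port assumed from the compose port mapping)
  until curl -sf http://localhost:8008/health > /dev/null; do sleep 5; done
  # Send a single completion request to the OpenAI-compatible endpoint
  curl -sf http://localhost:8008/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "Qwen/Qwen2.5-Coder-7B-Instruct", "prompt": "def fib(n):", "max_tokens": 64}' \
    > /dev/null || echo "run $i failed"
  docker compose down
done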