[Bug]: Chunked prefill feature fails on ppc64le (IBM POWER)
Your current environment
The output of `python collect_env.py`
```text
Collecting environment information...
PyTorch version: 2.7.0a0+gitd0f5df8
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A
OS: Red Hat Enterprise Linux 9.5 (Plow) (ppc64le)
GCC version: (GCC) 13.3.1 20240611 (Red Hat 13.3.1-2)
Clang version: 18.1.8 (Red Hat, Inc. 18.1.8-3.el9)
CMake version: version 3.31.2
Libc version: glibc-2.34
Python version: 3.11.11 | packaged by conda-forge | (main, Dec 5 2024, 14:07:52) [GCC 13.3.0] (64-bit runtime)
Python platform: Linux-5.14.0-503.23.1.el9_5.ppc64le-ppc64le-with-glibc2.34
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: False
CPU:
Architecture: ppc64le
Byte Order: Little Endian
CPU(s): 320
On-line CPU(s) list: 0-319
Model name: POWER10 (architected), altivec supported
Model: 2.0 (pvr 0080 0200)
Thread(s) per core: 8
Core(s) per socket: 10
Socket(s): 4
Hypervisor vendor: pHyp
Virtualization type: para
L1d cache: 2.5 MiB (80 instances)
L1i cache: 3.8 MiB (80 instances)
L2 cache: 80 MiB (80 instances)
L3 cache: 320 MiB (80 instances)
NUMA node(s): 4
NUMA node0 CPU(s): 0-79
NUMA node1 CPU(s): 80-159
NUMA node2 CPU(s): 160-239
NUMA node3 CPU(s): 240-319
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass: Not affected
Vulnerability Spectre v1: Mitigation; __user pointer sanitization, ori31 speculation barrier enabled
Vulnerability Spectre v2: Mitigation; Software count cache flush (hardware accelerated), Software link stack flush
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] optree==0.13.1
[pip3] pyzmq==26.2.0
[pip3] torch==2.7.0a0+gitd0f5df8
[pip3] transformers==4.47.1
[conda] numpy 1.26.4 pypi_0 pypi
[conda] optree 0.13.1 pypi_0 pypi
[conda] pyzmq 26.2.0 py311he15fa53_3 conda-forge
[conda] torch 2.7.0a0+gitd0f5df8 pypi_0 pypi
[conda] transformers 4.47.1 pypi_0 pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.6.6.post2.dev103+ge512f76a.d20250129
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
Could not collect
LD_LIBRARY_PATH=/home/akashk/miniconda3/envs/vllm_oob/lib/python3.11/site-packages/cv2/../../lib64:/home/akashk/miniconda3/envs/vllm_oob/lib:
```
🐛 Describe the bug
When chunked prefill is enabled on ppc64le, vLLM fails on an intel_extension_for_pytorch (IPEX) dependency: the CPU attention backend unconditionally imports IPEX on the chunked-prefill code path, and since IPEX is not available for ppc64le, the import raises a ModuleNotFoundError.
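For context, the traceback below shows the failing import lives in vllm/attention/backends/torch_sdpa.py (line 536), which pulls in intel_extension_for_pytorch.llm.modules only on the chunked-prefill path. A minimal sketch of one possible shape of a fix, guarding the import at module scope; the `_IPEX_AVAILABLE` flag and the dispatch helper below are hypothetical names, not existing vLLM API:

```python
# Hypothetical sketch, not vLLM's actual code: probe for IPEX once at
# import time instead of importing it inside the forward pass.
try:
    import intel_extension_for_pytorch.llm.modules as ipex_modules
    _IPEX_AVAILABLE = True
except ImportError:
    ipex_modules = None
    _IPEX_AVAILABLE = False


def _chunked_prefill_attention(q, k, v, kv_cache, attn_metadata):
    """Hypothetical dispatcher for the chunked-prefill attention path."""
    if not _IPEX_AVAILABLE:
        # On ppc64le (or any platform without an IPEX wheel) fail fast
        # with an actionable message, or fall back to a pure
        # torch.nn.functional.scaled_dot_product_attention implementation.
        raise RuntimeError(
            "Chunked prefill on the CPU backend requires "
            "intel_extension_for_pytorch, which is unavailable on this "
            "platform; rerun with enable_chunked_prefill=False.")
    ...  # IPEX fused path would go here
```

Either a pure-SDPA fallback or an early, explicit error at engine construction time would be friendlier than a ModuleNotFoundError surfacing deep inside the first generate() call.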
Test script
```python
from vllm import LLM, SamplingParams

# Enable chunked prefill
llm = LLM(model="facebook/opt-1.3b", enable_chunked_prefill=True)

long_prompt = """Once upon a time in a faraway land, there was a wise old owl who lived in a hollow tree.
This owl had seen many seasons pass and had gathered knowledge from all corners of the world..."""

sampling_params = SamplingParams(temperature=0.7, max_tokens=100)
outputs = llm.generate([long_prompt], sampling_params)

for output in outputs:
    print("Generated Text:", output.outputs[0].text)
```
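For completeness: since the traceback shows IPEX is only imported on the chunked-prefill branch, leaving the flag off should avoid the crash (an assumption inferred from the code path, not re-verified here):

```python
# Assumed workaround (inferred from the traceback, not re-verified):
# without chunked prefill the Torch SDPA backend never imports IPEX.
llm = LLM(model="facebook/opt-1.3b", enable_chunked_prefill=False)
```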
Error Log
```text
INFO 02-16 08:18:58 __init__.py:183] Automatically detected platform cpu.
config.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 653/653 [00:00<00:00, 5.22MB/s]
INFO 02-16 08:18:59 config.py:2280] For POWERPC, we cast models to bfloat16 instead of using float16 by default. Float16 is not currently supported for POWERPC.
WARNING 02-16 08:18:59 config.py:2324] Casting torch.float16 to torch.bfloat16.
INFO 02-16 08:19:04 config.py:526] This model supports multiple tasks: {'classify', 'reward', 'embed', 'generate', 'score'}. Defaulting to 'generate'.
INFO 02-16 08:19:04 config.py:1494] Chunked prefill is enabled with max_num_batched_tokens=2048.
WARNING 02-16 08:19:04 config.py:662] Async output processing is not supported on the current platform type cpu.
WARNING 02-16 08:19:04 cpu.py:60] CUDA graph is not supported on CPU, fallback to the eager mode.
WARNING 02-16 08:19:04 cpu.py:75] Environment variable VLLM_CPU_KVCACHE_SPACE (GB) for CPU backend is not set, using 4 by default.
INFO 02-16 08:19:04 llm_engine.py:232] Initializing a V0 LLM engine (v0.1.dev4356+gb02fd28.d20250129) with config: model='facebook/opt-1.3b', speculative_config=None, tokenizer='facebook/opt-1.3b', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=2048, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=cpu, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=facebook/opt-1.3b, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=True, use_async_output_proc=False, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=False,
tokenizer_config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 685/685 [00:00<00:00, 6.84MB/s]
vocab.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 899k/899k [00:00<00:00, 14.9MB/s]
merges.txt: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 456k/456k [00:00<00:00, 40.1MB/s]
special_tokens_map.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 441/441 [00:00<00:00, 5.82MB/s]
generation_config.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 137/137 [00:00<00:00, 1.51MB/s]
INFO 02-16 08:19:06 cpu.py:36] Cannot use None backend on CPU.
INFO 02-16 08:19:06 cpu.py:37] Using Torch SDPA backend.
INFO 02-16 08:19:06 importing.py:14] Triton not installed or not compatible; certain GPU-related functions will not be available.
INFO 02-16 08:19:06 weight_utils.py:251] Using model weights format ['*.bin']
pytorch_model.bin: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2.63G/2.63G [00:23<00:00, 114MB/s]
Loading pt checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading pt checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 1.05it/s]
Loading pt checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 1.01it/s]
INFO 02-16 08:19:30 executor_base.py:108] # CPU blocks: 1365, # CPU blocks: 0
INFO 02-16 08:19:30 executor_base.py:113] Maximum concurrency for 2048 tokens per request: 10.66x
INFO 02-16 08:19:31 llm_engine.py:429] init engine (profile, create kv cache, warmup model) took 0.17 seconds
Processed prompts: 0%| | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s][rank0]: Traceback (most recent call last):
[rank0]: File "/home/akashk/vllm_oob/vllm/examples/test_chunked_prefill.py", line 14, in <module>
[rank0]: outputs = llm.generate([long_prompt], sampling_params)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/akashk/miniconda3/envs/vllm_oob/lib/python3.11/site-packages/vllm-0.1.dev4356+gb02fd28.d20250129.cpu-py3.11-linux-ppc64le.egg/vllm/utils.py", line 1074, in inner
[rank0]: return fn(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/akashk/miniconda3/envs/vllm_oob/lib/python3.11/site-packages/vllm-0.1.dev4356+gb02fd28.d20250129.cpu-py3.11-linux-ppc64le.egg/vllm/entrypoints/llm.py", line 467, in generate
[rank0]: outputs = self._run_engine(use_tqdm=use_tqdm)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/akashk/miniconda3/envs/vllm_oob/lib/python3.11/site-packages/vllm-0.1.dev4356+gb02fd28.d20250129.cpu-py3.11-linux-ppc64le.egg/vllm/entrypoints/llm.py", line 1388, in _run_engine
[rank0]: step_outputs = self.llm_engine.step()
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/akashk/miniconda3/envs/vllm_oob/lib/python3.11/site-packages/vllm-0.1.dev4356+gb02fd28.d20250129.cpu-py3.11-linux-ppc64le.egg/vllm/engine/llm_engine.py", line 1384, in step
[rank0]: outputs = self.model_executor.execute_model(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/akashk/miniconda3/envs/vllm_oob/lib/python3.11/site-packages/vllm-0.1.dev4356+gb02fd28.d20250129.cpu-py3.11-linux-ppc64le.egg/vllm/executor/executor_base.py", line 136, in execute_model
[rank0]: output = self.collective_rpc("execute_model",
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/akashk/miniconda3/envs/vllm_oob/lib/python3.11/site-packages/vllm-0.1.dev4356+gb02fd28.d20250129.cpu-py3.11-linux-ppc64le.egg/vllm/executor/uniproc_executor.py", line 49, in collective_rpc
[rank0]: answer = run_method(self.driver_worker, method, args, kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/akashk/miniconda3/envs/vllm_oob/lib/python3.11/site-packages/vllm-0.1.dev4356+gb02fd28.d20250129.cpu-py3.11-linux-ppc64le.egg/vllm/utils.py", line 2208, in run_method
[rank0]: return func(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/akashk/miniconda3/envs/vllm_oob/lib/python3.11/site-packages/vllm-0.1.dev4356+gb02fd28.d20250129.cpu-py3.11-linux-ppc64le.egg/vllm/worker/worker_base.py", line 411, in execute_model
[rank0]: output = self.model_runner.execute_model(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/akashk/miniconda3/envs/vllm_oob/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/akashk/miniconda3/envs/vllm_oob/lib/python3.11/site-packages/vllm-0.1.dev4356+gb02fd28.d20250129.cpu-py3.11-linux-ppc64le.egg/vllm/worker/cpu_model_runner.py", line 655, in execute_model
[rank0]: hidden_states = model_executable(
[rank0]: ^^^^^^^^^^^^^^^^^
[rank0]: File "/home/akashk/miniconda3/envs/vllm_oob/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/akashk/miniconda3/envs/vllm_oob/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/akashk/miniconda3/envs/vllm_oob/lib/python3.11/site-packages/vllm-0.1.dev4356+gb02fd28.d20250129.cpu-py3.11-linux-ppc64le.egg/vllm/model_executor/models/opt.py", line 368, in forward
[rank0]: hidden_states = self.model(input_ids, positions, kv_caches,
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/akashk/miniconda3/envs/vllm_oob/lib/python3.11/site-packages/vllm-0.1.dev4356+gb02fd28.d20250129.cpu-py3.11-linux-ppc64le.egg/vllm/compilation/decorators.py", line 170, in __call__
[rank0]: return self.forward(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/akashk/miniconda3/envs/vllm_oob/lib/python3.11/site-packages/vllm-0.1.dev4356+gb02fd28.d20250129.cpu-py3.11-linux-ppc64le.egg/vllm/model_executor/models/opt.py", line 323, in forward
[rank0]: return self.decoder(input_ids,
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/akashk/miniconda3/envs/vllm_oob/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/akashk/miniconda3/envs/vllm_oob/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/akashk/miniconda3/envs/vllm_oob/lib/python3.11/site-packages/vllm-0.1.dev4356+gb02fd28.d20250129.cpu-py3.11-linux-ppc64le.egg/vllm/model_executor/models/opt.py", line 280, in forward
[rank0]: hidden_states = layer(hidden_states,
[rank0]: ^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/akashk/miniconda3/envs/vllm_oob/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/akashk/miniconda3/envs/vllm_oob/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/akashk/miniconda3/envs/vllm_oob/lib/python3.11/site-packages/vllm-0.1.dev4356+gb02fd28.d20250129.cpu-py3.11-linux-ppc64le.egg/vllm/model_executor/models/opt.py", line 173, in forward
[rank0]: hidden_states = self.self_attn(hidden_states=hidden_states,
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/akashk/miniconda3/envs/vllm_oob/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/akashk/miniconda3/envs/vllm_oob/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/akashk/miniconda3/envs/vllm_oob/lib/python3.11/site-packages/vllm-0.1.dev4356+gb02fd28.d20250129.cpu-py3.11-linux-ppc64le.egg/vllm/model_executor/models/opt.py", line 113, in forward
[rank0]: attn_output = self.attn(q, k, v, kv_cache, attn_metadata)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/akashk/miniconda3/envs/vllm_oob/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/akashk/miniconda3/envs/vllm_oob/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/akashk/miniconda3/envs/vllm_oob/lib/python3.11/site-packages/vllm-0.1.dev4356+gb02fd28.d20250129.cpu-py3.11-linux-ppc64le.egg/vllm/attention/layer.py", line 177, in forward
[rank0]: return torch.ops.vllm.unified_attention(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/akashk/miniconda3/envs/vllm_oob/lib/python3.11/site-packages/torch/_ops.py", line 1158, in __call__
[rank0]: return self._op(*args, **(kwargs or {}))
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/akashk/miniconda3/envs/vllm_oob/lib/python3.11/site-packages/vllm-0.1.dev4356+gb02fd28.d20250129.cpu-py3.11-linux-ppc64le.egg/vllm/attention/layer.py", line 279, in unified_attention
[rank0]: return self.impl.forward(self, query, key, value, kv_cache, attn_metadata)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/akashk/miniconda3/envs/vllm_oob/lib/python3.11/site-packages/vllm-0.1.dev4356+gb02fd28.d20250129.cpu-py3.11-linux-ppc64le.egg/vllm/attention/backends/torch_sdpa.py", line 536, in forward
[rank0]: import intel_extension_for_pytorch.llm.modules as ipex_modules
[rank0]: ModuleNotFoundError: No module named 'intel_extension_for_pytorch'
Processed prompts: 0%| | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]
```
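The root cause can be confirmed independently of vLLM, since no IPEX distribution is published for ppc64le; for example:

```python
import importlib.util

# On ppc64le this prints None: there is no intel_extension_for_pytorch
# package for POWER, so vLLM's unconditional import must fail.
print(importlib.util.find_spec("intel_extension_for_pytorch"))
```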
Before submitting a new issue...
- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.