
vLLM freezes with gpu-memory-utilization > 0.55

Open · nathanodle opened this issue 3 weeks ago · 3 comments

I'm running vLLM according to the instructions. Docker segfaults at startup, so I'm running straight on the machine.

I'm starting the server with the following shell script. As you can see, I've turned max-model-len, max-num-batched-tokens, and max-num-seqs down as far as I can for my use case, just to get inference to run. With this configuration I get an error (full output below) with a prompt of around 7200 tokens:

#!/bin/bash
model="/home/aiml/models/gradientai/Llama-3-8B-Instruct-Gradient-1048k"
served_model_name="ensemble"

# SYCL / Level Zero runtime settings recommended for ipex-llm on Intel GPUs
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export SYCL_CACHE_PERSISTENT=1

# oneCCL settings for multi-GPU (tensor-parallel) communication
export CCL_WORKER_COUNT=4
export FI_PROVIDER=shm
export CCL_ATL_TRANSPORT=ofi
export CCL_ZE_IPC_EXCHANGE=sockets
export CCL_ATL_SHM=1

python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server \
  --served-model-name $served_model_name \
  --port 8000 \
  --model $model \
  --trust-remote-code \
  --gpu-memory-utilization 0.55 \
  --device xpu \
  --dtype float16 \
  --enforce-eager \
  --max-model-len 8192 \
  --max-num-batched-tokens 8192 \
  --max-num-seqs 1 \
  --load-in-low-bit sym_int4 \
  --tensor-parallel-size 4
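
For reference, the request that triggers the failure is roughly equivalent to the sketch below (not my exact client code; the prompt here is a hypothetical stand-in and only its ~7200-token length matters, while the port and model name match the script above):

import requests

# Hypothetical stand-in for the real prompt; only its length matters.
long_prompt = "hello " * 7200   # very roughly 7200 tokens

resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "ensemble",        # --served-model-name from the script above
        "prompt": long_prompt,
        "max_tokens": 256,
        "temperature": 0.0,
    },
    timeout=600,
)
print(resp.status_code)
print(resp.json())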

I have to turn gpu-memory-utilization down to 0.55 or the service won't start. Even at 0.65, the service just freezes with several CPU cores pegged at 100% and the following output:

/home/aiml/code/ipex-vllm/venv/lib/python3.11/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: ''If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
  warn(
2024-06-25 08:19:41,903 - INFO - intel_extension_for_pytorch auto imported
INFO 06-25 08:19:42 api_server.py:258] vLLM API server version 0.3.3
INFO 06-25 08:19:42 api_server.py:259] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, served_model_name='ensemble', lora_modules=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], load_in_low_bit='sym_int4', model='/home/aiml/models/gradientai/Llama-3-8B-Instruct-Gradient-1048k', tokenizer=None, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, download_dir=None, load_format='auto', dtype='float16', kv_cache_dtype='auto', max_model_len=8192, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=4, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, seed=0, swap_space=4, gpu_memory_utilization=0.75, max_num_batched_tokens=8192, max_num_seqs=1, max_paddings=256, max_logprobs=5, disable_log_stats=False, quantization=None, enforce_eager=True, max_context_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', max_cpu_loras=None, device='xpu', engine_use_ray=False, disable_log_requests=False, max_log_len=None)
WARNING 06-25 08:19:42 config.py:710] Casting torch.bfloat16 to torch.float16.
INFO 06-25 08:19:42 config.py:523] Custom all-reduce kernels are temporarily disabled due to stability issues. We will re-enable them once the issues are resolved.
2024-06-25 08:19:44,717 INFO worker.py:1753 -- Started a local Ray instance.
INFO 06-25 08:19:45 llm_engine.py:68] Initializing an LLM engine (v0.3.3) with config: model='/home/aiml/models/gradientai/Llama-3-8B-Instruct-Gradient-1048k', tokenizer='/home/aiml/models/gradientai/Llama-3-8B-Instruct-Gradient-1048k', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=4, disable_custom_all_reduce=True, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=xpu, seed=0, max_num_batched_tokens=8192, max_num_seqs=1, max_model_len=8192)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
(RayWorkerVllm pid=68203) /home/aiml/code/ipex-vllm/venv/lib/python3.11/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: ''If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
(RayWorkerVllm pid=68203)   warn(
(RayWorkerVllm pid=68203) 2024-06-25 08:19:56,470 - INFO - intel_extension_for_pytorch auto imported
INFO 06-25 08:19:56 attention.py:71] flash_attn is not found. Using xformers backend.
2024-06-25 08:19:57,531 - INFO - Converting the current model to sym_int4 format......
2024-06-25 08:19:57,531 - INFO - Only HuggingFace Transformers models are currently supported for further optimizations
(RayWorkerVllm pid=68203) INFO 06-25 08:19:58 attention.py:71] flash_attn is not found. Using xformers backend.
2024-06-25 08:20:00,041 - INFO - Only HuggingFace Transformers models are currently supported for further optimizations
INFO 06-25 08:20:00 model_convert.py:181] Loading model weights took 1.6073 GB
(RayWorkerVllm pid=68291) 2024-06-25 08:20:02,898 - INFO - Converting the current model to sym_int4 format......
(RayWorkerVllm pid=68291) 2024-06-25 08:20:02,898 - INFO - Only HuggingFace Transformers models are currently supported for further optimizations
(RayWorkerVllm pid=68509) /home/aiml/code/ipex-vllm/venv/lib/python3.11/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: ''If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source? [repeated 2x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
(RayWorkerVllm pid=68509)   warn( [repeated 2x across cluster]
(RayWorkerVllm pid=68509) 2024-06-25 08:19:56,471 - INFO - intel_extension_for_pytorch auto imported [repeated 2x across cluster]
(RayWorkerVllm pid=68203) 2024-06-25 08:20:02,930 - INFO - Converting the current model to sym_int4 format...... [repeated 2x across cluster]
(RayWorkerVllm pid=68203) 2024-06-25 08:20:09,774 - INFO - Only HuggingFace Transformers models are currently supported for further optimizations [repeated 3x across cluster]
(RayWorkerVllm pid=68203) INFO 06-25 08:20:10 model_convert.py:181] Loading model weights took 1.6073 GB
(RayWorkerVllm pid=68509) INFO 06-25 08:19:58 attention.py:71] flash_attn is not found. Using xformers backend. [repeated 2x across cluster]
2024:06:25-08:20:10:(60656) |CCL_WARN| could not get local_idx/count from environment variables, trying to get them from ATL
2024:06:25-08:20:10:(60656) |CCL_WARN| sockets exchange mode is set. It may cause potential problem of 'Too many open file descriptors'
(RayWorkerVllm pid=68203) 2024:06:25-08:20:11:(68203) |CCL_WARN| could not get local_idx/count from environment variables, trying to get them from ATL
(RayWorkerVllm pid=68203) 2024:06:25-08:20:11:(68203) |CCL_WARN| sockets exchange mode is set. It may cause potential problem of 'Too many open file descriptors'
INFO 06-25 08:20:22 ipex_llm_gpu_executor.py:262] # GPU blocks: 19918, # CPU blocks: 8192
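
For context, my rough reading of the "# GPU blocks: 19918" line above (a back-of-envelope sketch; I'm assuming an fp16 KV cache and the stock Llama-3-8B config, with its 8 KV heads split across the 4 tensor-parallel ranks):

# Back-of-envelope for the KV cache implied by "# GPU blocks: 19918"
# (my assumptions, not measured from the run).
block_size = 16                 # matches block_size=16 in the args dump above
num_layers = 32                 # Llama-3-8B
kv_heads_per_rank = 8 // 4      # 8 KV heads over tensor-parallel-size 4
head_dim = 128
bytes_fp16 = 2

# x2 for the separate K and V caches
bytes_per_block = block_size * num_layers * kv_heads_per_rank * head_dim * 2 * bytes_fp16
gpu_blocks = 19918
print(f"~{gpu_blocks * bytes_per_block / 1024**3:.1f} GiB of pre-allocated KV cache per GPU")
# ~9.7 GiB per GPU, before any prefill work starts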

With gpu-memory-utilization set to 0.55, the service starts, but with a longer context it errors with this output:

INFO 06-25 08:24:43 metrics.py:217] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 3.1 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 2.8%, CPU KV cache usage: 0.0%
INFO 06-25 08:24:53 metrics.py:217] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 2.8%, CPU KV cache usage: 0.0%
INFO 06-25 08:25:03 metrics.py:217] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 2.8%, CPU KV cache usage: 0.0%
INFO 06-25 08:25:13 metrics.py:217] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 2.8%, CPU KV cache usage: 0.0%
ERROR 06-25 08:25:15 async_llm_engine.py:41] Engine background task failed
ERROR 06-25 08:25:15 async_llm_engine.py:41] Traceback (most recent call last):
ERROR 06-25 08:25:15 async_llm_engine.py:41]   File "/home/aiml/code/ipex-vllm/vllm/vllm/engine/async_llm_engine.py", line 36, in _raise_exception_on_finish
ERROR 06-25 08:25:15 async_llm_engine.py:41]     task.result()
ERROR 06-25 08:25:15 async_llm_engine.py:41]   File "/home/aiml/code/ipex-vllm/vllm/vllm/engine/async_llm_engine.py", line 467, in run_engine_loop
ERROR 06-25 08:25:15 async_llm_engine.py:41]     has_requests_in_progress = await asyncio.wait_for(
ERROR 06-25 08:25:15 async_llm_engine.py:41]                                ^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-25 08:25:15 async_llm_engine.py:41]   File "/usr/lib/python3.11/asyncio/tasks.py", line 484, in wait_for
ERROR 06-25 08:25:15 async_llm_engine.py:41]     return fut.result()
ERROR 06-25 08:25:15 async_llm_engine.py:41]            ^^^^^^^^^^^^
ERROR 06-25 08:25:15 async_llm_engine.py:41]   File "/home/aiml/code/ipex-vllm/vllm/vllm/engine/async_llm_engine.py", line 441, in engine_step
ERROR 06-25 08:25:15 async_llm_engine.py:41]     request_outputs = await self.engine.step_async()
ERROR 06-25 08:25:15 async_llm_engine.py:41]                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-25 08:25:15 async_llm_engine.py:41]   File "/home/aiml/code/ipex-vllm/vllm/vllm/engine/async_llm_engine.py", line 211, in step_async
ERROR 06-25 08:25:15 async_llm_engine.py:41]     output = await self.model_executor.execute_model_async(
ERROR 06-25 08:25:15 async_llm_engine.py:41]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-25 08:25:15 async_llm_engine.py:41]   File "/home/aiml/code/ipex-vllm/venv/lib/python3.11/site-packages/ipex_llm/vllm/xpu/ipex_llm_gpu_executor.py", line 443, in execute_model_async
ERROR 06-25 08:25:15 async_llm_engine.py:41]     all_outputs = await self._run_workers_async(
ERROR 06-25 08:25:15 async_llm_engine.py:41]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-25 08:25:15 async_llm_engine.py:41]   File "/home/aiml/code/ipex-vllm/venv/lib/python3.11/site-packages/ipex_llm/vllm/xpu/ipex_llm_gpu_executor.py", line 433, in _run_workers_async
ERROR 06-25 08:25:15 async_llm_engine.py:41]     all_outputs = await asyncio.gather(*coros)
ERROR 06-25 08:25:15 async_llm_engine.py:41]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-25 08:25:15 async_llm_engine.py:41]   File "/usr/lib/python3.11/asyncio/tasks.py", line 689, in _wrap_awaitable
ERROR 06-25 08:25:15 async_llm_engine.py:41]     return (yield from awaitable.__await__())
ERROR 06-25 08:25:15 async_llm_engine.py:41]             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-25 08:25:15 async_llm_engine.py:41] ray.exceptions.RayTaskError(RuntimeError): ray::RayWorkerVllm.execute_method() (pid=76827, ip=192.168.166.200, actor_id=2cbb20a42c10d73d6607368901000000, repr=<vllm.engine.ray_utils.RayWorkerVllm object at 0x762be2f95110>)
ERROR 06-25 08:25:15 async_llm_engine.py:41]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-25 08:25:15 async_llm_engine.py:41]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-25 08:25:15 async_llm_engine.py:41]   File "/home/aiml/code/ipex-vllm/vllm/vllm/engine/ray_utils.py", line 37, in execute_method
ERROR 06-25 08:25:15 async_llm_engine.py:41]     return executor(*args, **kwargs)
ERROR 06-25 08:25:15 async_llm_engine.py:41]            ^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-25 08:25:15 async_llm_engine.py:41]   File "/home/aiml/code/ipex-vllm/venv/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
ERROR 06-25 08:25:15 async_llm_engine.py:41]     return func(*args, **kwargs)
ERROR 06-25 08:25:15 async_llm_engine.py:41]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 06-25 08:25:15 async_llm_engine.py:41]   File "/home/aiml/code/ipex-vllm/vllm/vllm/worker/worker.py", line 236, in execute_model
ERROR 06-25 08:25:15 async_llm_engine.py:41]     output = self.model_runner.execute_model(seq_group_metadata_list,
ERROR 06-25 08:25:15 async_llm_engine.py:41]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-25 08:25:15 async_llm_engine.py:41]   File "/home/aiml/code/ipex-vllm/venv/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
ERROR 06-25 08:25:15 async_llm_engine.py:41]     return func(*args, **kwargs)
ERROR 06-25 08:25:15 async_llm_engine.py:41]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 06-25 08:25:15 async_llm_engine.py:41]   File "/home/aiml/code/ipex-vllm/vllm/vllm/worker/model_runner.py", line 581, in execute_model
ERROR 06-25 08:25:15 async_llm_engine.py:41]     hidden_states = model_executable(
ERROR 06-25 08:25:15 async_llm_engine.py:41]                     ^^^^^^^^^^^^^^^^^
ERROR 06-25 08:25:15 async_llm_engine.py:41]   File "/home/aiml/code/ipex-vllm/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
ERROR 06-25 08:25:15 async_llm_engine.py:41]     return self._call_impl(*args, **kwargs)
ERROR 06-25 08:25:15 async_llm_engine.py:41]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-25 08:25:15 async_llm_engine.py:41]   File "/home/aiml/code/ipex-vllm/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
ERROR 06-25 08:25:15 async_llm_engine.py:41]     return forward_call(*args, **kwargs)
ERROR 06-25 08:25:15 async_llm_engine.py:41]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-25 08:25:15 async_llm_engine.py:41]   File "/home/aiml/code/ipex-vllm/vllm/vllm/model_executor/models/llama.py", line 337, in forward
ERROR 06-25 08:25:15 async_llm_engine.py:41]     hidden_states = self.model(input_ids, positions, kv_caches,
ERROR 06-25 08:25:15 async_llm_engine.py:41]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-25 08:25:15 async_llm_engine.py:41]   File "/home/aiml/code/ipex-vllm/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
ERROR 06-25 08:25:15 async_llm_engine.py:41]     return self._call_impl(*args, **kwargs)
ERROR 06-25 08:25:15 async_llm_engine.py:41]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-25 08:25:15 async_llm_engine.py:41]   File "/home/aiml/code/ipex-vllm/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
ERROR 06-25 08:25:15 async_llm_engine.py:41]     return forward_call(*args, **kwargs)
ERROR 06-25 08:25:15 async_llm_engine.py:41]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-25 08:25:15 async_llm_engine.py:41]   File "/home/aiml/code/ipex-vllm/vllm/vllm/model_executor/models/llama.py", line 267, in forward
ERROR 06-25 08:25:15 async_llm_engine.py:41]     hidden_states, residual = layer(
ERROR 06-25 08:25:15 async_llm_engine.py:41]                               ^^^^^^
ERROR 06-25 08:25:15 async_llm_engine.py:41]   File "/home/aiml/code/ipex-vllm/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
ERROR 06-25 08:25:15 async_llm_engine.py:41]     return self._call_impl(*args, **kwargs)
ERROR 06-25 08:25:15 async_llm_engine.py:41]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-25 08:25:15 async_llm_engine.py:41]   File "/home/aiml/code/ipex-vllm/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
ERROR 06-25 08:25:15 async_llm_engine.py:41]     return forward_call(*args, **kwargs)
ERROR 06-25 08:25:15 async_llm_engine.py:41]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-25 08:25:15 async_llm_engine.py:41]   File "/home/aiml/code/ipex-vllm/vllm/vllm/model_executor/models/llama.py", line 216, in forward
ERROR 06-25 08:25:15 async_llm_engine.py:41]     hidden_states = self.self_attn(
ERROR 06-25 08:25:15 async_llm_engine.py:41]                     ^^^^^^^^^^^^^^^
ERROR 06-25 08:25:15 async_llm_engine.py:41]   File "/home/aiml/code/ipex-vllm/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
ERROR 06-25 08:25:15 async_llm_engine.py:41]     return self._call_impl(*args, **kwargs)
ERROR 06-25 08:25:15 async_llm_engine.py:41]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-25 08:25:15 async_llm_engine.py:41]   File "/home/aiml/code/ipex-vllm/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
ERROR 06-25 08:25:15 async_llm_engine.py:41]     return forward_call(*args, **kwargs)
ERROR 06-25 08:25:15 async_llm_engine.py:41]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-25 08:25:15 async_llm_engine.py:41]   File "/home/aiml/code/ipex-vllm/venv/lib/python3.11/site-packages/ipex_llm/vllm/xpu/model_convert.py", line 50, in _Attention_forward
ERROR 06-25 08:25:15 async_llm_engine.py:41]     attn_output = self.attn(q, k, v, k_cache, v_cache, input_metadata)
ERROR 06-25 08:25:15 async_llm_engine.py:41]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-25 08:25:15 async_llm_engine.py:41]   File "/home/aiml/code/ipex-vllm/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
ERROR 06-25 08:25:15 async_llm_engine.py:41]     return self._call_impl(*args, **kwargs)
ERROR 06-25 08:25:15 async_llm_engine.py:41]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-25 08:25:15 async_llm_engine.py:41]   File "/home/aiml/code/ipex-vllm/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
ERROR 06-25 08:25:15 async_llm_engine.py:41]     return forward_call(*args, **kwargs)
ERROR 06-25 08:25:15 async_llm_engine.py:41]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-25 08:25:15 async_llm_engine.py:41]   File "/home/aiml/code/ipex-vllm/vllm/vllm/model_executor/layers/attention/attention.py", line 62, in forward
ERROR 06-25 08:25:15 async_llm_engine.py:41]     return self.backend.forward(query, key, value, key_cache, value_cache,
ERROR 06-25 08:25:15 async_llm_engine.py:41]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-25 08:25:15 async_llm_engine.py:41]   File "/home/aiml/code/ipex-vllm/vllm/vllm/model_executor/layers/attention/backends/torch_sdpa.py", line 94, in forward
ERROR 06-25 08:25:15 async_llm_engine.py:41]     out = torch.nn.functional.scaled_dot_product_attention(
ERROR 06-25 08:25:15 async_llm_engine.py:41]           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-25 08:25:15 async_llm_engine.py:41] RuntimeError: Allocation is out of device memory on current platform.
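
I may be off here, but my rough guess is that the ~7200-token prefill goes through torch.nn.functional.scaled_dot_product_attention (per the traceback above) and, if it falls back to the unfused math path, materializes the full attention-score matrix in one go. Back-of-envelope for that allocation (my own assumptions: Llama-3-8B's 32 query heads split across the 4 TP ranks, fp16 scores):

# Rough estimate of the transient attention-score tensor during prefill
# (assumptions as above; not measured from the run).
seq_len = 7200                  # approximate prompt length that triggers the error
query_heads_per_rank = 32 // 4  # 32 query heads over tensor-parallel-size 4
bytes_fp16 = 2

scores_bytes = seq_len * seq_len * query_heads_per_rank * bytes_fp16
print(f"~{scores_bytes / 1024**3:.2f} GiB per layer's attention call per GPU")
# ~0.77 GiB, on top of the weights and the pre-allocated KV cache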

All drivers, etc. are up to date according to apt.

Thanks for any assistance!

nathanodle · Jun 25 '24 08:06