ipex-llm
Slow text generation on dual Arc A770's w/ vLLM
Hello!
Followed the quickstart guide for vLLM serving through the available Docker image. I'm using 2 x Arc A770s in my system. When configured and running on a single GPU, inference speed is fantastic and text generation speed is good (around 14-15t/s and 8-9t/s, respectively). When setting tensor_parallel_size and pipeline_parallel_size to 2 to scale to both GPUs, inference speed doubles, but text generation speed halves, down to 3-4t/s.
Below is my start-vllm-service.sh config:
#!/bin/bash
model="/llm/models/llama-3-1-instruct"
served_model_name="Llama-3.1"
export CCL_WORKER_COUNT=2
export FI_PROVIDER=shm
export CCL_ATL_TRANSPORT=ofi
export CCL_ZE_IPC_EXCHANGE=sockets
export CCL_ATL_SHM=1
export USE_XETLA=OFF
export SYCL_CACHE_PERSISTENT=1
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=2
export TORCH_LLM_ALLREDUCE=0
source /opt/intel/1ccl-wks/setvars.sh
python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server \
--served-model-name $served_model_name \
--port 8000 \
--model $model \
--trust-remote-code \
--gpu-memory-utilization 0.9 \
--device xpu \
--dtype float16 \
--enforce-eager \
--load-in-low-bit sym_int4 \
--max-model-len 8192 \
--max-num-batched-tokens 10000 \
--max-num-seqs 256 \
--block-size 8 \
--tensor-parallel-size 2 \
--pipeline-parallel-size 2
Maybe I'm missing something, maybe I'm not. I did read that setting SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS to 1 gives a performance boost, but I set it back to 2 during troubleshooting.
Thanks for taking the time to read! Hoping someone has an answer.
Just wanted to update and say that I removed the --pipeline-parallel-size line, as it was throwing errors about device_ids, but text generation speed still hasn't gone above 5t/s.
Hi, I am trying to reproduce this issue in my environment. Will post updates to this thread.
Hi, could you tell me how you tested the performance mentioned in the thread (around 14-15t/s and 8-9t/s, respectively)?
Also could you share the results of this script?
https://github.com/intel-analytics/ipex-llm/blob/main/python/llm/scripts/env-check.sh
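For example, a minimal way to fetch and run it from inside the serving container (a sketch; the raw URL below is simply the raw-file form of the link above):
# Fetch the environment-check script and run it (sketch).
wget https://raw.githubusercontent.com/intel-analytics/ipex-llm/main/python/llm/scripts/env-check.sh
bash env-check.sh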
Hi there. Here is the output of the script:
PYTHON_VERSION=3.11.10
transformers=4.44.2
torch=2.1.0.post2+cxx11.abi
ipex-llm Version: 2.2.0b20241011
ipex=2.1.30.post0
CPU Information:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 36
On-line CPU(s) list: 0-35
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) CPU E5-2695 v4 @ 2.10GHz
CPU family: 6
Model: 79
Thread(s) per core: 2
Core(s) per socket: 18
Socket(s): 1
Stepping: 1
CPU max MHz: 3300.0000
CPU min MHz: 1200.0000
BogoMIPS: 4197.80
Total CPU Memory: 125.722 GB
Operating System: Ubuntu 22.04.4 LTS
Linux neutronserver 6.8.12-Unraid #3 SMP PREEMPT_DYNAMIC Tue Jun 18 07:52:57 PDT 2024 x86_64 x86_64 x86_64 GNU/Linux
CLI: Version: 1.2.13.20230704 Build ID: 00000000
Service: Version: 1.2.13.20230704 Build ID: 00000000 Level Zero Version: 1.14.0
Driver Version 2024.17.5.0.08_160000.xmain-hotfix
Driver Version 2024.17.5.0.08_160000.xmain-hotfix
Driver UUID 32332e33-352e-3237-3139-312e39000000
Driver Version 23.35.27191.9
Driver UUID 32332e33-352e-3237-3139-312e39000000
Driver Version 23.35.27191.9
Driver related package version:
ii intel-level-zero-gpu 1.3.27191.9 amd64 Intel(R) Graphics Compute Runtime for oneAPI Level Zero.
ii level-zero-dev 1.14.0-744~22.04 amd64 Intel(R) Graphics Compute Runtime for oneAPI Level Zero.
igpu not detected
xpu-smi is properly installed.
+-----------+--------------------------------------------------------------------------------------+
| Device ID | Device Information                                                                   |
+-----------+--------------------------------------------------------------------------------------+
| 0         | Device Name: Intel Corporation Device 56a0 (rev 08)                                  |
|           | Vendor Name: Intel(R) Corporation                                                    |
|           | UUID: 00000000-0000-0003-0000-000856a08086                                           |
|           | PCI BDF Address: 0000:03:00.0                                                        |
|           | DRM Device: /dev/dri/card1                                                           |
|           | Function Type: physical                                                              |
+-----------+--------------------------------------------------------------------------------------+
| 1         | Device Name: Intel Corporation Device 56a0 (rev 08)                                  |
|           | Vendor Name: Intel(R) Corporation                                                    |
|           | UUID: 00000000-0000-0009-0000-000856a08086                                           |
|           | PCI BDF Address: 0000:09:00.0                                                        |
|           | DRM Device: /dev/dri/card0                                                           |
|           | Function Type: physical                                                              |
+-----------+--------------------------------------------------------------------------------------+
lspci: Unable to load libkmod resources: error -2
GPU0 Memory size=16G
GPU1 Memory size=16G
lspci: Unable to load libkmod resources: error -2
03:00.0 VGA compatible controller: Intel Corporation Device 56a0 (rev 08) (prog-if 00 [VGA controller])
    Subsystem: Device 172f:4133
    Flags: bus master, fast devsel, latency 0, IRQ 69, NUMA node 0, IOMMU group 59
    Memory at fa000000 (64-bit, non-prefetchable) [size=16M]
    Memory at 383800000000 (64-bit, prefetchable) [size=16G]
    Expansion ROM at fb000000 [disabled] [size=2M]
    Capabilities: [40] Vendor Specific Information: Len=0c <?>
    Capabilities: [70] Express Endpoint, MSI 00
    Capabilities: [ac] MSI: Enable+ Count=1/1 Maskable+ 64bit+
09:00.0 VGA compatible controller: Intel Corporation Device 56a0 (rev 08) (prog-if 00 [VGA controller])
    Subsystem: Device 172f:4133
    Flags: bus master, fast devsel, latency 0, IRQ 66, NUMA node 0, IOMMU group 52
    Memory at f8000000 (64-bit, non-prefetchable) [size=16M]
    Memory at 383000000000 (64-bit, prefetchable) [size=16G]
    Expansion ROM at f9000000 [disabled] [size=2M]
    Capabilities: [40] Vendor Specific Information: Len=0c <?>
    Capabilities: [70] Express Endpoint, MSI 00
    Capabilities: [ac] MSI: Enable+ Count=1/1 Maskable+ 64bit+
Checked the token performance by loading up Llama-3.1 8B and running the prompt "Tell me about yourself" 3 times, to determine performance after warmup.
Apologies, the dual GPU scores were determined the same way as the single GPU scores: "Tell me about yourself" as a prompt, 3 separate times.
Single GPU: Inference 14t/s, Text Gen 8t/s
Dual GPU: Inference 30-50t/s (I've seen up to 50, was crazy), Text Gen 4-5t/s
Hi, can you check if you have installed the out-of-tree driver on the host?
You can check it through the following command:
apt list | grep i915
# WARNING: apt does not have a stable CLI interface. Use with caution in scripts.
# Example output if this has been installed...
# intel-i915-dkms/unknown 1.23.10.72.231129.76+i112-1 all [upgradable from: 1.23.10.54.231129.55+i87-1]
If you have not installed the out-of-tree driver, you can install it with our guide: https://github.com/intel-analytics/ipex-llm/blob/main/docs/mddocs/Quickstart/install_linux_gpu.md#install-gpu-driver
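For reference, a condensed sketch of the steps from that guide for Ubuntu 22.04 (follow the guide itself for the authoritative, up-to-date commands; the repository suite name here is an assumption):
# Sketch: install the out-of-tree driver on Ubuntu 22.04 (jammy) from Intel's graphics repository.
wget -qO - https://repositories.intel.com/gpu/intel-graphics.key | \
  sudo gpg --yes --dearmor --output /usr/share/keyrings/intel-graphics.gpg
echo "deb [arch=amd64 signed-by=/usr/share/keyrings/intel-graphics.gpg] https://repositories.intel.com/gpu/ubuntu jammy unified" | \
  sudo tee /etc/apt/sources.list.d/intel-gpu-jammy.list
sudo apt-get update
sudo apt-get install -y intel-i915-dkms intel-fw-gpu
sudo reboot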
Besides, could you provide me with an executable bash script? I want to ensure that the commands I execute are exactly the same as yours, including the server startup and testing scripts, as well as the scripts for running inference and text generation.
The i915 driver is loaded on my unRAID host system. Other functions such as transcoding on GPU can be seen using the driver, so I know it is working.
The exact script I use to run vLLM is below:
#!/bin/bash
model="/llm/models/llama-3-1-instruct"
served_model_name="Llama-3.1"
export CCL_WORKER_COUNT=2
export FI_PROVIDER=shm
export CCL_ATL_TRANSPORT=ofi
export CCL_ZE_IPC_EXCHANGE=sockets
export CCL_ATL_SHM=1
export CCL_TOPO_FABRIC_VERTEX_CONNECTION_CHECK=0
export USE_XETLA=OFF
export SYCL_CACHE_PERSISTENT=1
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export TORCH_LLM_ALLREDUCE=0
source /opt/intel/1ccl-wks/setvars.sh
python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server \
--served-model-name $served_model_name \
--port 8000 \
--model $model \
--trust-remote-code \
--block-size 8 \
--gpu-memory-utilization 0.9 \
--device xpu \
--dtype float16 \
--enforce-eager \
--enable-prefix-caching \
--enable-chunked-prefill \
--use-v2-block-manager \
--load-in-low-bit fp8 \
--max-model-len 30000 \
--max-num-batched-tokens 40000 \
--max-num-seqs 512 \
--tensor-parallel-size 2
Ignore the high max model length, batched tokens and seqs; I've found that they do very little to change performance. I've also found that the --load-in-low-bit setting doesn't make a noticeable difference to the generated tokens either.
Could you please share the client script/code you use to send requests to the vLLM API server, so we can measure the inference/text generation TPS the same way you do?
I use OpenWebUI to connect to the vLLM OpenAI-compatible API server, so requests follow the standard scheme from there. This is the log of what is sent to the server when a chat request comes in from OpenWebUI:
INFO 10-17 09:59:17 logger.py:36] Received request chat-963e4e5e5afe443292c41933f907f9a7: prompt: '<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 Jul 2024\n\nYou are a world class A.I. model designed to generate high quality, reliable and true responses.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nTell me about yourself<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.7, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=29941, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [128000, 128006, 9125, 128007, 271, 38766, 1303, 33025, 2696, 25, 6790, 220, 2366, 18, 198, 15724, 2696, 25, 220, 1627, 10263, 220, 2366, 19, 271, 2675, 527, 264, 1917, 538, 362, 2506, 13, 1646, 6319, 311, 7068, 1579, 4367, 11, 15062, 323, 837, 14847, 13, 128009, 128006, 882, 128007, 271, 41551, 757, 922, 6261, 128009, 128006, 78191, 128007, 271], lora_request: None, prompt_adapter_request: None.
Ignore the goofy 'in hoodlum speak'; that's for generating conversation topic names in OpenWebUI, which I found funny.
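If it's useful, here is roughly the same request made by hand with curl (a sketch only; the model name and message content are taken from the log above, and max_tokens is capped for a quick test):
# Reproduce the logged chat request against the OpenAI-compatible endpoint (sketch).
# Wall-clock time plus the reported completion_tokens gives a rough tokens/s figure.
time curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Llama-3.1",
        "messages": [
          {"role": "system", "content": "You are a world class A.I. model designed to generate high quality, reliable and true responses."},
          {"role": "user", "content": "Tell me about yourself"}
        ],
        "temperature": 0.7,
        "max_tokens": 256
      }' | python3 -m json.tool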
Hi, I have tested the performance of the vLLM serving engine using your prompt. What we got is:
# Single card
first token: 100.4761297517689
next token: 15.190188343637834
# Multi card
first token: 69.21667874485138
next token: 21.78218797010891
The command for starting the test is listed below:
# For starting server:
#!/bin/bash
model="/llm/models/Meta-Llama-3.1-8B-Instruct/"
served_model_name="Llama-3.1"
export CCL_WORKER_COUNT=2
export FI_PROVIDER=shm
export CCL_ATL_TRANSPORT=ofi
export CCL_ZE_IPC_EXCHANGE=sockets
export CCL_ATL_SHM=1
export USE_XETLA=OFF
export SYCL_CACHE_PERSISTENT=1
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=2
export TORCH_LLM_ALLREDUCE=0
source /opt/intel/1ccl-wks/setvars.sh
python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server \
--served-model-name $served_model_name \
--port 8000 \
--model $model \
--trust-remote-code \
--gpu-memory-utilization 0.95 \
--device xpu \
--dtype float16 \
--enforce-eager \
--load-in-low-bit sym_int4 \
--max-model-len 8192 \
--max-num-batched-tokens 10000 \
--max-num-seqs 256 \
--block-size 8 \
--tensor-parallel-size 2 # Change this to 1 for single card serving...
For testing, you can use the vllm_online_benchmark.py script in the container. Change this line to your model path: https://github.com/intel-analytics/ipex-llm/blob/main/docker/llm/serving/xpu/docker/vllm_online_benchmark.py#L435
And change this line https://github.com/intel-analytics/ipex-llm/blob/main/docker/llm/serving/xpu/docker/vllm_online_benchmark.py#L459 to the prompt.
For instance: PROMPT = "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 Jul 2024\n\nYou are a world class A.I. model designed to generate high quality, reliable and true responses.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nTell me about yourself<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
Besides, to test the performance of the vLLM engine, we recommend using vllm_online_benchmark.py or benchmark_vllm_throughput.py.
Is that testing the inference performance or the text generation performance? I can't seem to get the vllm_online_benchmark working, just keeps throwing 404 errors on the server as far as I can tell.
My issue is with text generation speed. The script you provided did gain an extra token per second, but as far as I've seen things, adding an extra GPU has just slowed things down overall.
> Is that testing the inference performance or the text generation performance? I can't seem to get the vllm_online_benchmark working, just keeps throwing 404 errors on the server as far as I can tell.
> My issue is with text generation speed. The script you provided did gain an extra token per second, but as far as I've seen things, adding an extra GPU has just slowed things down overall.
This test is testing the text generation performance. As far as I can tell, OpenWebUI just sends requests to the vLLM OpenAI API server, which is exactly what vllm_online_benchmark.py does.
The 404 error might be caused by:
- A proxy. Try setting export no_proxy="127.0.0.1,localhost".
- How vllm_online_benchmark.py is started. Use the command python3 vllm_online_benchmark.py Llama-3.1 1, and also apply the two modifications I mentioned in the previous reply (the sketch below combines these).
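Putting those two points together (a sketch; the location of vllm_online_benchmark.py inside the container is an assumption):
# Run the benchmark from inside the serving container against the already-running server (sketch).
export no_proxy="127.0.0.1,localhost"
cd /llm                                   # assumed location of vllm_online_benchmark.py in the image
python3 vllm_online_benchmark.py Llama-3.1 1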
> Is that testing the inference performance or the text generation performance? I can't seem to get the vllm_online_benchmark working, just keeps throwing 404 errors on the server as far as I can tell.
> My issue is with text generation speed. The script you provided did gain an extra token per second, but as far as I've seen things, adding an extra GPU has just slowed things down overall.
Adding a GPU does benefit the computation, but it also brings additional communication overhead. For 7B/8B/9B LLM models, the computation saved on each next token is not enough to outweigh the communication overhead, so next-token latency will be slightly slower on dual cards than on a single card, while the first token will be a lot faster. The slowdown you describe is what you observe with batch=1, short inputs, and next-token latency accounting for a large portion of overall performance. Once you increase the batch size, i.e. serve multiple requests in parallel, or use longer inputs, you will find that overall throughput is better on dual cards than on a single card.
BTW, for 7B/8B/9B LLM models, if you don't need to serve super long inputs or need faster first-token latency, we recommend deploying one instance per card for the LLM service; by deploying multiple instances, we retain the next-token latency advantage of a single card and get 2x throughput. For ~14B models we recommend using two Arc A770s, and for ~33B models we recommend using four Arc A770s.
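As a rough illustration of the one-instance-per-card layout (a sketch only, reusing the flags from the single-card script above; pinning each process with ZE_AFFINITY_MASK is an assumption about your container setup):
#!/bin/bash
# Sketch: two independent single-card vLLM instances instead of tensor parallelism.
# Each process is pinned to one A770 via ZE_AFFINITY_MASK and serves its own port;
# clients (or a load balancer) spread requests across the two ports.
source /opt/intel/1ccl-wks/setvars.sh

ZE_AFFINITY_MASK=0 python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server \
  --served-model-name Llama-3.1 --model /llm/models/llama-3-1-instruct \
  --device xpu --dtype float16 --enforce-eager --load-in-low-bit sym_int4 \
  --max-model-len 8192 --tensor-parallel-size 1 --port 8000 &

ZE_AFFINITY_MASK=1 python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server \
  --served-model-name Llama-3.1 --model /llm/models/llama-3-1-instruct \
  --device xpu --dtype float16 --enforce-eager --load-in-low-bit sym_int4 \
  --max-model-len 8192 --tensor-parallel-size 1 --port 8001 &

wait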
Thank you very much for this advice.
I did notice something very strange though. I decided to spin up a Llama.cpp instance using the ipex-llm guides and noticed SIGNIFICANTLY faster generation, which was in fact utilising both cards to generate. Inference speed was not as fast as vLLM, however text generation was the fastest I had seen overall.
Why might this be happening? I can achieve incredible results with llama.cpp, but vLLM, which should be faster, is performing at 4-5 tokens per second for text generation with the same model loaded.
Hi @HumerousGorgon, what's the difference between the inference and text generation numbers you mention here? Do you observe data like the below, or are there other metrics that help you differentiate between inference and text generation?
# Single card
first token: 100.4761297517689
next token: 15.190188343637834
# Multi card
first token: 69.21667874485138
next token: 21.78218797010891
I'm going to run those benchmarks now to see whether my performance is in line with the numbers here. My problem is not with inference, it's with the generation of text. With vLLM it was extremely slow, a couple of words every second (4-5t/s) but with Llama.cpp and the exact same model, I was seeing it generate whole sentences in a second.
We used the same method, with OpenWebUI as the front end, to deploy the llama3.1-8b model. The text generation speed in the web interface is very fast, and it responds immediately after sending questions. Specifically, the tests show about 40t/s on one card, while on two cards it can stably reach more than 45t/s. For a 128-token prompt, the test results are as follows:
- one card
- two cards
Okay, something very wrong is happening with my vLLM instance, because I am seeing NOWHERE near those numbers. You are getting roughly 10x the performance I am. Could you share your script to start it?
We have updated our openwebui + vllm-serving workflow here. The Docker start script is here; please update the docker image before starting it. The frontend and backend startup scripts are as follows; note: change <api-key> to any string and <vllm-host-ip> to your host IPv4 address:
vLLM Serving start script:
#!/bin/bash
model="/llm/models/Meta-Llama-3.1-8B-Instruct"
served_model_name="Meta-Llama-3.1-8B-Instruct"
export CCL_WORKER_COUNT=2
export FI_PROVIDER=shm
export CCL_ATL_TRANSPORT=ofi
export CCL_ZE_IPC_EXCHANGE=sockets
export CCL_ATL_SHM=1
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=2
export TORCH_LLM_ALLREDUCE=0
source /opt/intel/1ccl-wks/setvars.sh
python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server \
--served-model-name $served_model_name \
--port 8000 \
--model $model \
--trust-remote-code \
--gpu-memory-utilization 0.9 \
--device xpu \
--dtype float16 \
--enforce-eager \
--load-in-low-bit fp8 \
--max-model-len 2048 \
--max-num-batched-tokens 4000 \
--max-num-seqs 12 \
--api-key <api-key> \
--tensor-parallel-size 2
open-webui start script:
#!/bin/bash
export DOCKER_IMAGE=ghcr.io/open-webui/open-webui:main
export CONTAINER_NAME=open-webui
docker rm -f $CONTAINER_NAME
docker run -itd \
-p 3000:8080 \
-e OPENAI_API_KEY=<api-key> \
-e OPENAI_API_BASE_URL=http://<vllm-host-ip>:8000/v1 \
-v open-webui:/app/backend/data \
--name $CONTAINER_NAME \
--restart always $DOCKER_IMAGE
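Once both containers are up, a quick way to confirm Open WebUI can reach the backend with the configured key (a sketch; substitute the same placeholders as above):
# List the served models through the OpenAI-compatible endpoint using the API key (sketch).
curl -H "Authorization: Bearer <api-key>" http://<vllm-host-ip>:8000/v1/models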
Hey there.
I used this script exactly as you had put it, updated my container to the latest version... 3-4 tokens per second for text generation. I am starting to wonder if this is an issue with Docker rather than an issue with vLLM. I'm going to configure a VM with the GPUs passed through and see if I can fix this issue.
As you can see, even with the newest branch, this is just flat out not working the way everyone else's is. I'm using unRAID as my host machine. There is one difference: in order to get my GPUs working, I have to pass two environment variables: -e OverrideGpuAddressSpace=48 and -e NEOReadDebugKeys=1.
This is the only fundamental difference here. I'm starting to lose my mind. I'm also wondering whether my Xeon E5-2695v4 is limiting the GPUs. I have ReBAR enabled, and GPU acceleration works in other areas, such as video encoding.
Hi @HumerousGorgon
We recommend setting up your environment with Ubuntu 22.04 and kernel 6.5, and installing the intel-i915-dkms driver for optimal performance. Once you have the Ubuntu OS set up, please follow the steps below to install kernel 6.5:
export VERSION="6.5.0-35"
sudo apt-get install -y linux-headers-$VERSION-generic
sudo apt-get install -y linux-image-$VERSION-generic
sudo apt-get install -y linux-modules-$VERSION-generic # may not be needed
sudo apt-get install -y linux-modules-extra-$VERSION-generic
After installing, you can configure GRUB to use the new kernel by running the following:
sudo sed -i "s/GRUB_DEFAULT=.*/GRUB_DEFAULT=\"1> $(echo $(($(awk -F\' '/menuentry / {print $2}' /boot/grub/grub.cfg \
| grep -no $VERSION | sed 's/:/\n/g' | head -n 1)-2)))\"/" /etc/default/grub
Then, update GRUB and reboot your system:
sudo update-grub
sudo reboot
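After the reboot, you can confirm the new kernel is the one running:
# Should report the 6.5.0-35-generic kernel installed above.
uname -r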
For detailed instructions on installing the driver, you can follow this link:
https://github.com/intel-analytics/ipex-llm/blob/main/docs/mddocs/Quickstart/install_linux_gpu.md#for-linux-kernel-65
Please feel free to reach out if you have any questions!
Please also make sure "re-sizeable BAR support" and "above 4G mmio" are enabled in the BIOS.
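One way to sanity-check this from Linux (a sketch; the BDF addresses are the ones from the lspci output earlier in this thread):
# With resizable BAR active, each card's large prefetchable BAR should cover the full 16G of VRAM.
sudo lspci -v -s 03:00.0 | grep -i prefetchable
sudo lspci -v -s 09:00.0 | grep -i prefetchable

# Negotiated PCIe link speed/width per card (LnkSta), to rule out a degraded link.
sudo lspci -vv -s 03:00.0 | grep -i "LnkSta:"
sudo lspci -vv -s 09:00.0 | grep -i "LnkSta:"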
Okay, so I set up an Ubuntu host using 22.04 and the 6.5 Kernel. I also used the i915-dkms driver. I'm now seeing 15 tokens per second, which is 3x the speed of my previous config, but still 3x slower than the numbers reported here.
I'm truly at a loss here and unsure of how to go on.
It might be related to the CPU/GPU frequency. You can try adjusting the CPU/GPU frequency to see if it has any impact.
For CPU frequency, you can use sudo cpupower frequency-info to check the frequency range, and then set it using sudo cpupower frequency-set -d 3.8GHz. In our case, the output of sudo cpupower frequency-info was:
analyzing CPU 30:
driver: intel_pstate
CPUs which run at the same hardware frequency: 30
CPUs which need to have their frequency coordinated by software: 30
maximum transition latency: Cannot determine or is not supported.
hardware limits: 800 MHz - 4.50 GHz
available cpufreq governors: performance powersave
current policy: frequency should be within 3.80 GHz and 4.50 GHz.
The governor "powersave" may decide which speed to use
within this range.
current CPU frequency: Unable to call hardware
current CPU frequency: 3.80 GHz (asserted by call to kernel)
boost state support:
Supported: yes
Active: yes
We set the CPU frequency to 3.8GHz for optimal performance:
sudo cpupower frequency-set -d 3.8GHz
For GPU, you can set the frequency using the following commands:
sudo xpu-smi config -d 0 -t 0 --frequencyrange 2400,2400
sudo xpu-smi config -d 1 -t 0 --frequencyrange 2400,2400
sudo xpu-smi config -d 2 -t 0 --frequencyrange 2400,2400
sudo xpu-smi config -d 3 -t 0 --frequencyrange 2400,2400
Let us know if adjusting the frequencies helps improve the performance!
> We used the same method, with OpenWebUI as the front end, to deploy the llama3.1-8b model. The text generation speed in the web interface is very fast, and it responds immediately after sending questions. Specifically, the tests show about 40t/s on one card, while on two cards it can stably reach more than 45t/s. For a 128-token prompt, the test results are as follows:
> - one card
> - two cards
So I went ahead and purchased a WHOLE ENTIRE NEW SYSTEM: an i5-11600KF, Z590 Vision D, 32GB of 3200MHz C16 RAM, same 2 Arc A770s... 11 tokens per second on text generation! What am I doing wrong?! 6.5 kernel, ReBAR enabled, everything down to the letter with what's been shown. Is it the fact that I'm using the straight Llama-3.1 model from Meta, no quantisation? I have no idea anymore...
Also, removing tensor parallelism results in 22 tokens per second, so I'm still seeing half the performance that others here are reporting.
And finally, setting the CPU frequency to 3.8GHz did, for a single second, push the output speed to 40 tokens per second, but it then dropped back down to 20 tokens per second.
Finally finally, I changed the governor to performance; still no change.