FastChat
Slower inference with vLLM worker on 4 A100
I deployed WizardLM-70B, a fine-tuned variant of Llama-2-70B, on 4 A100s (80 GB) using the vLLM worker. I noticed much slower responses (more than a minute even for a simple prompt like "Hi") at a throughput of 0.2 tok/sec. Tensor parallelism was set to 4 in this case.
When I deployed the same model on 2 A100s (80 GB), I saw much higher throughput and lower latency, around 700 tok/sec. Why is this so? I assumed that using 4 A100s would deliver higher throughput and lower latency, since tensor parallelism is 4 in that case and there is also a lot more GPU KV cache available. Does anyone have an explanation, or am I doing something wrong?
While that would be true for games, for LLMs it is not true that more GPUs == more performance. It turns out there is a lot of data movement between different parts of the model during inference, and this goes over PCI Express or NVLink, which is orders of magnitude slower than movement within a single GPU's memory.
Try the same thing with smaller models: one GPU, then two, then four. You will see a drastic performance reduction as you scale up your rig.
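Something like this, run once per tensor-parallel size, gives a quick number (a rough sketch using vLLM's offline API; the model name, prompts, and token counts are just placeholders):

# Rough benchmark sketch: run once per tensor-parallel size, e.g.
#   python bench_tp.py --tp 1, then --tp 2, then --tp 4
import argparse
import time

from vllm import LLM, SamplingParams

parser = argparse.ArgumentParser()
parser.add_argument("--tp", type=int, default=1, help="tensor-parallel size")
parser.add_argument("--model", default="WizardLM/WizardLM-70B-V1.0")
args = parser.parse_args()

prompts = ["Hi"] * 8  # a handful of short prompts is enough to spot a big regression
sampling = SamplingParams(temperature=0.0, max_tokens=128)

llm = LLM(model=args.model, tensor_parallel_size=args.tp)

start = time.time()
outputs = llm.generate(prompts, sampling)
elapsed = time.time() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"tp={args.tp}: {generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.1f} tok/s")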
@tacacs1101-debug @surak
I'm the maintainer of LiteLLM; we provide an open-source proxy for load balancing across vLLM + Azure + OpenAI. It can process 500+ requests/second.
From the thread it looks like you're trying to maximize throughput (I'd love feedback if you're trying to do this).
Here's the quick start:
Doc: https://docs.litellm.ai/docs/simple_proxy#load-balancing---multiple-instances-of-1-model
Step 1: Create a config.yaml
model_list:
  - model_name: gpt-4
    litellm_params:
      model: azure/chatgpt-v-2
      api_base: https://openai-gpt-4-test-v-1.openai.azure.com/
      api_version: "2023-05-15"
      api_key:
  - model_name: gpt-4
    litellm_params:
      model: azure/gpt-4
      api_key:
      api_base: https://openai-gpt-4-test-v-2.openai.azure.com/
  - model_name: gpt-4
    litellm_params:
      model: azure/gpt-4
      api_key:
      api_base: https://openai-gpt-4-test-v-2.openai.azure.com/
Step 2: Start the litellm proxy:
litellm --config /path/to/config.yaml
Step 3: Make a request to the LiteLLM proxy:
curl --location 'http://0.0.0.0:8000/chat/completions' \
--header 'Content-Type: application/json' \
--data '{
  "model": "gpt-4",
  "messages": [
    {
      "role": "user",
      "content": "what llm are you"
    }
  ]
}'
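The same request can be made from Python, since the proxy exposes an OpenAI-compatible API (a sketch assuming the openai Python package with the v1-style client and the proxy running at http://0.0.0.0:8000 as started above):

from openai import OpenAI

# The proxy holds the real provider keys, so the client-side key can be a placeholder.
client = OpenAI(base_url="http://0.0.0.0:8000", api_key="anything")

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "what llm are you"}],
)
print(response.choices[0].message.content)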
I am not sure this applies here. The OP is talking about local inference on a single compute node with 4 GPUs. Are we talking about the same thing?
a throughput of 0.2 tok/sec.
@tacacs1101-debug This doesn't seem correct. Can you provide the commands to reproduce?
@surak Absolutely correct, I am talking about local inference on a single compute node with 4 A100s (80 GB). Our throughput is good even on 2 A100s, but my assumption was that with 4 A100s we could increase the degree of tensor parallelism to 4, and that this would translate into some reduction in latency. I have also noticed that with 3 A100s there is no difference in throughput or latency, and the third GPU is almost unutilized, since in that case I have to forcibly set the degree of tensor parallelism to 2. I understand that adding GPUs does not automatically increase performance because of the communication overhead between them, but throughput should not drop as drastically as from ~500 tok/sec to 0.2 tok/sec.
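(A guess as to why the third GPU sits idle, if I understand vLLM correctly: it shards attention heads across GPUs, so tensor_parallel_size has to divide the head count evenly. Assuming the standard Llama-2-70B configuration with 64 attention heads, 3 is simply not a valid setting:)

num_attention_heads = 64  # assumption: standard Llama-2-70B config
valid_tp_sizes = [tp for tp in range(1, 9) if num_attention_heads % tp == 0]
print(valid_tp_sizes)  # [1, 2, 4, 8] -- no 3, which is why TP has to be forced down to 2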
@infwinston I am using a Helm chart for deployment, with a custom Docker image, but the command is similar to
python3 -m fastchat.serve.cli --model-path WizardLM/WizardLM-70B-V1.0 --num-gpus 4
This command does not use vLLM, so it will be slow:
python3 -m fastchat.serve.cli --model-path WizardLM/WizardLM-70B-V1.0 --num-gpus 4
You have to use the vLLM worker for better tensor-parallelism speed. See https://github.com/lm-sys/FastChat/blob/main/docs/vllm_integration.md
Or did I misunderstand?
@infwinston Actually I mentioned the wrong command. The corresponding command is:
python3 -m fastchat.serve.vllm_worker --model-path WizardLM/WizardLM-70B-V1.0 --num-gpus 4
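A quick way to sanity-check the generation speed of that worker through the OpenAI-compatible server (a sketch, assuming fastchat.serve.controller and fastchat.serve.openai_api_server are also running on the default port 8000, and that the worker registered the model as WizardLM-70B-V1.0):

import time
import requests

payload = {
    "model": "WizardLM-70B-V1.0",  # assumed registered model name
    "messages": [{"role": "user", "content": "Hi"}],
    "max_tokens": 128,
}

start = time.time()
resp = requests.post("http://localhost:8000/v1/chat/completions", json=payload)
elapsed = time.time() - start

completion_tokens = resp.json()["usage"]["completion_tokens"]
print(f"{completion_tokens} tokens in {elapsed:.1f}s -> {completion_tokens / elapsed:.1f} tok/s")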
how's the inference speed?
There was a hardware issue with 2 of the GPUs: they were not able to communicate with each other. I checked the NVLink status and then upgraded the GPU drivers to resolve it.
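For anyone who hits the same thing: a quick way to spot GPUs that cannot reach each other is a peer-to-peer access check, e.g. this minimal PyTorch sketch (nvidia-smi topo -m shows the same information):

import torch

n = torch.cuda.device_count()
for i in range(n):
    for j in range(n):
        if i != j and not torch.cuda.can_device_access_peer(i, j):
            print(f"GPU {i} has no peer-to-peer access to GPU {j}")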