
Slower inference with vLLM worker on 4 A100

Open tacacs1101-debug opened this issue 1 year ago • 9 comments

I deployed WizardLM-70B, a fine-tuned variant of Llama 2 70B, on 4 A100 (80 GB) using the vLLM worker. I noticed a much slower response (more than a minute even for a simple prompt like "Hi") at a throughput of 0.2 tok/sec. My tensor parallelism was set to 4 in this case.

When I deployed the same model on 2 A100 (80 GB), I noticed much higher throughput and lower latency: I achieved ~700 tok/sec. Why is this so? I assumed that using 4 A100 would deliver much higher throughput and lower latency, because tensor parallelism in this case is 4 and I also have a lot more GPU KV cache. Does anyone have an explanation, or am I doing something wrong?

tacacs1101-debug avatar Nov 28 '23 16:11 tacacs1101-debug

While that would be true for games, for LLMs it is not true that more GPUs == more performance. It turns out there is a lot of data movement among the different parts of a model during inference, and this goes over PCI Express or NVLink, which is orders of magnitude slower than moving data within a single GPU's memory.

Check with smaller models and try the same thing: one GPU, then two, then four. You will see a drastic performance reduction when you scale your rig.
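
For example, one rough way to measure this (a sketch, assuming vLLM is installed; the model name is just an example of a small model that fits on a single GPU) is to run the same generation at different tensor-parallel degrees and compare tokens/sec:

# bench.py -- run once per GPU count, e.g. `python3 bench.py 1`, then 2, then 4
import sys
import time

from vllm import LLM, SamplingParams

# Tensor-parallel degree = number of GPUs to shard the model across.
tp_size = int(sys.argv[1]) if len(sys.argv) > 1 else 1
llm = LLM(model="lmsys/vicuna-7b-v1.5", tensor_parallel_size=tp_size)

params = SamplingParams(temperature=0.0, max_tokens=256)
start = time.time()
outputs = llm.generate(["Explain tensor parallelism in one paragraph."], params)
elapsed = time.time() - start

n_tokens = len(outputs[0].outputs[0].token_ids)
print(f"TP={tp_size}: {n_tokens / elapsed:.1f} tok/sec")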

surak avatar Nov 29 '23 17:11 surak

@tacacs1101-debug @surak

I'm the maintainer of LiteLLM. We provide an open-source proxy for load balancing vLLM + Azure + OpenAI; it can process 500+ requests/second.

From the thread it looks like you're trying to maximize throughput (I'd love feedback if you're trying to do this).

Here's the quick start:

Doc: https://docs.litellm.ai/docs/simple_proxy#load-balancing---multiple-instances-of-1-model

Step 1: Create a config.yaml

model_list:
  - model_name: gpt-4
    litellm_params:
      model: azure/chatgpt-v-2
      api_base: https://openai-gpt-4-test-v-1.openai.azure.com/
      api_version: "2023-05-15"
      api_key: 
  - model_name: gpt-4
    litellm_params:
      model: azure/gpt-4
      api_key: 
      api_base: https://openai-gpt-4-test-v-2.openai.azure.com/
  - model_name: gpt-4
    litellm_params:
      model: azure/gpt-4
      api_key: 
      api_base: https://openai-gpt-4-test-v-2.openai.azure.com/

Step 2: Start the litellm proxy:

litellm --config /path/to/config.yaml

Step 3: Make a request to the LiteLLM proxy:

curl --location 'http://0.0.0.0:8000/chat/completions' \
--header 'Content-Type: application/json' \
--data ' {
      "model": "gpt-4",
      "messages": [
        {
          "role": "user",
          "content": "what llm are you"
        }
      ]
    }
'
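
The same request can also be sent with the OpenAI Python SDK pointed at the proxy (a sketch, assuming the proxy from Step 2 is running on localhost:8000 and no master key is configured, so the api_key is just a placeholder):

from openai import OpenAI

# The LiteLLM proxy exposes an OpenAI-compatible API, so the standard client works.
client = OpenAI(base_url="http://0.0.0.0:8000", api_key="placeholder")

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "what llm are you"}],
)
print(response.choices[0].message.content)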

ishaan-jaff avatar Nov 29 '23 18:11 ishaan-jaff

I am not sure this applies here. The OP is talking about local inference on a single compute node with 4 GPUs. Are we talking about the same thing?

surak avatar Nov 29 '23 18:11 surak

a throughput of 0.2 tok/sec.

@tacacs1101-debug this doesn't seem correct. Can you provide commands to reproduce?

infwinston avatar Nov 29 '23 20:11 infwinston

@surak Absolutely correct, I am talking about local inference on a single compute node with 4 A100 (80 GB). Our throughput is good even on 2 A100, but my assumption was that with 4 A100s we could increase the degree of tensor parallelism to 4, and that would translate into some reduction in latency. I have also noticed that with 3 A100 there is no difference in throughput or latency, and the third GPU is almost unutilized, since in that case I have to forcibly set the degree of tensor parallelism to 2. I understand that increasing the number of GPUs doesn't automatically translate into higher performance because of the communication overhead, but it should not drop as drastically as from ~500 tok/sec to 0.2 tok/sec.

tacacs1101-debug avatar Nov 30 '23 08:11 tacacs1101-debug

@infwinston I am using a Helm chart for deployment with a custom Docker image, but the command is similar to:

python3 -m fastchat.serve.cli --model-path WizardLM/WizardLM-70B-V1.0 --num-gpus 4

tacacs1101-debug avatar Nov 30 '23 08:11 tacacs1101-debug

This command does not use vLLM, so it will be slow:

python3 -m fastchat.serve.cli --model-path WizardLM/WizardLM-70B-V1.0 --num-gpus 4

You have to use the vLLM worker for better tensor-parallelism speed; see https://github.com/lm-sys/FastChat/blob/main/docs/vllm_integration.md
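
Roughly, the vLLM serving path looks like this (a sketch following that doc; run each command in its own terminal and adjust the model path and ports as needed):

python3 -m fastchat.serve.controller
python3 -m fastchat.serve.vllm_worker --model-path WizardLM/WizardLM-70B-V1.0 --num-gpus 4
python3 -m fastchat.serve.openai_api_server --host 0.0.0.0 --port 8000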

Or did I misunderstand?

infwinston avatar Nov 30 '23 08:11 infwinston

@infwinston Actually I mentioned the wrong command. The corresponding command is:

python3 -m fastchat.serve.vllm_worker --model-path WizardLM/WizardLM-70B-V1.0 --num-gpus 4

tacacs1101-debug avatar Nov 30 '23 09:11 tacacs1101-debug

@infwinston Actually I mentioned the wrong command. The corresponding command is:

python3 -m fastchat.serve.vllm_worker --model-path WizardLM/WizardLM-70B-V1.0 --num-gpus 4

How's the inference speed?

ruifengma avatar Apr 12 '24 08:04 ruifengma

There was a hardware issue: 2 of the GPUs were not able to communicate with each other. I checked the NVLink status and then upgraded the GPU drivers to resolve it.
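
For anyone debugging something similar, the GPU topology and NVLink state can be inspected with standard nvidia-smi commands before digging into FastChat or vLLM itself:

nvidia-smi topo -m
nvidia-smi nvlink --status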

tacacs1101-debug avatar Aug 06 '24 14:08 tacacs1101-debug