
tp > 1 inference is very slow

fan-niu opened this issue 8 months ago · 2 comments

@mrwyattii I'm using the latest main branch, and the test model is Llama-2-7b. When I run single-sentence inference with tp=4, it takes 267.98 s, but with tp=1 the same inference takes only 7 s. This result is very strange. Can you please take a look?

In addition, for concurrent testing, I modified DeepSpeed-MII/mii/backend/client.py:L73 from

return self.asyncio_loop.run_until_complete(self._request_async_response(request_dict, **query_kwargs))

to

return asyncio.create_task(self._request_async_response(request_dict, **query_kwargs))
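(For context, with this change the caller gets back an asyncio.Task and must await it inside a running event loop. Below is a minimal self-contained sketch of that pattern in plain asyncio; fake_request is a hypothetical stand-in for MII's _request_async_response, not MII's actual client code:)

import asyncio

async def fake_request(i):
    # Hypothetical stand-in for _request_async_response.
    await asyncio.sleep(0.1)
    return f"response {i}"

async def main():
    # Launch several requests concurrently, then gather their results.
    tasks = [asyncio.create_task(fake_request(i)) for i in range(8)]
    results = await asyncio.gather(*tasks)
    print(results)

asyncio.run(main())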

If there is a problem with my modification, can you provide an example that supports concurrent client testing? Thank you very much!

fan-niu commented on Nov 14 '23

Hi @easonfzw, that TP=4 time certainly seems bad! I just tested with the latest main branch, and here is what I see on a 2xA6000 setup:

import mii
import time

# Spin up a server with tensor parallelism disabled (TP=1).
client = mii.serve(
    "meta-llama/Llama-2-7b-hf",
    tensor_parallel=1,
)
start = time.time()
output = client.generate("DeepSpeed is", max_length=1024, ignore_eos=True)
end = time.time()
client.terminate_server()
tp1_time = end - start

# Repeat the same benchmark with tensor parallelism across 2 GPUs (TP=2).
client = mii.serve(
    "meta-llama/Llama-2-7b-hf",
    tensor_parallel=2,
)
start = time.time()
output = client.generate("DeepSpeed is", max_length=1024, ignore_eos=True)
end = time.time()
client.terminate_server()
tp2_time = end - start

print("TP1 time:", tp1_time)
print("TP2 time:", tp2_time)

Output:

TP1 time: 22.425052165985107
TP2 time: 13.85570764541626

Could you share more about your setup? Which GPUs are you using, and which versions of CUDA and PyTorch do you have?

I'm not sure about the modification you made; I'll need to dive into the code a bit to understand whether it has any negative impact on performance. For multi-client testing, we spawn multiple processes. For example, you could do something like this:

import subprocess

# Launch 32 client processes, each sending one generation request.
processes = []
for i in range(32):
    processes.append(
        subprocess.Popen(
            [
                "python",
                "-c",
                "import mii; mii.client('meta-llama/Llama-2-7b-hf')('DeepSpeed is', ignore_eos=True, max_length=256)",
            ],
            stdout=subprocess.PIPE,
        )
    )
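
(Not shown above: you will probably also want to wait on each process and drain its stdout. A minimal sketch, assuming the processes list from the snippet above:)

# Wait for every client to finish and read its captured stdout.
# Leaving a PIPE unread can stall or break the pipe once its buffer fills.
for p in processes:
    out, _ = p.communicate()
    print(out.decode(errors="replace").strip())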

Are you wanting multiple clients in a single process for benchmarking purposes?

mrwyattii commented on Nov 14 '23

@mrwyattii First of all, thanks for your reply :)

What is very strange is that I ran your tp=1 / tp=2 example above, and tp=2 still takes a very long time. Looking forward to your reply :)

H100 (80 GB), 1 GPU:
TP1 time: 8.999823808670044
TP2 time: 337.3766210079193
Env info: NVIDIA-SMI 525.147.05, Driver Version 525.147.05, CUDA Version 12.2; Python 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] on linux; torch 2.1.0a0+32f93b1; transformers 4.34.0; flash-attn 2.3.2

A100 (40 GB), 1 GPU:
TP1 time: 15.307891845703125
TP2 time: 55.08188509941101
Env info: NVIDIA-SMI 515.105.01, Driver Version 515.105.01, CUDA Version 12.1; Python 3.10.6 (main, Mar 10 2023, 10:55:28) [GCC 11.3.0] on linux; torch 2.0.0; transformers 4.34.0; flash-attn 2.3.2

DeepSpeed: commit 4388a605f854db91302c4f89053ee861eb31bacd
DeepSpeed-Kernels: commit b62777e8ba87d82689b40625067f58a683bf7788
DeepSpeed-MII: commit ddbc6fc11b914abc2f166f346845f2476f61bfe7

In addition, when I use the subprocess sample code you provided, the following error is reported:

Traceback (most recent call last):
  File "/usr/lib/python3.10/logging/__init__.py", line 1104, in emit
    self.flush()
  File "/usr/lib/python3.10/logging/__init__.py", line 1084, in flush
    self.stream.flush()
BrokenPipeError: [Errno 32] Broken pipe

fan-niu commented on Nov 16 '23