DeepSpeed-MII
tp > 1 inference is very slow
@mrwyattii I'm using the latest main branch and the test model is Llama-2-7b. When I run single-sentence inference with tp=4 it takes 267.98s, but with tp=1 the same inference takes only 7s. This result is very strange. Could you please take a look?
In addition, for concurrent testing, I modified DeepSpeed-MII/mii/backend/client.py:L73:

Before:
    return self.asyncio_loop.run_until_complete(self._request_async_response(request_dict, **query_kwargs))

After:
    return asyncio.create_task(self._request_async_response(request_dict, **query_kwargs))
If there is a problem with my modification, can you provide an example that supports concurrent client testing? Thank you very much!
Hi @easonfzw, that TP=4 time certainly seems bad! I just tested with the latest main branch and here is what I see on a 2xA6000 setup:
import mii
import time

client = mii.serve(
    "meta-llama/Llama-2-7b-hf",
    tensor_parallel=1,
)
start = time.time()
output = client.generate("DeepSpeed is", max_length=1024, ignore_eos=True)
end = time.time()
client.terminate_server()
tp1_time = end - start

client = mii.serve(
    "meta-llama/Llama-2-7b-hf",
    tensor_parallel=2,
)
start = time.time()
output = client.generate("DeepSpeed is", max_length=1024, ignore_eos=True)
end = time.time()
client.terminate_server()
tp2_time = end - start

print("TP1 time:", tp1_time)
print("TP2 time:", tp2_time)
Output:
TP1 time: 22.425052165985107
TP2 time: 13.85570764541626
Could you share more about your setup: what GPUs are you using, what version of CUDA do you have, and what version of PyTorch do you have?
I'm not sure about the modification you have made. I will need to dive into the code a bit to understand if this would have any negative impact on performance. For multiple client testing, we are spawning multiple processes. For example, you could do something like this:
import subprocess

processes = []
for i in range(32):
    processes.append(
        subprocess.Popen(
            [
                "python",
                "-c",
                "import mii; mii.client('meta-llama/Llama-2-7b-hf')('DeepSpeed is', ignore_eos=True, max_length=256)",
            ],
            stdout=subprocess.PIPE,
        )
    )
Are you wanting multiple clients in a single process for benchmarking purposes?
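If single-process concurrency is the goal, here is a minimal sketch of one way to do it without patching client.py. It reuses the callable-client pattern mii.client(model)(prompt, ...) from the subprocess example above; the thread count and prompt list are illustrative, and whether a single client object can safely be shared across threads is not established here, so the sketch conservatively creates one client per worker:

from concurrent.futures import ThreadPoolExecutor

import mii

# Illustrative inputs, not from the thread.
PROMPTS = ["DeepSpeed is"] * 8

def run_one(prompt):
    # One client per thread: each client drives its own asyncio loop,
    # which avoids the run_until_complete vs. create_task issue entirely.
    client = mii.client("meta-llama/Llama-2-7b-hf")
    return client(prompt, max_length=256, ignore_eos=True)

with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(run_one, PROMPTS))

Note that the asyncio.create_task modification above is risky on its own: the returned task is never awaited, and the client's event loop may not be running when the task is created.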
@mrwyattii First, thanks for your reply :)
What is very strange is that I ran your tp=1 and tp=2 example above, and tp=2 takes far longer. Looking forward to your reply :)
TP1 time: 8.999823808670044
TP2 time: 337.3766210079193
Env info:
H100(80GB) 1*gpu
NVIDIA-SMI 525.147.05 Driver Version: 525.147.05 CUDA Version: 12.2
Python 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] on linux
torch 2.1.0a0+32f93b1
transformers 4.34.0
flash-attn 2.3.2
TP1 time: 15.307891845703125
TP2 time: 55.08188509941101
Env info:
A100(40GB) 1*gpu
NVIDIA-SMI 515.105.01 Driver Version: 515.105.01 CUDA Version: 12.1
Python 3.10.6 (main, Mar 10 2023, 10:55:28) [GCC 11.3.0] on linux
torch 2.0.0
transformers 4.34.0
flash-attn 2.3.2
DeepSpeed: commit 4388a605f854db91302c4f89053ee861eb31bacd
DeepSpeed-Kernels: commit b62777e8ba87d82689b40625067f58a683bf7788
DeepSpeed-MII: commit ddbc6fc11b914abc2f166f346845f2476f61bfe7
In addition, when I run the subprocess sample code you provided, an error is reported:
Traceback (most recent call last):
File "/usr/lib/python3.10/logging/__init__.py", line 1104, in emit
self.flush()
File "/usr/lib/python3.10/logging/__init__.py", line 1084, in flush
self.stream.flush()
BrokenPipeError: [Errno 32] Broken pipe
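The BrokenPipeError here most likely means the child processes were still writing to their piped stdout after the parent stopped reading it, for example because the parent script exited without waiting on the children. A hedged sketch of one possible workaround (not a fix confirmed in this thread) is to drain and wait on each child:

import subprocess

processes = []
for i in range(32):
    processes.append(
        subprocess.Popen(
            [
                "python",
                "-c",
                "import mii; mii.client('meta-llama/Llama-2-7b-hf')('DeepSpeed is', ignore_eos=True, max_length=256)",
            ],
            stdout=subprocess.PIPE,
        )
    )

# Drain each child's stdout and wait for it to exit; without this, the parent
# can exit while children are still flushing log output into a closed pipe.
for p in processes:
    p.communicate()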