DeepSpeed-MII
Deadlock detected
Deadlock detected. Resetting KV cache and recomputing requests. Consider limiting number of concurrent requests or decreasing max lengths of prompts/generations.
I constantly see this issue when running the code below on an A100 (40 GiB) with llama2-7b.
import mii
from deepspeed.inference import RaggedInferenceEngineConfig, DeepSpeedTPConfig
from deepspeed.inference.v2.ragged import DSStateManagerConfig

tp_config = DeepSpeedTPConfig(tp_size=tensor_parallel)
mgr_config = DSStateManagerConfig(max_ragged_batch_size=1024,
                                  max_ragged_sequence_count=1024)
inference_config = RaggedInferenceEngineConfig(tensor_parallel=tp_config,
                                               state_manager=mgr_config)

llm = mii.serve(
    model,
    deployment_name='mii',
    tensor_parallel=tensor_parallel,
    inference_engine_config=inference_config,
    replica_num=1,
    task='text-generation'
)

outputs = llm.generate(prompts,
                       do_sample=False,
                       top_p=1.0,
                       max_new_tokens=max_new_tokens)
The deadlock is reported when we detect that we are not making any progress on any of the generation tasks. This can happen for a few reasons, including many concurrent generation requests, very long sequences, or limited GPU memory. Our current solution for this will hurt performance if you are seeing it often. How many requests are you sending to the server at one time?
Also, I believe @tohtana is working on an improved solution to this problem.
I am sending a few hundred requests within one batch.
If these requests are generating lots of tokens, then sending this many at once will definitely cause the deadlock situation. If you can send the requests in smaller batches, that would avoid the problem. However, I will let @tohtana comment on any upcoming changes that will allow users to send large batches of requests at once!
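For reference, here is a minimal sketch of that smaller-batch workaround, reusing the llm client and prompts from the snippet above; the chunk size of 16 is an assumption, not a recommended value, and should be tuned to your prompt/generation lengths:

def generate_in_chunks(llm, prompts, max_new_tokens, chunk_size=16):
    # Split the prompt list into small chunks so the KV cache only has to
    # hold a handful of in-flight requests at a time.
    outputs = []
    for start in range(0, len(prompts), chunk_size):
        batch = prompts[start:start + chunk_size]
        outputs.extend(llm.generate(batch,
                                    do_sample=False,
                                    top_p=1.0,
                                    max_new_tokens=max_new_tokens))
    return outputs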
Hi @flexwang, DeepSpeed-FastGen (MII) allocates KV cache for all requests that are processed in a batch. To avoid this warning, a simple workaround is to reduce the number of requests in a batch. In your case, I recommend starting with 10-20 requests, though the optimal number heavily depends on the lengths of the prompts and the generated tokens. If you don't encounter the warning message, you may be able to further enhance efficiency by gradually increasing the number of requests.
We understand that tuning the number of requests isn't always straightforward, and we're considering either automating this adjustment or at least making it easier in future versions.
vLLM implements swapping (Section 4.5 of the vLLM paper) as an alternative to recomputation when no space can be allocated for the KV cache of new tokens. Will MII implement KV cache swapping?
Hi, any update on this issue?
I am also getting the same issue even with a batch size of 1 (using 2 x A100 80GB), but when I use a single A100 80GB I am able to run even with larger batches.
@canamika27 I think #403 resolved the issue. Can you try the latest version?
@tohtana -- Thanks!! The deadlock issue is solved with the latest DeepSpeed version, but now I get a new error: assert last_r is not None, "Function to clear the KV cache is invoked, but no request consumes KV cache"
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
The batch size is 1
The batch size is 1
The total time is 3.23 secs
The batch size is 1
The total time is 4.21 secs
The batch size is 1
The total time is 3.15 secs
The batch size is 1
The total time is 3.15 secs
The batch size is 1
Traceback (most recent call last):
  File "/home/AutoAWQ/Digi_human/TP_DP/Deepspeed/test_deepspeed.py", line 37, in <module>
    response = pipe(prompts, max_new_tokens=256)
  File "/home/anaconda3/envs/mlc/lib/python3.10/site-packages/mii/batching/ragged_batching.py", line 550, in __call__
    self.schedule_requests()
  File "/home/anaconda3/envs/mlc/lib/python3.10/site-packages/mii/batching/ragged_batching.py", line 334, in schedule_requests
    self.reset_request_status()
  File "/home/anaconda3/envs/mlc/lib/python3.10/site-packages/mii/batching/ragged_batching.py", line 359, in reset_request_status
    assert last_r is not None, "Function to clear the KV cache is invoked, but no request consumes KV cache"
AssertionError: Function to clear the KV cache is invoked, but no request consumes KV cache
[2024-02-26 00:11:22,782] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 766664
[2024-02-26 00:11:25,008] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 766665
One observation from my end: I am currently using 2 x A100 80GB, and my prompts are approximately 1000-2000 tokens. When I reduce my prompt length to about 200 tokens, it works with batch size 1 but not with larger batches, so it seems we cannot run long prompts. This issue only happens when I use 2 GPUs; with 1 GPU I am able to run with large batches and long prompts.
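A minimal sketch of the shorter-prompt workaround described above, truncating each prompt to a fixed token budget before sending it to the server; the tokenizer id and the 200-token budget are illustrative assumptions, not values from this thread:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")  # assumed model id

def truncate_prompts(prompts, max_prompt_tokens=200):
    # Re-encode each prompt and keep only the first max_prompt_tokens tokens.
    truncated = []
    for p in prompts:
        ids = tokenizer(p, truncation=True, max_length=max_prompt_tokens)["input_ids"]
        truncated.append(tokenizer.decode(ids, skip_special_tokens=True))
    return truncated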
I got the same error.
same error
I have the same problem.
Any update on this? I am also getting the same error.
Any workaround for the new problem? @arashb Sorry for the ping, can you help?
I just want to run inference serially, but I got this error after exactly 3 pipeline calls. It reproduces consistently with Mixtral-8x7B on two machines.
Related issue: #497