DeepSpeed-MII
Deadlock detected
Deadlock detected. Resetting KV cache and recomputing requests. Consider limiting number of concurrent requests or decreasing max lengths of prompts/generations.
I constantly see this issue when running the code below on an A100 (40 GiB) with llama2-7b.
import mii
from deepspeed.inference import RaggedInferenceEngineConfig, DeepSpeedTPConfig
from deepspeed.inference.v2.ragged import DSStateManagerConfig

tp_config = DeepSpeedTPConfig(tp_size=tensor_parallel)
mgr_config = DSStateManagerConfig(max_ragged_batch_size=1024,
                                  max_ragged_sequence_count=1024)
inference_config = RaggedInferenceEngineConfig(tensor_parallel=tp_config,
                                               state_manager=mgr_config)

llm = mii.serve(
    model,
    deployment_name='mii',
    tensor_parallel=tensor_parallel,
    inference_engine_config=inference_config,
    replica_num=1,
    task='text-generation'
)

outputs = llm.generate(prompts,
                       do_sample=False,
                       top_p=1.0,
                       max_new_tokens=max_new_tokens)
The deadlock is reported when we detect that we are not making any progress on any of the generation tasks. This can happen for a few reasons, including many concurrent generation requests, very long sequences, or limited GPU memory. Our current solution for this will hurt performance if you are seeing it often. How many requests are you sending to the server at one time?
Also, I believe @tohtana is working on an improved solution to this problem.
I am sending a few hundred requests within one batch.
If these requests are generating lots of tokens, then sending this many at once will definitely cause the deadlock situation. If you can send the requests in smaller batches, that would avoid the problem. However, I will let @tohtana comment on any upcoming changes that will allow users to send large batches of requests at once!
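For reference, here is a minimal sketch of that smaller-batch workaround, reusing the llm client and prompts from the snippet above; the chunk size of 16 is an assumption, not a recommended value, and should be tuned to your prompt/generation lengths:

def generate_in_chunks(llm, prompts, max_new_tokens, chunk_size=16):
    # Split the prompt list into small chunks so the KV cache only has to
    # hold a handful of in-flight requests at a time.
    outputs = []
    for start in range(0, len(prompts), chunk_size):
        batch = prompts[start:start + chunk_size]
        outputs.extend(llm.generate(batch,
                                    do_sample=False,
                                    top_p=1.0,
                                    max_new_tokens=max_new_tokens))
    return outputs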
Hi @flexwang, DeepSpeed-FastGen (MII) allocates KV cache for all requests that are processed in a batch. To avoid this warning, a simple workaround is to reduce the number of requests in a batch. In your case, I recommend starting with 10-20 requests, though the optimal number heavily depends on the lengths of the prompts and the generated tokens. If you don't encounter the warning message, you may be able to further enhance efficiency by gradually increasing the number of requests.
We understand that tuning the number of requests isn't always straightforward, and we're considering either automating this adjustment or at least making it easier in future versions.
vLLM implements swapping (Section 4.5 of the vLLM paper) as an alternative to recomputation when no space can be allocated for the KV cache of new tokens. Will MII implement KV cache swapping?
Hi, any update on this issue?
I am also getting the same issue even with a batch size of 1 (using 2 x A100 80GB), but when I use a single A100 80GB I am able to run even with larger batches.
@canamika27 I think #403 resolved the issue. Can you try the latest version?
@tohtana -- Thanks!! The deadlock issue is solved with the latest DeepSpeed version, but now I get a new error: assert last_r is not None, "Function to clear the KV cache is invoked, but no request consumes KV cache"
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
The batch size is 1
The batch size is 1
The total time is 3.23 secs
The batch size is 1
The total time is 4.21 secs
The batch size is 1
The total time is 3.15 secs
The batch size is 1
The total time is 3.15 secs
The batch size is 1
Traceback (most recent call last):
  File "/home/AutoAWQ/Digi_human/TP_DP/Deepspeed/test_deepspeed.py", line 37, in <module>
    response = pipe(prompts, max_new_tokens=256)
  File "/home/anaconda3/envs/mlc/lib/python3.10/site-packages/mii/batching/ragged_batching.py", line 550, in __call__
    self.schedule_requests()
  File "/home/anaconda3/envs/mlc/lib/python3.10/site-packages/mii/batching/ragged_batching.py", line 334, in schedule_requests
    self.reset_request_status()
  File "/home/anaconda3/envs/mlc/lib/python3.10/site-packages/mii/batching/ragged_batching.py", line 359, in reset_request_status
    assert last_r is not None, "Function to clear the KV cache is invoked, but no request consumes KV cache"
AssertionError: Function to clear the KV cache is invoked, but no request consumes KV cache
[2024-02-26 00:11:22,782] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 766664
[2024-02-26 00:11:25,008] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 766665
One observation from my end: I am currently using 2 x A100 80GB, and my prompts are approximately 1000-2000 tokens. When I reduce my prompt length to about 200 tokens, it works with batch size 1 but not with larger batches, so it seems we cannot run long prompts. This issue only happens when I use 2 GPUs; with 1 GPU I am able to run with large batches and long prompts.
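A minimal sketch of the shorter-prompt workaround described above, truncating each prompt to a fixed token budget before sending it to the server; the tokenizer id and the 200-token budget are illustrative assumptions, not values from this thread:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")  # assumed model id

def truncate_prompts(prompts, max_prompt_tokens=200):
    # Re-encode each prompt and keep only the first max_prompt_tokens tokens.
    truncated = []
    for p in prompts:
        ids = tokenizer(p, truncation=True, max_length=max_prompt_tokens)["input_ids"]
        truncated.append(tokenizer.decode(ids, skip_special_tokens=True))
    return truncated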
I got the same error.
same error
I have the same problem.
Any update on this? I am also getting the same error.
Any workaround for the new problem? @arashb Sorry for the ping, can you help?
I just want to run inference serially, but I got this error after exactly 3 pipeline calls. It reproduces consistently with Mixtral-8x7B on two machines.
Related issue: #497