
Long context causes vLLM to stop

Open · sunyuhan19981208 opened this issue 2 years ago

If I exceed the token limit of 4096, vLLM abruptly stops. It would be helpful if you could add some logging to the stopping code, so that users could easily modify it to resume vLLM from where it left off.

sunyuhan19981208 avatar Jun 28 '23 07:06 sunyuhan19981208
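Until something like that lands in vLLM, here is a minimal client-side sketch of the idea: count tokens before submitting and log any prompt that would blow the context window. The 4096 limit, the generation budget, the model name, and the `filter_prompts` helper are assumptions for illustration, not vLLM code.

```python
import logging

from transformers import AutoTokenizer

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("prompt-guard")

MAX_MODEL_LEN = 4096          # assumed context limit of the served model
MAX_NEW_TOKENS = 256          # assumed generation budget per request

# Placeholder model name; use the tokenizer that matches the served model.
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")


def filter_prompts(prompts):
    """Log and drop prompts whose total token budget would exceed the context window."""
    kept = []
    for i, prompt in enumerate(prompts):
        n_tokens = len(tokenizer(prompt).input_ids)
        if n_tokens + MAX_NEW_TOKENS > MAX_MODEL_LEN:
            logger.warning("Skipping prompt %d: %d prompt tokens + %d new tokens > %d",
                           i, n_tokens, MAX_NEW_TOKENS, MAX_MODEL_LEN)
            continue
        kept.append(prompt)
    return kept
```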

Hi, I think this is mentioned in #273; you can refer to that PR to see the progress.

LinPoly avatar Jun 28 '23 12:06 LinPoly

Hello, the question seems to remain unsolved. When I set max_num_batched_tokens to a very large value (such as 10000) or the input is quite long (near 10000 tokens), vLLM falls into an endless loop and does not use the GPU. Is there any solution for long-context situations? @LinPoly @zhuohan123

canghongjian avatar Jul 11 '23 08:07 canghongjian

> Hello, the question seems to remain unsolved. When I set max_num_batched_tokens to a very large value (such as 10000) or the input is quite long (near 10000 tokens), vLLM falls into an endless loop and does not use the GPU. Is there any solution for long-context situations? @LinPoly @zhuohan123

Hi! I think this is because you set max_num_batched_tokens and max_num_seqs so large that they no longer act as effective limits. If a request needs more GPU memory than you have, it cannot actually be processed, but the two values are too large to filter that request out, so it stays in the waiting queue. The repository maintainers may be able to give a better explanation.

LinPoly avatar Jul 12 '23 12:07 LinPoly
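In other words, keeping the scheduler limits close to what the hardware can really serve makes the failure visible up front instead of leaving the request stuck in the queue. A rough sketch of a more conservative configuration; the exact values and the model name are placeholders to be tuned per GPU, not recommended settings.

```python
from vllm import LLM, SamplingParams

# If max_num_batched_tokens / max_num_seqs are far larger than the KV cache can
# really hold, an oversized request passes the check but can never be scheduled
# and sits in the waiting queue forever. Keeping the limits realistic surfaces
# the problem immediately instead.
llm = LLM(
    model="facebook/opt-13b",        # placeholder model name
    max_num_batched_tokens=4096,     # cap on tokens scheduled per step
    max_num_seqs=64,                 # cap on concurrently running sequences
    gpu_memory_utilization=0.90,     # fraction of GPU memory handed to vLLM
)

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=64))
for output in outputs:
    print(output.outputs[0].text)
```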

> Hello, the question seems to remain unsolved. When I set max_num_batched_tokens to a very large value (such as 10000) or the input is quite long (near 10000 tokens), vLLM falls into an endless loop and does not use the GPU. Is there any solution for long-context situations? @LinPoly @zhuohan123

> Hi! I think this is because you set max_num_batched_tokens and max_num_seqs so large that they no longer act as effective limits. If a request needs more GPU memory than you have, it cannot actually be processed, but the two values are too large to filter that request out, so it stays in the waiting queue. The repository maintainers may be able to give a better explanation.

Thanks for the reply. But I have to set max_num_batched_tokens large to fit the long-context input. In fact the input is not that long, because I have tested the Hugging Face version and it can handle up to 8k context on the same device. If it works in the HF setting but not in vLLM, that may expose a problem. I wonder whether there are other solutions for long context.

canghongjian avatar Jul 13 '23 03:07 canghongjian

> Hello, the question seems to remain unsolved. When I set max_num_batched_tokens to a very large value (such as 10000) or the input is quite long (near 10000 tokens), vLLM falls into an endless loop and does not use the GPU. Is there any solution for long-context situations? @LinPoly @zhuohan123

Hi, can you show me where in the code "vLLM will fall into an endless loop and not use GPU" happens?

I also encountered a vLLM hanging problem, and it seems vLLM falls into an endless loop.

David-Lee-1990 avatar Jul 27 '23 10:07 David-Lee-1990

> Hello, the question seems to remain unsolved. When I set max_num_batched_tokens to a very large value (such as 10000) or the input is quite long (near 10000 tokens), vLLM falls into an endless loop and does not use the GPU. Is there any solution for long-context situations? @LinPoly @zhuohan123

> Hi, can you show me where in the code "vLLM will fall into an endless loop and not use GPU" happens?

> I also encountered a vLLM hanging problem, and it seems vLLM falls into an endless loop.

I guess it happens in the _schedule() function in scheduler.py.

canghongjian avatar Jul 29 '23 07:07 canghongjian
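One way to confirm that without modifying vLLM is to have Python dump all thread tracebacks periodically; if the process is spinning, successive dumps will keep pointing at the same frames. A generic debugging sketch using the standard-library faulthandler module, not vLLM-specific code:

```python
import faulthandler
import sys

# Dump the traceback of every thread to stderr every 60 seconds. If the engine
# is stuck, the repeated dumps will keep showing the same frames, e.g. inside
# the scheduler loop.
faulthandler.dump_traceback_later(60, repeat=True, file=sys.stderr)

# ... start the vLLM engine / API server as usual after this point ...
```

Attaching `py-spy dump --pid <PID>` to the already-running process gives a similar snapshot without touching any code.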

> Hello, the question seems to remain unsolved. When I set max_num_batched_tokens to a very large value (such as 10000) or the input is quite long (near 10000 tokens), vLLM falls into an endless loop and does not use the GPU. Is there any solution for long-context situations? @LinPoly @zhuohan123

Same here. When I ran Baichuan-13B using vllm.entrypoints.api_server, it hung abruptly and stopped using the GPU. There was no message at all, but the server was dead.

dalong2hongmei avatar Aug 24 '23 09:08 dalong2hongmei

@canghongjian @dalong2hongmei Hello, have you solved it? I have also experienced vLLM v0.1.3 hanging twice, and it looks like a deadlock problem. It's hard to locate the code position.

tbup avatar Aug 25 '23 09:08 tbup

> @canghongjian @dalong2hongmei Hello, have you solved it? I have also experienced vLLM v0.1.3 hanging twice, and it looks like a deadlock problem. It's hard to locate the code position.

It's probably caused by vLLM itself. I encountered this issue again several days ago and found that the maximum length that does not trigger the endless loop rises as the available GPU memory increases. Specifically, it is around 7000 tokens on a 15 GB T4 and around 20000 on a 22 GB A10.

canghongjian avatar Aug 25 '23 10:08 canghongjian
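That scaling is consistent with the size of the KV cache: roughly, the longest sequence the scheduler can ever place is bounded by the number of GPU cache blocks times the block size, and the block count grows with free GPU memory. A back-of-the-envelope sketch; all numbers are illustrative assumptions for a 13B-class model, not measurements from this issue.

```python
# Rough estimate of the longest sequence the KV cache can hold.
block_size = 16                          # tokens per KV-cache block (vLLM default)
kv_bytes_per_token = 2 * 40 * 5120 * 2   # (K and V) * layers * hidden_size * fp16 bytes
free_gpu_bytes = 8 * 1024**3             # memory left for the KV cache after the weights

num_gpu_blocks = free_gpu_bytes // (kv_bytes_per_token * block_size)
max_schedulable_tokens = num_gpu_blocks * block_size
print(f"~{max_schedulable_tokens} tokens fit in the KV cache")
# A prompt needing more tokens than this can never be scheduled, so it waits forever.
```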

Probably this: https://github.com/vllm-project/vllm/issues/546

For the record, I wasn't able to fix this particular issue.

syskn avatar Aug 25 '23 11:08 syskn

Mine might be a different problem. I've reproduced it: if many requests are interrupted and aborted, the vLLM service does not pass the stress test because it falls into an endless loop. I don't know exactly why, but here is the code position where the error occurs: https://github.com/vllm-project/vllm/blob/main/vllm/engine/async_llm_engine.py#L160 @zhuohan123 Can you confirm it?

tbup avatar Aug 26 '23 01:08 tbup
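For anyone trying to reproduce that, here is a rough sketch of such a stress test: fire many streaming requests at the api_server and cancel them almost immediately, so the server has to abort them mid-generation. The URL, payload fields, request count, and timing are assumptions, not taken from the original test.

```python
import asyncio

import aiohttp

API_URL = "http://localhost:8000/generate"  # assumed default api_server address


async def consume(session: aiohttp.ClientSession, payload: dict) -> None:
    """Open a streaming generation request and read chunks until cancelled."""
    async with session.post(API_URL, json=payload) as resp:
        async for _ in resp.content.iter_chunked(1024):
            pass


async def fire_and_abort(session: aiohttp.ClientSession, i: int) -> None:
    """Start a request, let it run briefly, then cancel to force a server-side abort."""
    payload = {"prompt": f"Request {i}: write a very long story.",
               "max_tokens": 512, "stream": True}
    task = asyncio.create_task(consume(session, payload))
    await asyncio.sleep(0.05)   # give generation a moment to start
    task.cancel()               # dropping the connection simulates a client abort
    try:
        await task
    except asyncio.CancelledError:
        pass


async def main(n_requests: int = 200) -> None:
    async with aiohttp.ClientSession() as session:
        await asyncio.gather(*(fire_and_abort(session, i) for i in range(n_requests)))


if __name__ == "__main__":
    asyncio.run(main())
```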

> Mine might be a different problem. I've reproduced it: if many requests are interrupted and aborted, the vLLM service does not pass the stress test because it falls into an endless loop. I don't know exactly why, but here is the code position where the error occurs: https://github.com/vllm-project/vllm/blob/main/vllm/engine/async_llm_engine.py#L160 @zhuohan123 Can you confirm it?

I have fixed the problem.

tbup avatar Aug 27 '23 14:08 tbup