
Long context causes vLLM to stop

Open · sunyuhan19981208 opened this issue 2 years ago

If I exceed the token limit of 4096, vLLM abruptly stops. It would be helpful if you could add some logging to the stopping code, so that users could easily modify it to resume vLLM from where it left off.

sunyuhan19981208 avatar Jun 28 '23 07:06 sunyuhan19981208
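Until something like that lands in vLLM, here is a minimal client-side sketch of the idea: count tokens before submitting and log any prompt that would blow the context window. The 4096 limit, the generation budget, the model name, and the `filter_prompts` helper are assumptions for illustration, not vLLM code.

```python
import logging

from transformers import AutoTokenizer

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("prompt-guard")

MAX_MODEL_LEN = 4096          # assumed context limit of the served model
MAX_NEW_TOKENS = 256          # assumed generation budget per request

# Placeholder model name; use the tokenizer that matches the served model.
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")


def filter_prompts(prompts):
    """Log and drop prompts whose total token budget would exceed the context window."""
    kept = []
    for i, prompt in enumerate(prompts):
        n_tokens = len(tokenizer(prompt).input_ids)
        if n_tokens + MAX_NEW_TOKENS > MAX_MODEL_LEN:
            logger.warning("Skipping prompt %d: %d prompt tokens + %d new tokens > %d",
                           i, n_tokens, MAX_NEW_TOKENS, MAX_MODEL_LEN)
            continue
        kept.append(prompt)
    return kept
```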

Hi, I think this is mentioned in #273; you can refer to that PR to see the progress.

LinPoly avatar Jun 28 '23 12:06 LinPoly

Hello, the question seems to remain unsolved. When I set max_num_batched_tokens to a very large value (such as 10000) or the input is quite long (near 10000 tokens), vLLM falls into an endless loop and does not use the GPU. Is there any solution for long-context situations? @LinPoly @zhuohan123

canghongjian avatar Jul 11 '23 08:07 canghongjian

> Hello, the question seems to remain unsolved. When I set max_num_batched_tokens to a very large value (such as 10000) or the input is quite long (near 10000 tokens), vLLM falls into an endless loop and does not use the GPU. Is there any solution for long-context situations? @LinPoly @zhuohan123

Hi! I think this is because you set max_num_batched_tokens and max_num_seqs so large that they no longer act as effective limits. If a request needs more GPU memory than you have, it cannot actually be processed, but the two values are too large to filter that request out, so it stays in the waiting queue. The repository maintainers may be able to give a better explanation.

LinPoly avatar Jul 12 '23 12:07 LinPoly
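In other words, keeping the scheduler limits close to what the hardware can really serve makes the failure visible up front instead of leaving the request stuck in the queue. A rough sketch of a more conservative configuration; the exact values and the model name are placeholders to be tuned per GPU, not recommended settings.

```python
from vllm import LLM, SamplingParams

# If max_num_batched_tokens / max_num_seqs are far larger than the KV cache can
# really hold, an oversized request passes the check but can never be scheduled
# and sits in the waiting queue forever. Keeping the limits realistic surfaces
# the problem immediately instead.
llm = LLM(
    model="facebook/opt-13b",        # placeholder model name
    max_num_batched_tokens=4096,     # cap on tokens scheduled per step
    max_num_seqs=64,                 # cap on concurrently running sequences
    gpu_memory_utilization=0.90,     # fraction of GPU memory handed to vLLM
)

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=64))
for output in outputs:
    print(output.outputs[0].text)
```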

> Hello, the question seems to remain unsolved. When I set max_num_batched_tokens to a very large value (such as 10000) or the input is quite long (near 10000 tokens), vLLM falls into an endless loop and does not use the GPU. Is there any solution for long-context situations? @LinPoly @zhuohan123

> Hi! I think this is because you set max_num_batched_tokens and max_num_seqs so large that they no longer act as effective limits. If a request needs more GPU memory than you have, it cannot actually be processed, but the two values are too large to filter that request out, so it stays in the waiting queue. The repository maintainers may be able to give a better explanation.

Thanks for the reply. But I have to set max_num_batched_tokens large to fit the long-context input. In fact the input is not that long, because I have tested the Hugging Face version and it can handle up to 8k context on the same device. If it works in the HF setting but not in vLLM, that may expose a problem. I wonder whether there are other solutions for long context.

canghongjian avatar Jul 13 '23 03:07 canghongjian

> Hello, the question seems to remain unsolved. When I set max_num_batched_tokens to a very large value (such as 10000) or the input is quite long (near 10000 tokens), vLLM falls into an endless loop and does not use the GPU. Is there any solution for long-context situations? @LinPoly @zhuohan123

Hi, can you show me where in the code "vLLM will fall into an endless loop and not use GPU" happens?

I also encountered a vLLM hanging problem, and it seems vLLM falls into an endless loop.

David-Lee-1990 avatar Jul 27 '23 10:07 David-Lee-1990

> Hello, the question seems to remain unsolved. When I set max_num_batched_tokens to a very large value (such as 10000) or the input is quite long (near 10000 tokens), vLLM falls into an endless loop and does not use the GPU. Is there any solution for long-context situations? @LinPoly @zhuohan123

> Hi, can you show me where in the code "vLLM will fall into an endless loop and not use GPU" happens?

> I also encountered a vLLM hanging problem, and it seems vLLM falls into an endless loop.

I guess it happens in the _schedule() function in scheduler.py.

canghongjian avatar Jul 29 '23 07:07 canghongjian
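One way to confirm that without modifying vLLM is to have Python dump all thread tracebacks periodically; if the process is spinning, successive dumps will keep pointing at the same frames. A generic debugging sketch using the standard-library faulthandler module, not vLLM-specific code:

```python
import faulthandler
import sys

# Dump the traceback of every thread to stderr every 60 seconds. If the engine
# is stuck, the repeated dumps will keep showing the same frames, e.g. inside
# the scheduler loop.
faulthandler.dump_traceback_later(60, repeat=True, file=sys.stderr)

# ... start the vLLM engine / API server as usual after this point ...
```

Attaching `py-spy dump --pid <PID>` to the already-running process gives a similar snapshot without touching any code.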

> Hello, the question seems to remain unsolved. When I set max_num_batched_tokens to a very large value (such as 10000) or the input is quite long (near 10000 tokens), vLLM falls into an endless loop and does not use the GPU. Is there any solution for long-context situations? @LinPoly @zhuohan123

Same here. When I ran Baichuan-13B using vllm.entrypoints.api_server, it hung abruptly and stopped using the GPU. There was no message at all, but the server was dead.

dalong2hongmei avatar Aug 24 '23 09:08 dalong2hongmei

@canghongjian @dalong2hongmei Hello, have you solved it? I have also experienced vLLM v0.1.3 hanging twice, and it looks like a deadlock problem. It's hard to locate the code position.

tbup avatar Aug 25 '23 09:08 tbup

> @canghongjian @dalong2hongmei Hello, have you solved it? I have also experienced vLLM v0.1.3 hanging twice, and it looks like a deadlock problem. It's hard to locate the code position.

It's probably caused by vLLM itself. I encountered this issue again several days ago and found that the maximum length that does not trigger the endless loop rises as the available GPU memory increases. Specifically, it is around 7000 tokens on a 15 GB T4 and around 20000 on a 22 GB A10.

canghongjian avatar Aug 25 '23 10:08 canghongjian
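That scaling is consistent with the size of the KV cache: roughly, the longest sequence the scheduler can ever place is bounded by the number of GPU cache blocks times the block size, and the block count grows with free GPU memory. A back-of-the-envelope sketch; all numbers are illustrative assumptions for a 13B-class model, not measurements from this issue.

```python
# Rough estimate of the longest sequence the KV cache can hold.
block_size = 16                          # tokens per KV-cache block (vLLM default)
kv_bytes_per_token = 2 * 40 * 5120 * 2   # (K and V) * layers * hidden_size * fp16 bytes
free_gpu_bytes = 8 * 1024**3             # memory left for the KV cache after the weights

num_gpu_blocks = free_gpu_bytes // (kv_bytes_per_token * block_size)
max_schedulable_tokens = num_gpu_blocks * block_size
print(f"~{max_schedulable_tokens} tokens fit in the KV cache")
# A prompt needing more tokens than this can never be scheduled, so it waits forever.
```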

Probably this: https://github.com/vllm-project/vllm/issues/546

For the record, I wasn't able to fix this particular issue.

syskn avatar Aug 25 '23 11:08 syskn

Mine might be a different problem. I've reproduced it: if many requests are interrupted and aborted, the vLLM service does not pass the stress test because it falls into an endless loop. I don't know exactly why, but here is the code position where the error occurs: https://github.com/vllm-project/vllm/blob/main/vllm/engine/async_llm_engine.py#L160 @zhuohan123 Can you confirm it?

tbup avatar Aug 26 '23 01:08 tbup
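For anyone trying to reproduce that, here is a rough sketch of such a stress test: fire many streaming requests at the api_server and cancel them almost immediately, so the server has to abort them mid-generation. The URL, payload fields, request count, and timing are assumptions, not taken from the original test.

```python
import asyncio

import aiohttp

API_URL = "http://localhost:8000/generate"  # assumed default api_server address


async def consume(session: aiohttp.ClientSession, payload: dict) -> None:
    """Open a streaming generation request and read chunks until cancelled."""
    async with session.post(API_URL, json=payload) as resp:
        async for _ in resp.content.iter_chunked(1024):
            pass


async def fire_and_abort(session: aiohttp.ClientSession, i: int) -> None:
    """Start a request, let it run briefly, then cancel to force a server-side abort."""
    payload = {"prompt": f"Request {i}: write a very long story.",
               "max_tokens": 512, "stream": True}
    task = asyncio.create_task(consume(session, payload))
    await asyncio.sleep(0.05)   # give generation a moment to start
    task.cancel()               # dropping the connection simulates a client abort
    try:
        await task
    except asyncio.CancelledError:
        pass


async def main(n_requests: int = 200) -> None:
    async with aiohttp.ClientSession() as session:
        await asyncio.gather(*(fire_and_abort(session, i) for i in range(n_requests)))


if __name__ == "__main__":
    asyncio.run(main())
```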

> Mine might be a different problem. I've reproduced it: if many requests are interrupted and aborted, the vLLM service does not pass the stress test because it falls into an endless loop. I don't know exactly why, but here is the code position where the error occurs: https://github.com/vllm-project/vllm/blob/main/vllm/engine/async_llm_engine.py#L160 @zhuohan123 Can you confirm it?

I have fixed the problem.

tbup avatar Aug 27 '23 14:08 tbup