Long context causes vLLM to stop
If I exceed the token limit of 4096, vLLM abruptly stops. It would be helpful if you could incorporate some logging into the stopping code, so that users can easily modify the code to resume vLLM from where it left off.
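For context, the kind of resume logic I have in mind is something like this (a rough sketch around the offline `LLM` API; the progress file and its format are my own convention, not anything vLLM provides):

```python
# Sketch: log each finished prompt so a long run can be resumed after a stop.
# Uses the standard vllm.LLM offline API; the progress-log file and its JSONL
# format are my own convention, and the model name is a placeholder.
import json
from vllm import LLM, SamplingParams

PROGRESS_FILE = "vllm_progress.jsonl"

def load_done_indices():
    try:
        with open(PROGRESS_FILE) as f:
            return {json.loads(line)["index"] for line in f}
    except FileNotFoundError:
        return set()

def run(prompts):
    done = load_done_indices()
    llm = LLM(model="facebook/opt-125m")   # placeholder model
    params = SamplingParams(max_tokens=256)
    with open(PROGRESS_FILE, "a") as log:
        for i, prompt in enumerate(prompts):
            if i in done:                   # skip work finished before the stop
                continue
            out = llm.generate([prompt], params)[0]
            log.write(json.dumps({"index": i, "text": out.outputs[0].text}) + "\n")
            log.flush()                     # persist progress immediately
```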
Hi, I think this is mentioned in #273; you can refer to that PR and follow its progress.
Hello, the question seems to remain unsolved. When I set max_num_batched_tokens very large (e.g. 10000), or the input is quite long (near 10000 tokens), vLLM falls into an endless loop and does not use the GPU. Is there any solution for long-context situations? @LinPoly @zhuohan123
Hi! I think this is because you set max_num_batched_tokens and max_num_seqs too large for them to act as effective limits. If a request needs more GPU memory than you have, it can never actually be processed, but with those two values set so high the request is not filtered out, so it stays in the waiting queue forever. The repository maintainers may give a better explanation of this.
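For reference, these are the two knobs I mean; a minimal sketch of setting them through the offline API (the model name and values are just placeholders to tune for your GPU):

```python
# Sketch: how the two scheduler knobs are passed to the offline API.
# Model name and values are placeholders, not recommendations.
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-2-7b-hf",
    max_num_batched_tokens=8192,  # upper bound on tokens scheduled per step
    max_num_seqs=64,              # upper bound on sequences scheduled per step
)
```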
Thanks for the reply. But I have to set max_num_batched_tokens large to fit the long-context input. In fact the input is not that long: I have tested the HuggingFace version on the same device, and it could handle up to 8k of context. If it works in the HF setting but not in vLLM, that may expose a problem in vLLM. I wonder whether there are other solutions for long context.
Hi, can you show me at which code position "vLLM will fall into an endless loop and not use the GPU" happens?
I also encountered the vLLM hang problem, and it seems vLLM falls into an endless loop.
I guess it happens in the _schedule() function in scheduler.py.
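If it helps, one way to confirm where it hangs is to dump the Python stacks of the live process; a sketch using only the standard library (the signal and timeout choices are arbitrary):

```python
# Sketch: dump all thread stacks when the process appears to hang, to see
# whether it is stuck in Scheduler._schedule(). Stdlib only, Unix only.
import faulthandler
import signal

# Option 1: dump on demand with `kill -USR1 <pid>`.
faulthandler.register(signal.SIGUSR1)

# Option 2: dump automatically after 5 minutes, repeating so a long hang
# keeps producing traces.
faulthandler.dump_traceback_later(300, repeat=True)
```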
Same here: when I ran Baichuan-13B using vllm.entrypoints.api_server, it hung abruptly and stopped using the GPU. There was no message at all, but the server was dead.
@canghongjian @dalong2hongmei Hello, have you solved it? I have also experienced hangs with vLLM v0.1.3 twice, and it looks like a deadlock problem. It's hard to locate the code position.
It's probably caused by vLLM itself. I encountered this issue again several days ago and found that the maximum length that does not trigger the endless loop rises as the available GPU memory increases. In particular, it is around 7000 on a 15 GB T4 and around 20000 on a 22 GB A10.
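For what it's worth, that scaling roughly matches a back-of-the-envelope KV-cache estimate; a sketch with illustrative model numbers (layer count, heads, head dim, and dtype are assumptions, not measured from vLLM):

```python
# Rough estimate of how many tokens of KV cache fit in the memory left for
# the cache. All model numbers here are illustrative, not measured.
def max_cache_tokens(free_gib, num_layers=32, num_heads=32, head_dim=128,
                     dtype_bytes=2):
    # K and V per token per layer: 2 * num_heads * head_dim * dtype_bytes
    bytes_per_token = 2 * num_layers * num_heads * head_dim * dtype_bytes
    return int(free_gib * 1024**3 // bytes_per_token)

print(max_cache_tokens(10))  # ~20k tokens if ~10 GiB is left for the cache
print(max_cache_tokens(3.5)) # ~7k tokens if only ~3.5 GiB is left
```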
Probably this: https://github.com/vllm-project/vllm/issues/546
For the record, I wasn't able to fix this particular issue.
Mine might be a different problem. I've reproduced it: if there are many requests being aborted, the vLLM service will not pass a pressure test because it falls into an endless loop. I don't know exactly why, but here is the code position where the error occurs: https://github.com/vllm-project/vllm/blob/main/vllm/engine/async_llm_engine.py#L160 @zhuohan123 Can you confirm it?
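Roughly, my pressure test looks like the sketch below. The /generate endpoint and request body follow the demo api_server, so adjust them if your setup differs; the aborts are simulated by cancelling the client tasks mid-flight:

```python
# Sketch: fire many requests at the demo api_server and abort half of them
# mid-flight via client-side cancellation, to try to trigger the hang.
import asyncio
import aiohttp

URL = "http://localhost:8000/generate"   # demo api_server endpoint (assumed)

async def one_request(session, i):
    payload = {"prompt": f"Request {i}: tell me a long story.", "max_tokens": 512}
    async with session.post(URL, json=payload) as resp:
        return await resp.json()

async def main(n=200, abort_after=0.5):
    async with aiohttp.ClientSession() as session:
        tasks = [asyncio.create_task(one_request(session, i)) for i in range(n)]
        await asyncio.sleep(abort_after)
        for t in tasks[: n // 2]:        # abort half of the in-flight requests
            t.cancel()
        await asyncio.gather(*tasks, return_exceptions=True)

asyncio.run(main())
```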
I have fixed the problem.