
Prompt size limits? It keeps hanging with prompts longer than 120 tokens

Open ktolias opened this issue 2 years ago • 6 comments

Are there any prompt size limits? It seems that using more than 120 words makes the model unresponsive. Consider the following case: in the first try I used 112 words in the prompt and it worked just fine.

[Screenshot 2023-06-27 103239]

Then I tried increasing the prompt size to 140 words and it just became unresponsive; I had to kill it.

[Screenshot 2023-06-27 105213]

The same behavior is reproduced every time I try to give a prompt larger than 120 - 130 words.

Probably this is why it doesn't work with more complex chains (RetrievalQA) and LlamaIndex (#233). Any ideas?

ktolias avatar Jun 27 '23 07:06 ktolias

Hi! What's the maximum number of tokens that your model can support? Is 120-130 words beyond that limit?

BTW, it is a known issue (#113) that there is no explicit error message when the request is too long in vLLM. It's being actively fixed (#273).
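
If you're not sure, a quick way to check the context window (a sketch assuming a LLaMA-style Hugging Face config; the model name is a placeholder) is:

```python
# Sketch: read the model's context window from its Hugging Face config.
# "<your-model-name>" is a placeholder for whichever checkpoint you are serving.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("<your-model-name>")
print(cfg.max_position_embeddings)  # context length for LLaMA-style models
```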

zhuohan123 avatar Jun 27 '23 16:06 zhuohan123

Hi, the model is Guanaco-7B, which is capable of handling more than 2000 tokens. More specifically, I tried "KBlueLeaf/guanaco-7B-leh" and "TheBloke/guanaco-7B-HF".

ktolias avatar Jun 27 '23 16:06 ktolias

@zhuohan123 Hi, I use vicuna_7b_1.3 for inference. My input text is very long (more than 1500 tokens), and I limited the maximum length to 2048. After submitting multiple inferences in one batch, I found that the last one or two may come out of order or output the same token repeatedly. Is this normal? Could it be caused by the server's CPU occasionally being taken by other processes?

xcxhy avatar Jun 28 '23 05:06 xcxhy

@ktolias, it seems you are counting "tokens" at the word level (i.e., with a whitespace tokenizer). That does not reflect the true token count, since in practice a sub-word tokenizer is used. You should tokenize the text with the actual internal tokenizer and then check the lengths. Your actual token count is probably much higher than 112.
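
For example, something along these lines (a sketch; the checkpoint name and prompt are placeholders taken from this thread):

```python
# Sketch: compare the whitespace "word" count with the actual sub-word token count.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("TheBloke/guanaco-7B-HF")  # example checkpoint
prompt = "your long prompt here ..."
print("words :", len(prompt.split()))
print("tokens:", len(tokenizer(prompt).input_ids))
```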

JRC1995 avatar Jun 28 '23 10:06 JRC1995

@JRC1995 Thanks for your reply. I understand what you are saying about the difference between words and tokens. Nevertheless, this is not what's causing the problem. The text is being tokenized by the internal tokenizer (decapoda-research/llama-7b-hf); the space tokenizer was used just to make the length of the text clearer. There is no chance that the sequence length (token count) is larger than what the model can handle.

Just to make myself clear, I've tested the model natively (without vLLM) and it was able to cope with 10x larger input, so there must be something in the way vLLM handles the prompt. Please mind that the model is able to handle the same total number of tokens if the prompt is short and the output is long. The problem arises in the opposite situation: long prompt -> no answer (hangs).
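
For reference, the native test was along these lines (a minimal sketch; the exact generation settings here are illustrative):

```python
# Minimal sketch of a plain-transformers run for comparison; settings are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/guanaco-7B-HF"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

long_prompt = "your long prompt here ..."
inputs = tok(long_prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=256)
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```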

Thank you for your efforts and I'm positive that you will soon provide a fix, or at least an explanation about this behaviour.

Anyhow, the problem is very easy to reproduce: just wrap one of the models ("KBlueLeaf/guanaco-7B-leh", "TheBloke/guanaco-7B-HF") with vLLM and try to perform inference with a large prompt.

ktolias avatar Jun 28 '23 13:06 ktolias

@ktolias Does LLM("model", max_num_seqs=512) help?
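
Something like this (a sketch of the offline API; the model name and sampling settings are placeholders):

```python
# Sketch: offline vLLM inference with a larger max_num_seqs; values are placeholders.
from vllm import LLM, SamplingParams

llm = LLM(model="TheBloke/guanaco-7B-HF", max_num_seqs=512)
params = SamplingParams(temperature=0.8, max_tokens=256)
outputs = llm.generate(["<your long prompt>"], params)
print(outputs[0].outputs[0].text)
```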

aliswel-mt avatar Jul 03 '23 09:07 aliswel-mt

@Yung-Kai Thanks for the tip. Initially, I faced the same problem even with max_num_seqs. As my last hope, I tried to use Colab Pro+ with an A100 40GB.

That was it!! It now works and it's blazing fast! Congrats you guys!!

ktolias avatar Jul 03 '23 17:07 ktolias

It seems that in my case the issue was GPU memory. With the A100 40GB (Colab Pro+), the model works perfectly.

ktolias avatar Jul 03 '23 17:07 ktolias