ValueError: The model's max seq len (4096) is larger than the maximum number of tokens that can be stored in KV cache (3664). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.

Open handsomelys opened this issue 1 year ago • 35 comments

I followed the Quickstart tutorial and deployed the Chinese-llama-alpaca-2 model using vllm, and I got the following error.

***@***:~/Code/experiment/***/ToG$ CUDA_VISIBLE_DEVICES=0 python load_llm.py
INFO 01-11 15:51:02 llm_engine.py:70] Initializing an LLM engine with config: model='/home/***/***/models/alpaca-2', tokenizer='/home/***/***/models/alpaca-2', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=None, enforce_eager=False, seed=0)
INFO 01-11 15:51:18 llm_engine.py:275] # GPU blocks: 229, # CPU blocks: 512
Traceback (most recent call last):
  File "load_llm.py", line 8, in <module>
    llm = LLM(model='/home/***/***/models/alpaca-2')
  File "/home/***/anaconda3/envs/lys-llm-env/lib/python3.8/site-packages/vllm/entrypoints/llm.py", line 105, in __init__
    self.llm_engine = LLMEngine.from_engine_args(engine_args)
  File "/home/***/anaconda3/envs/lys-llm-env/lib/python3.8/site-packages/vllm/engine/llm_engine.py", line 309, in from_engine_args
    engine = cls(*engine_configs,
  File "/home/***/anaconda3/envs/lys-llm-env/lib/python3.8/site-packages/vllm/engine/llm_engine.py", line 114, in __init__
    self._init_cache()
  File "/home/***/anaconda3/envs/lys-llm-env/lib/python3.8/site-packages/vllm/engine/llm_engine.py", line 284, in _init_cache
    raise ValueError(
ValueError: The model's max seq len (4096) is larger than the maximum number of tokens that can be stored in KV cache (3664). Try increasing gpu_memory_utilization or decreasing max_model_len when initializing the engine.

My code is:

from vllm import LLM, SamplingParams

prompts = [
    "hello, who is you?",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
llm = LLM(model='/home/b3432/***/models/alpaca-2')
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

What's going on, and what do I need to do to fix the error? I'm running the code on a single RTX 3090 (24 GB). Looking forward to a reply!

handsomelys avatar Jan 11 '24 08:01 handsomelys

Same error.

I set gpu_memory_utilization=0.75 and a low max_model_len, but the responses are too short.
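One possible cause of the short responses, assuming the sampling parameters were left at their defaults (no code was posted, so this is a guess): `SamplingParams` in vLLM defaults to `max_tokens=16`, so generation stops after 16 new tokens regardless of `max_model_len`. A minimal sketch of picking a larger `max_tokens` that still fits the context window; `sampling_kwargs` and `generate` are hypothetical helper names, and the token counts are illustrative:

```python
from typing import Any, Dict, List


def sampling_kwargs(max_model_len: int, prompt_tokens: int,
                    desired_new_tokens: int) -> Dict[str, Any]:
    """Keyword arguments for SamplingParams, capped so that
    prompt + generated tokens never exceed max_model_len."""
    room = max(max_model_len - prompt_tokens, 0)
    return {
        "temperature": 0.8,
        "top_p": 0.95,
        # Without this, vLLM falls back to its small default (16 tokens).
        "max_tokens": min(desired_new_tokens, room),
    }


def generate(llm: Any, prompts: List[str], kwargs: Dict[str, Any]):
    """Run generation with the given kwargs (requires vLLM and a GPU)."""
    from vllm import SamplingParams  # imported lazily on purpose
    return llm.generate(prompts, SamplingParams(**kwargs))


# Illustrative numbers: a 48-token prompt in a 2048-token context.
print(sampling_kwargs(max_model_len=2048, prompt_tokens=48,
                      desired_new_tokens=512))
```

`generate` is not called here because it needs a loaded model; the point is only that `max_tokens` is set separately from `max_model_len`.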

chopin1998 avatar Jan 11 '24 10:01 chopin1998

Having the same issue running CodeLLaMa 13b instruct hf with the langchain integration for vLLM.

The model's max seq len (16384) is larger than the maximum number of tokens that can be stored in KV cache (11408). Try increasing gpu_memory_utilization or decreasing max_model_len when initializing the engine. (type=value_error)

ishand0101 avatar Jan 11 '24 20:01 ishand0101

same error.

byerose avatar Jan 12 '24 09:01 byerose

ValueError: The model's max seq len (32768) is larger than the maximum number of tokens that can be stored in KV cache (26064). Try increasing gpu_memory_utilization or decreasing max_model_len when initializing the engine.

Mistral-7B-v0.1

gree2 avatar Jan 13 '24 09:01 gree2

Same exception: ValueError: The model's max seq len (2048) is larger than the maximum number of tokens that can be stored in KV cache (176). Try increasing gpu_memory_utilization or decreasing max_model_len when initializing the engine.

aklakl avatar Jan 14 '24 19:01 aklakl

Set max_model_len to a value no larger than the KV cache capacity reported in the error message. It works.
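For the numbers in the original post, the limit can be reproduced from the startup log: 229 GPU blocks at 16 tokens per block gives 3664 tokens. A minimal sketch, assuming a 16-token block size (vLLM's default `block_size`, which is not stated anywhere in this thread); `kv_cache_capacity`, `pick_max_model_len` and `build_llm` are hypothetical helper names:

```python
# Assumption: 16 tokens per KV-cache block (vLLM's default block_size).
BLOCK_SIZE = 16


def kv_cache_capacity(num_gpu_blocks: int, block_size: int = BLOCK_SIZE) -> int:
    """Tokens that fit in the KV cache, given the '# GPU blocks' log line."""
    return num_gpu_blocks * block_size


def pick_max_model_len(model_max_len: int, num_gpu_blocks: int) -> int:
    """Largest max_model_len that passes the engine's startup check."""
    return min(model_max_len, kv_cache_capacity(num_gpu_blocks))


def build_llm(model_path: str, max_model_len: int):
    """Create the engine with an explicit context limit (needs vLLM + GPU)."""
    from vllm import LLM  # imported lazily on purpose
    return LLM(model=model_path, max_model_len=max_model_len)


# Values from the first post: 229 GPU blocks, model max seq len 4096.
capacity = kv_cache_capacity(229)
safe_len = pick_max_model_len(4096, 229)
print(capacity, safe_len)
```

`build_llm` is left uncalled since it needs a GPU. Passing `max_model_len` to `LLM(...)` this way keeps the limit in your own script instead of in the installed package.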

byerose avatar Jan 17 '24 12:01 byerose

I hard-coded a fixed value for max_model_len.

vllm/config.py: 104

# self.max_model_len = _get_and_verify_max_len(self.hf_config,
#                                              max_model_len)
self.max_model_len = 4096

Bambuuai avatar Jan 25 '24 15:01 Bambuuai

I have the same issue here

ZhangzihanGit avatar Feb 06 '24 23:02 ZhangzihanGit

I am having this problem with this command:

python -m vllm.entrypoints.openai.api_server --model abacusai/Smaug-72B-v0.1 --tensor-parallel-size 4 --trust-remote-code --gpu-memory-utilization 0.9 --host 0.0.0.0 --port 9002

but we get this:

ValueError: The model's max seq len (32768) is larger than the maximum number of tokens that can be stored in KV cache (8512). Try increasing gpu_memory_utilization or decreasing max_model_len when initializing the engine.

Is there a workaround to launch this from the command line?

silvacarl2 avatar Feb 19 '24 17:02 silvacarl2

Yes, it looks like you can add --max_model_len 4096 to your command.

https://github.com/vllm-project/vllm/blob/e433c115bce2bf27f7b1abdde7029566007d9eee/vllm/engine/arg_utils.py#L22

mhillebrand avatar Feb 19 '24 18:02 mhillebrand

thx, will try that!

silvacarl2 avatar Feb 19 '24 18:02 silvacarl2

Oops. You'll wanna use hyphens and not underscores: --max-model-len 4096.

https://github.com/vllm-project/vllm/blob/e433c115bce2bf27f7b1abdde7029566007d9eee/vllm/engine/arg_utils.py#L143
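Putting the corrected flag together with the command from the question, here is a small sketch that assembles the launch command and prints it for copy-pasting. `build_server_command` is a hypothetical helper; the flags are the ones already used in this thread plus `--max-model-len`, and 4096 is just the value suggested above (anything up to the reported 8512 should pass the check):

```python
import shlex
from typing import List


def build_server_command(model: str, max_model_len: int,
                         tensor_parallel_size: int,
                         gpu_memory_utilization: float,
                         host: str, port: int) -> List[str]:
    """Assemble the api_server command line with hyphenated flags."""
    return [
        "python", "-m", "vllm.entrypoints.openai.api_server",
        "--model", model,
        "--tensor-parallel-size", str(tensor_parallel_size),
        "--trust-remote-code",
        "--gpu-memory-utilization", str(gpu_memory_utilization),
        "--max-model-len", str(max_model_len),  # hyphens, not underscores
        "--host", host,
        "--port", str(port),
    ]


cmd = build_server_command(
    model="abacusai/Smaug-72B-v0.1",
    max_model_len=4096,
    tensor_parallel_size=4,
    gpu_memory_utilization=0.9,
    host="0.0.0.0",
    port=9002,
)
print(shlex.join(cmd))
```

The list form can also be handed straight to `subprocess.run(cmd)` if you prefer to launch the server from Python.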

mhillebrand avatar Feb 19 '24 18:02 mhillebrand

yup found that LOL!

silvacarl2 avatar Feb 19 '24 19:02 silvacarl2

Same error, same fix. Weird. Why would they initialize this variable with such a large value?

ElinLiu0 avatar Feb 23 '24 07:02 ElinLiu0

Is there a solution to this problem now? I still encounter this problem on gemma-7b.

Nuclear6 avatar Feb 27 '24 12:02 Nuclear6

Trying a lower model length should be fine. Just keep watching the logs and make sure there is still Q/K/V cache space remaining on your machine while you host your local Gemma.

ElinLiu0 avatar Feb 27 '24 12:02 ElinLiu0

The documentation states that gemma-7b is supported, along with many other large models. Is it because of my machine configuration? This is an RTX 4090 desktop computer.

Nuclear6 avatar Feb 27 '24 12:02 Nuclear6

No idea, mate. I'm currently using Alibaba Cloud's Qwen1.5-7B-INT4; with the model length set to 1024, it works as expected.

ElinLiu0 avatar Feb 27 '24 12:02 ElinLiu0

My guess is that the machine configuration is incorrect.

image

Nuclear6 avatar Feb 27 '24 12:02 Nuclear6

What tool are you using there? It looks pretty cool.

ElinLiu0 avatar Feb 27 '24 12:02 ElinLiu0

Sorry, I only just noticed that you were translating into Chinese, my apologies.

ElinLiu0 avatar Feb 27 '24 12:02 ElinLiu0

https://rahulschand.github.io/gpu_poor/

Nuclear6 avatar Feb 27 '24 12:02 Nuclear6

Try setting gpu_memory_utilization=0.95 or 1.0 for vLLM. It should then run successfully.
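As a rough back-of-the-envelope check of how much this buys you, here is a sketch under several assumptions that are not from this thread: a Llama-2-7B-style shape (32 layers, 32 KV heads, head dimension 128, fp16), a nominal 24 GiB card, a 16-token block size, and a cache budget that scales with total memory times gpu_memory_utilization. The helper names are hypothetical:

```python
GIB = 1024 ** 3


def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, dtype_bytes: int = 2) -> int:
    """Bytes of KV cache per token: one key and one value tensor per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def extra_tokens_from_utilization(total_gpu_bytes: int, old_util: float,
                                  new_util: float, bytes_per_token: int,
                                  block_size: int = 16) -> int:
    """Approximate extra cache tokens gained by raising the utilization,
    rounded down to whole blocks."""
    extra_bytes = total_gpu_bytes * (new_util - old_util)
    extra_blocks = int(extra_bytes // (bytes_per_token * block_size))
    return max(extra_blocks, 0) * block_size


# Assumed model shape (Llama-2-7B style, fp16) and a nominal 24 GiB GPU.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128)
gained = extra_tokens_from_utilization(24 * GIB, 0.90, 0.95, per_token)
shortfall = 4096 - 3664  # numbers from the first post
print(per_token, gained, shortfall)
```

Under these assumptions the gain from 0.90 to 0.95 is larger than the 432-token shortfall in the first post, but a bigger model or a GPU shared with other processes can easily change that, so lowering max_model_len remains the more predictable fix.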

DsnTgr avatar Mar 02 '24 06:03 DsnTgr

Changing gpu_memory_utilization does not work for me. Can you post the modified files and code?

Nuclear6 avatar Mar 05 '24 13:03 Nuclear6

https://github.com/vllm-project/vllm/blob/24aecf421a4ad5989697010963074904fead9a1b/vllm/engine/arg_utils.py#L30
https://github.com/vllm-project/vllm/blob/24aecf421a4ad5989697010963074904fead9a1b/vllm/entrypoints/llm.py#L51

Code

from vllm import LLM, SamplingParams

llm = LLM(model="HuggingFaceH4/zephyr-7b-beta", gpu_memory_utilization=0.95)

...

DsnTgr avatar Mar 06 '24 07:03 DsnTgr

Running this model with Hugging Face or vLLM on an RTX 4090 works for me. I also got google/gemma-7b working with HF.

DsnTgr avatar Mar 06 '24 07:03 DsnTgr

Hello,

I used to launch the engine as follows:

python -m vllm.entrypoints.openai.api_server --model="codellama/CodeLlama-13b-Instruct-hf" --tensor-parallel-size=2

With 2 NVIDIA L4 GPUs it now shows the same error: ValueError: The model's max seq len (16384) is larger than the maximum number of tokens that can be stored in KV cache (14528). Try increasing gpu_memory_utilization or decreasing max_model_len when initializing the engine.

Why is this happening, and how can I return to the previous configuration? I have already run a set of experiments with the previous configuration, and I must keep it the same.

SafeyahShemali avatar Mar 12 '24 15:03 SafeyahShemali

I see the code self.max_num_batched_tokens = max(max_model_len, 2048) from https://github.com/vllm-project/vllm/blob/e221910e77087743a50560e4ae69c3c2a12beb53/vllm/config.py#L486 and "model_max_length": 1000000000000000019884624838656, from https://huggingface.co/codellama/CodeLlama-13b-hf/blob/main/tokenizer_config.json

Maybe you changed the max_model_len like https://github.com/vllm-project/vllm/issues/322#issuecomment-1874997867, but I'm not sure.
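To make the quoted default concrete, here is a tiny sketch of what `max(max_model_len, 2048)` evaluates to for the lengths mentioned in this thread. It only mirrors the single line quoted above; `default_max_num_batched_tokens` is a hypothetical name, not a vLLM function:

```python
def default_max_num_batched_tokens(max_model_len: int) -> int:
    """Mirror of the quoted line: max(max_model_len, 2048)."""
    return max(max_model_len, 2048)


# Context lengths that appear in this thread.
for length in (1024, 2048, 4096, 16384):
    print(length, default_max_num_batched_tokens(length))
```

So lowering max_model_len below 2048 does not shrink this default any further, while a 16384-token model raises it to 16384.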

DsnTgr avatar Mar 13 '24 07:03 DsnTgr

I am unsure if this would suit me as I need to keep the engine setting the same for the whole experiment.

Could anyone clarify whether this trick changes the model's performance at inference time?

SafeyahShemali avatar Mar 14 '24 06:03 SafeyahShemali

Is max_model_len=2048 arbitrary, or is it simply the maximum number of tokens I can expect at inference?

silvacarl2 avatar Apr 04 '24 18:04 silvacarl2