
[Usage]: What's the minimum VRAM needed to use the entire context length for Llama 3.1 70B and 405B?

Open · aflah02 opened this issue 1 year ago · 1 comment

Your current environment

Libraries installed:

"vllm==0.5.5",
"torch==2.4.0",
"transformers==4.44.2",
"ray",
"hf-transfer",
"huggingface_hub"

How would you like to use vllm

Hi, I want to run Llama 3.1 70B and 405B with a 120K context length. I have access to several 8xH100 nodes, but most tutorial code snippets fail with errors of the form ValueError: The model's max seq len (131072) is larger than the maximum number of tokens that can be stored in KV cache (17840). Try increasing gpu_memory_utilization or decreasing max_model_len when initializing the engine. I would like an estimate of how many 8xH100 nodes I need for each model so that there is enough VRAM to run it at the full context length.
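For a rough sense of scale, the KV-cache footprint at full context can be estimated from the published model configs (80 layers, 8 KV heads, head dim 128 for 70B; 126 layers, 8 KV heads, head dim 128 for 405B). The sketch below assumes 16-bit weights and 16-bit KV cache and ignores activation and CUDA-graph overhead, so treat the numbers as ballpark only:

# Back-of-the-envelope VRAM estimate for full-context Llama 3.1 serving.
# Config values are taken from the published HF configs; results are approximate.

def kv_cache_gib(num_layers, num_kv_heads, head_dim, seq_len, dtype_bytes=2):
    # factor of 2 = keys and values, stored per token, per layer
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes * seq_len / 2**30

SEQ = 131_072  # full Llama 3.1 context length

models = {
    # name: (params in billions, layers, kv_heads, head_dim)
    "70B": (70, 80, 8, 128),
    "405B": (405, 126, 8, 128),
}

for name, (params_b, layers, kv_heads, head_dim) in models.items():
    weights_gib = params_b * 1e9 * 2 / 2**30  # bf16/fp16 weights
    kv_gib = kv_cache_gib(layers, kv_heads, head_dim, SEQ)
    print(f"{name}: weights ~{weights_gib:.0f} GiB, "
          f"KV cache @ {SEQ} tokens ~{kv_gib:.0f} GiB, "
          f"total ~{weights_gib + kv_gib:.0f} GiB")

By this estimate, 70B needs roughly 130 GiB of weights plus ~40 GiB of KV cache for one full-length sequence, so a single 8xH100-80GB node (~640 GiB) should be enough with headroom to spare, whereas 405B in 16 bits comes to roughly 820 GiB and would need two nodes (or a single node with an FP8-quantized checkpoint). These are rough numbers, not a verified configuration.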

Before submitting a new issue...

  • [X] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

aflah02 avatar Sep 05 '24 10:09 aflah02

Are you using '--tensor-parallel-size 8'? 17840 seems small (at least for 70B). I basically have the same question. I was getting ValueError: The model's max seq len (131072) is larger than the maximum number of tokens that can be stored in KV cache (118208). when trying to run Meta-Llama-3.1-70B-Instruct-Q4_K_M.gguf on 80 or 96 GB, so I tried 160 GB and it stopped complaining, but when I send a really large request vLLM crashes. Every worker process prints the same traceback (see the engine-argument sketch after the log below):

(VllmWorkerProcess pid=1572) ERROR 09-10 11:08:30 multiproc_worker_utils.py:226] Exception in worker VllmWorkerProcess while processing method start_worker_execution_loop: The following operation failed in the TorchScript interpreter.
(VllmWorkerProcess pid=1572) ERROR 09-10 11:08:30 multiproc_worker_utils.py:226] Traceback of TorchScript (most recent call last):
(VllmWorkerProcess pid=1572) ERROR 09-10 11:08:30 multiproc_worker_utils.py:226]   File "/root/vllm_env/lib/python3.10/site-packages/vllm/model_executor/layers/vocab_parallel_embedding.py", line 153, in get_masked_input_and_mask
(VllmWorkerProcess pid=1572) ERROR 09-10 11:08:30 multiproc_worker_utils.py:226]     vocab_mask = org_vocab_mask | added_vocab_mask
(VllmWorkerProcess pid=1572) ERROR 09-10 11:08:30 multiproc_worker_utils.py:226]     input_ = vocab_mask * (input_ - valid_offset)
(VllmWorkerProcess pid=1572) ERROR 09-10 11:08:30 multiproc_worker_utils.py:226]     return input_, ~vocab_mask
(VllmWorkerProcess pid=1572) ERROR 09-10 11:08:30 multiproc_worker_utils.py:226]                    ~~~~~~~~~~~ <--- HERE
(VllmWorkerProcess pid=1572) ERROR 09-10 11:08:30 multiproc_worker_utils.py:226] RuntimeError: CUDA error: invalid configuration argument
(VllmWorkerProcess pid=1572) ERROR 09-10 11:08:30 multiproc_worker_utils.py:226] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(VllmWorkerProcess pid=1572) ERROR 09-10 11:08:30 multiproc_worker_utils.py:226] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(VllmWorkerProcess pid=1572) ERROR 09-10 11:08:30 multiproc_worker_utils.py:226] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(VllmWorkerProcess pid=1572) ERROR 09-10 11:08:30 multiproc_worker_utils.py:226]
(VllmWorkerProcess pid=1572) ERROR 09-10 11:08:30 multiproc_worker_utils.py:226] , Traceback (most recent call last):
(VllmWorkerProcess pid=1572) ERROR 09-10 11:08:30 multiproc_worker_utils.py:226]   File "/root/vllm_env/lib/python3.10/site-packages/vllm/executor/multiproc_worker_utils.py", line 223, in _run_worker_process
(VllmWorkerProcess pid=1572) ERROR 09-10 11:08:30 multiproc_worker_utils.py:226]     output = executor(*args, **kwargs)
(VllmWorkerProcess pid=1572) ERROR 09-10 11:08:30 multiproc_worker_utils.py:226]   File "/root/vllm_env/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorkerProcess pid=1572) ERROR 09-10 11:08:30 multiproc_worker_utils.py:226]     return func(*args, **kwargs)
(VllmWorkerProcess pid=1572) ERROR 09-10 11:08:30 multiproc_worker_utils.py:226]   File "/root/vllm_env/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 69, in start_worker_execution_loop
(VllmWorkerProcess pid=1572) ERROR 09-10 11:08:30 multiproc_worker_utils.py:226]     output = self.execute_model(execute_model_req=None)
(VllmWorkerProcess pid=1572) ERROR 09-10 11:08:30 multiproc_worker_utils.py:226]   File "/root/vllm_env/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 327, in execute_model
(VllmWorkerProcess pid=1572) ERROR 09-10 11:08:30 multiproc_worker_utils.py:226]     output = self.model_runner.execute_model(
(VllmWorkerProcess pid=1572) ERROR 09-10 11:08:30 multiproc_worker_utils.py:226]   File "/root/vllm_env/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorkerProcess pid=1572) ERROR 09-10 11:08:30 multiproc_worker_utils.py:226]     return func(*args, **kwargs)
(VllmWorkerProcess pid=1572) ERROR 09-10 11:08:30 multiproc_worker_utils.py:226]   File "/root/vllm_env/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 1450, in execute_model
(VllmWorkerProcess pid=1572) ERROR 09-10 11:08:30 multiproc_worker_utils.py:226]     hidden_or_intermediate_states = model_executable(
(VllmWorkerProcess pid=1572) ERROR 09-10 11:08:30 multiproc_worker_utils.py:226]   File "/root/vllm_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
(VllmWorkerProcess pid=1572) ERROR 09-10 11:08:30 multiproc_worker_utils.py:226]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=1572) ERROR 09-10 11:08:30 multiproc_worker_utils.py:226]   File "/root/vllm_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
(VllmWorkerProcess pid=1572) ERROR 09-10 11:08:30 multiproc_worker_utils.py:226]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=1572) ERROR 09-10 11:08:30 multiproc_worker_utils.py:226]   File "/root/vllm_env/lib/python3.10/site-packages/vllm/model_executor/models/llama.py", line 429, in forward
(VllmWorkerProcess pid=1572) ERROR 09-10 11:08:30 multiproc_worker_utils.py:226]     model_output = self.model(input_ids, positions, kv_caches,
(VllmWorkerProcess pid=1572) ERROR 09-10 11:08:30 multiproc_worker_utils.py:226]   File "/root/vllm_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
(VllmWorkerProcess pid=1572) ERROR 09-10 11:08:30 multiproc_worker_utils.py:226]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=1572) ERROR 09-10 11:08:30 multiproc_worker_utils.py:226]   File "/root/vllm_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
(VllmWorkerProcess pid=1572) ERROR 09-10 11:08:30 multiproc_worker_utils.py:226]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=1572) ERROR 09-10 11:08:30 multiproc_worker_utils.py:226]   File "/root/vllm_env/lib/python3.10/site-packages/vllm/model_executor/models/llama.py", line 320, in forward
(VllmWorkerProcess pid=1572) ERROR 09-10 11:08:30 multiproc_worker_utils.py:226]     hidden_states = self.get_input_embeddings(input_ids)
(VllmWorkerProcess pid=1572) ERROR 09-10 11:08:30 multiproc_worker_utils.py:226]   File "/root/vllm_env/lib/python3.10/site-packages/vllm/model_executor/models/llama.py", line 305, in get_input_embeddings
(VllmWorkerProcess pid=1572) ERROR 09-10 11:08:30 multiproc_worker_utils.py:226]     return self.embed_tokens(input_ids)
(VllmWorkerProcess pid=1572) ERROR 09-10 11:08:30 multiproc_worker_utils.py:226]   File "/root/vllm_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
(VllmWorkerProcess pid=1572) ERROR 09-10 11:08:30 multiproc_worker_utils.py:226]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=1572) ERROR 09-10 11:08:30 multiproc_worker_utils.py:226]   File "/root/vllm_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
(VllmWorkerProcess pid=1572) ERROR 09-10 11:08:30 multiproc_worker_utils.py:226]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=1572) ERROR 09-10 11:08:30 multiproc_worker_utils.py:226]   File "/root/vllm_env/lib/python3.10/site-packages/vllm/model_executor/layers/vocab_parallel_embedding.py", line 391, in forward
(VllmWorkerProcess pid=1572) ERROR 09-10 11:08:30 multiproc_worker_utils.py:226]     masked_input, input_mask = get_masked_input_and_mask(
(VllmWorkerProcess pid=1572) ERROR 09-10 11:08:30 multiproc_worker_utils.py:226] RuntimeError: The following operation failed in the TorchScript interpreter.
(VllmWorkerProcess pid=1572) ERROR 09-10 11:08:30 multiproc_worker_utils.py:226] Traceback of TorchScript (most recent call last):
(VllmWorkerProcess pid=1572) ERROR 09-10 11:08:30 multiproc_worker_utils.py:226]   File "/root/vllm_env/lib/python3.10/site-packages/vllm/model_executor/layers/vocab_parallel_embedding.py", line 153, in get_masked_input_and_mask
(VllmWorkerProcess pid=1572) ERROR 09-10 11:08:30 multiproc_worker_utils.py:226]     vocab_mask = org_vocab_mask | added_vocab_mask
(VllmWorkerProcess pid=1572) ERROR 09-10 11:08:30 multiproc_worker_utils.py:226]     input_ = vocab_mask * (input_ - valid_offset)
(VllmWorkerProcess pid=1572) ERROR 09-10 11:08:30 multiproc_worker_utils.py:226]     return input_, ~vocab_mask
(VllmWorkerProcess pid=1572) ERROR 09-10 11:08:30 multiproc_worker_utils.py:226]                    ~~~~~~~~~~~ <--- HERE
(VllmWorkerProcess pid=1572) ERROR 09-10 11:08:30 multiproc_worker_utils.py:226] RuntimeError: CUDA error: invalid configuration argument
(VllmWorkerProcess pid=1572) ERROR 09-10 11:08:30 multiproc_worker_utils.py:226] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(VllmWorkerProcess pid=1572) ERROR 09-10 11:08:30 multiproc_worker_utils.py:226] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(VllmWorkerProcess pid=1572) ERROR 09-10 11:08:30 multiproc_worker_utils.py:226] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
INFO 09-10 11:08:33 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 25.5%, CPU KV cache usage: 0.0%.
INFO 09-10 11:08:43 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 25.5%, CPU KV cache usage: 0.0%.
INFO 09-10 11:08:53 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 25.5%, CPU KV cache usage: 0.0%.
INFO 09-10 11:09:03 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 25.5%, CPU KV cache usage: 0.0%.
INFO 09-10 11:09:13 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 25.5%, CPU KV cache usage: 0.0%.
INFO 09-10 11:09:23 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 25.5%, CPU KV cache usage: 0.0%.
ERROR 09-10 11:09:29 async_llm_engine.py:960] Engine iteration timed out. This should never happen!
ERROR 09-10 11:09:29 async_llm_engine.py:63] Engine background task failed
ERROR 09-10 11:09:29 async_llm_engine.py:63] Traceback (most recent call last):
ERROR 09-10 11:09:29 async_llm_engine.py:63]   File "/root/vllm_env/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 933, in run_engine_loop
ERROR 09-10 11:09:29 async_llm_engine.py:63]     done, _ = await asyncio.wait(
ERROR 09-10 11:09:29 async_llm_engine.py:63]   File "/opt/conda/lib/python3.10/asyncio/tasks.py", line 384, in wait
ERROR 09-10 11:09:29 async_llm_engine.py:63]     return await _wait(fs, timeout, return_when, loop)
ERROR 09-10 11:09:29 async_llm_engine.py:63]   File "/opt/conda/lib/python3.10/asyncio/tasks.py", line 495, in _wait
ERROR 09-10 11:09:29 async_llm_engine.py:63]     await waiter
ERROR 09-10 11:09:29 async_llm_engine.py:63] asyncio.exceptions.CancelledError
ERROR 09-10 11:09:29 async_llm_engine.py:63] 
ERROR 09-10 11:09:29 async_llm_engine.py:63] During handling of the above exception, another exception occurred:
ERROR 09-10 11:09:29 async_llm_engine.py:63] 
ERROR 09-10 11:09:29 async_llm_engine.py:63] Traceback (most recent call last):
ERROR 09-10 11:09:29 async_llm_engine.py:63]   File "/root/vllm_env/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 53, in _log_task_completion
ERROR 09-10 11:09:29 async_llm_engine.py:63]     return_value = task.result()
ERROR 09-10 11:09:29 async_llm_engine.py:63]   File "/root/vllm_env/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 932, in run_engine_loop
ERROR 09-10 11:09:29 async_llm_engine.py:63]     async with asyncio_timeout(ENGINE_ITERATION_TIMEOUT_S):
ERROR 09-10 11:09:29 async_llm_engine.py:63]   File "/root/vllm_env/lib/python3.10/site-packages/vllm/engine/async_timeout.py", line 95, in __aexit__
ERROR 09-10 11:09:29 async_llm_engine.py:63]     self._do_exit(exc_type)
ERROR 09-10 11:09:29 async_llm_engine.py:63]   File "/root/vllm_env/lib/python3.10/site-packages/vllm/engine/async_timeout.py", line 178, in _do_exit
ERROR 09-10 11:09:29 async_llm_engine.py:63]     raise asyncio.TimeoutError
ERROR 09-10 11:09:29 async_llm_engine.py:63] asyncio.exceptions.TimeoutError
Exception in callback functools.partial(<function _log_task_completion at 0x7fb22bd31a20>, error_callback=<bound method AsyncLLMEngine._error_callback of <vllm.engine.async_llm_engine.AsyncLLMEngine object at 0x7fa8341295a0>>)
handle: <Handle functools.partial(<function _log_task_completion at 0x7fb22bd31a20>, error_callback=<bound method AsyncLLMEngine._error_callback of <vllm.engine.async_llm_engine.AsyncLLMEngine object at 0x7fa8341295a0>>)>
Traceback (most recent call last):
  File "/root/vllm_env/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 933, in run_engine_loop
    done, _ = await asyncio.wait(
  File "/opt/conda/lib/python3.10/asyncio/tasks.py", line 384, in wait
    return await _wait(fs, timeout, return_when, loop)
  File "/opt/conda/lib/python3.10/asyncio/tasks.py", line 495, in _wait
    await waiter
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/vllm_env/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 53, in _log_task_completion
    return_value = task.result()
  File "/root/vllm_env/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 932, in run_engine_loop
    async with asyncio_timeout(ENGINE_ITERATION_TIMEOUT_S):
  File "/root/vllm_env/lib/python3.10/site-packages/vllm/engine/async_timeout.py", line 95, in __aexit__
    self._do_exit(exc_type)
  File "/root/vllm_env/lib/python3.10/site-packages/vllm/engine/async_timeout.py", line 178, in _do_exit
    raise asyncio.TimeoutError
asyncio.exceptions.TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "uvloop/cbhandles.pyx", line 63, in uvloop.loop.Handle._run
  File "/root/vllm_env/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 65, in _log_task_completion
    raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
ERROR 09-10 11:09:29 client.py:266] Got Unhealthy response from RPC Server
ERROR 09-10 11:09:29 client.py:412] AsyncEngineDeadError('Background loop is stopped.')
ERROR 09-10 11:09:29 client.py:412] Traceback (most recent call last):
ERROR 09-10 11:09:29 client.py:412]   File "/root/vllm_env/lib/python3.10/site-packages/vllm/entrypoints/openai/rpc/client.py", line 409, in generate
ERROR 09-10 11:09:29 client.py:412]     await self.check_health(socket=socket)
ERROR 09-10 11:09:29 client.py:412]   File "/root/vllm_env/lib/python3.10/site-packages/vllm/entrypoints/openai/rpc/client.py", line 429, in check_health
ERROR 09-10 11:09:29 client.py:412]     await self._send_one_way_rpc_request(
ERROR 09-10 11:09:29 client.py:412]   File "/root/vllm_env/lib/python3.10/site-packages/vllm/entrypoints/openai/rpc/client.py", line 267, in _send_one_way_rpc_request
ERROR 09-10 11:09:29 client.py:412]     raise response
ERROR 09-10 11:09:29 client.py:412] vllm.engine.async_llm_engine.AsyncEngineDeadError: Background loop is stopped.
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/root/vllm_env/lib/python3.10/site-packages/starlette/responses.py", line 257, in __call__
    await wrap(partial(self.listen_for_disconnect, receive))
  File "/root/vllm_env/lib/python3.10/site-packages/starlette/responses.py", line 253, in wrap
    await func()
  File "/root/vllm_env/lib/python3.10/site-packages/starlette/responses.py", line 230, in listen_for_disconnect
    message = await receive()
  File "/root/vllm_env/lib/python3.10/site-packages/uvicorn/protocols/http/httptools_impl.py", line 555, in receive
    await self.message_event.wait()
  File "/opt/conda/lib/python3.10/asyncio/locks.py", line 213, in wait
    await fut
asyncio.exceptions.CancelledError: Cancelled by cancel scope 7f7266215ae0

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/vllm_env/lib/python3.10/site-packages/uvicorn/protocols/http/httptools_impl.py", line 401, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/root/vllm_env/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 70, in __call__
    return await self.app(scope, receive, send)
  File "/root/vllm_env/lib/python3.10/site-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/root/vllm_env/lib/python3.10/site-packages/starlette/applications.py", line 113, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/root/vllm_env/lib/python3.10/site-packages/starlette/middleware/errors.py", line 187, in __call__
    raise exc
  File "/root/vllm_env/lib/python3.10/site-packages/starlette/middleware/errors.py", line 165, in __call__
    await self.app(scope, receive, _send)
  File "/root/vllm_env/lib/python3.10/site-packages/starlette/middleware/cors.py", line 85, in __call__
    await self.app(scope, receive, send)
  File "/root/vllm_env/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 62, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/root/vllm_env/lib/python3.10/site-packages/starlette/_exception_handler.py", line 62, in wrapped_app
    raise exc
  File "/root/vllm_env/lib/python3.10/site-packages/starlette/_exception_handler.py", line 51, in wrapped_app
    await app(scope, receive, sender)
  File "/root/vllm_env/lib/python3.10/site-packages/starlette/routing.py", line 715, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/root/vllm_env/lib/python3.10/site-packages/starlette/routing.py", line 735, in app
    await route.handle(scope, receive, send)
  File "/root/vllm_env/lib/python3.10/site-packages/starlette/routing.py", line 288, in handle
    await self.app(scope, receive, send)
  File "/root/vllm_env/lib/python3.10/site-packages/starlette/routing.py", line 76, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/root/vllm_env/lib/python3.10/site-packages/starlette/_exception_handler.py", line 62, in wrapped_app
    raise exc
  File "/root/vllm_env/lib/python3.10/site-packages/starlette/_exception_handler.py", line 51, in wrapped_app
    await app(scope, receive, sender)
  File "/root/vllm_env/lib/python3.10/site-packages/starlette/routing.py", line 74, in app
    await response(scope, receive, send)
  File "/root/vllm_env/lib/python3.10/site-packages/starlette/responses.py", line 250, in __call__
    async with anyio.create_task_group() as task_group:
  File "/root/vllm_env/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 680, in __aexit__
    raise BaseExceptionGroup(
exceptiongroup.ExceptionGroup: unhandled errors in a TaskGroup (1 sub-exception)
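
For reference, the knobs mentioned above map onto vLLM's standard engine arguments. A minimal sketch of the offline-inference API (the model name and values are illustrative, not a verified working setup for this GGUF checkpoint):

from vllm import LLM

# Shard weights and KV cache across all 8 GPUs on the node; cap the context
# so the KV cache fits, or raise gpu_memory_utilization for more cache headroom.
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    tensor_parallel_size=8,        # split the model across 8 GPUs
    max_model_len=131072,          # lower this if the KV-cache check still fails
    gpu_memory_utilization=0.95,   # default is 0.90
)

The same options exist on the OpenAI-compatible server as --tensor-parallel-size, --max-model-len, and --gpu-memory-utilization.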

dmatora avatar Sep 10 '24 11:09 dmatora

@dmatora I hit the same problem when running Meta-Llama-3.1-70B-Instruct-Q4_K_M.gguf on an A100 80GB.

studyww0 avatar Sep 23 '24 07:09 studyww0

Well, Ollama got KV cache quantization support and Qwen released the 2.5 models, whose 32B outperforms Llama 3.1 70B, so I no longer need multiple A100s to get GPT-4-level quality; I can do that on my machine with a single 24GB 3090.
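
Worth noting for anyone landing here: vLLM also exposes KV-cache quantization through the kv_cache_dtype engine argument, which roughly halves the cache footprint versus fp16. A minimal sketch, assuming a GPU and build where fp8 KV cache is supported (model choice is illustrative):

from vllm import LLM

llm = LLM(
    model="Qwen/Qwen2.5-32B-Instruct",  # illustrative model choice
    kv_cache_dtype="fp8",               # quantized KV cache (hardware/build dependent)
    max_model_len=32768,
)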

dmatora avatar Sep 23 '24 14:09 dmatora

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!

github-actions[bot] avatar Dec 23 '24 02:12 github-actions[bot]

This issue has been automatically closed due to inactivity. Please feel free to reopen if you feel it is still relevant. Thank you!

github-actions[bot] avatar Jan 23 '25 01:01 github-actions[bot]