                        OutOfMemoryError
Hello! New (old) problem 🙂
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacty of 39.39 GiB of which 17.94 MiB is free. Including non-PyTorch memory, this process has 39.36 GiB memory in use. Of the allocated memory 38.72 GiB is allocated by PyTorch, and 12.91 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Command:
python -m vllm.entrypoints.openai.api_server 
--model mistralai/Mixtral-8x7B-v0.1
Any tips?
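(For reference, the max_split_size_mb hint in that message is a PyTorch allocator setting passed through an environment variable, not a vLLM flag. I assume it would be set roughly like the line below, where the 128 MiB value is just an example; though with roughly 90 GB of bf16 weights for Mixtral-8x7B, I doubt fragmentation tuning alone can make it fit on a single 40 GB GPU.)
PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128 python -m vllm.entrypoints.openai.api_server --model mistralai/Mixtral-8x7B-v0.1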
I am having a similar issue with 2 x A6000 (so 96 GB VRAM in total).
I am running the vllm/vllm-openai Docker image and I am initialising it with:
Docker options: --runtime nvidia --gpus all -v ./workspace:/root/.cache/huggingface -p 8000:8000 --ipc=host
OnStart script: python3 -m vllm.entrypoints.openai.api_server --model mistralai/Mixtral-8x7B-Instruct-v0.1 --tensor-parallel-size 2 --max-model-len 8000
I tried reducing the model sequence length, but I am still unable to fit it in 96 GB; it runs fine with 128 GB. Can anyone advise what needs to be done to make it work? Do I need to use quantization? I know that vLLM has an issue with AWQ and Mistral on multiple GPUs.
Below is the log from the vLLM start:
Initializing an LLM engine with config: model='mistralai/Mixtral-8x7B-Instruct-v0.1', tokenizer='mistralai/Mixtral-8x7B-Instruct-v0.1', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8000, download_dir=None, load_format=auto, tensor_parallel_size=2, quantization=None, enforce_eager=False, seed=0)
INFO 01-05 10:44:10 llm_engine.py:275] # GPU blocks: 0, # CPU blocks: 4096
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/workspace/vllm/entrypoints/openai/api_server.py", line 737, in <module>
    engine = AsyncLLMEngine.from_engine_args(engine_args)
  File "/workspace/vllm/engine/async_llm_engine.py", line 500, in from_engine_args
    engine = cls(parallel_config.worker_use_ray,
  File "/workspace/vllm/engine/async_llm_engine.py", line 273, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/workspace/vllm/engine/async_llm_engine.py", line 318, in _init_engine
    return engine_class(*args, **kwargs)
  File "/workspace/vllm/engine/llm_engine.py", line 114, in __init__
    self._init_cache()
  File "/workspace/vllm/engine/llm_engine.py", line 279, in _init_cache
    raise ValueError("No available memory for the cache blocks. "
ValueError: No available memory for the cache blocks. Try increasing gpu_memory_utilization when initializing the engine.
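If I did go the quantization route, I assume the invocation would look roughly like the line below. The AWQ checkpoint name is just an example I have not verified, and --dtype half is there because, as far as I know, the AWQ kernels expect fp16 rather than bfloat16:
python3 -m vllm.entrypoints.openai.api_server --model TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ --quantization awq --dtype half --tensor-parallel-size 2 --max-model-len 8000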
I also encountered the same problem.
Increase gpu_memory_utilization to 0.95 or 1.
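With the OpenAI server this is a command-line flag (the default is 0.90). A sketch based on the commands above, assuming the rest of the setup stays the same:
python3 -m vllm.entrypoints.openai.api_server --model mistralai/Mixtral-8x7B-Instruct-v0.1 --tensor-parallel-size 2 --max-model-len 8000 --gpu-memory-utilization 0.95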
Also seeing this on an A100 with Mixtral.
Is anyone able to run it on 4 A10 GPUs? 4*24GB=96GB
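Untested, but I assume the command for that setup would just raise the tensor-parallel degree to match the four GPUs, something like:
python3 -m vllm.entrypoints.openai.api_server --model mistralai/Mixtral-8x7B-Instruct-v0.1 --tensor-parallel-size 4 --max-model-len 8000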
@shixianc I've tried using 2 x A100 80 GB GPUs with no luck either; when I do, I get a different error, the same one as outlined here: https://github.com/vllm-project/vllm/issues/1116.
So if I leave out the --tensor-parallel-size flag, I get the error mentioned in this issue, and if I add it, I get the GPU-count error (despite nvidia-smi seeing both GPUs).
https://github.com/vllm-project/vllm/issues/2413
This may be helpful.
Realized there was a Kubernetes config issue on my end; I'm able to run Mixtral properly now using 2 x A100 80 GB GPUs, sorry about that!
Could you please share more detail on how you fixed your problem? I have 4 x A100 GPUs with Mixtral but still encounter this issue.