                        OutOfMemoryError
Hello! New (old) problem 🙂
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacty of 39.39 GiB of which 17.94 MiB is free. Including non-PyTorch memory, this process has 39.36 GiB memory in use. Of the allocated memory 38.72 GiB is allocated by PyTorch, and 12.91 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Command:
python -m vllm.entrypoints.openai.api_server 
--model mistralai/Mixtral-8x7B-v0.1
Any tips?
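(For reference, the max_split_size_mb hint in that message is a PyTorch allocator setting passed through an environment variable, not a vLLM flag. I assume it would be set roughly like the line below, where the 128 MiB value is just an example; though with roughly 90 GB of bf16 weights for Mixtral-8x7B, I doubt fragmentation tuning alone can make it fit on a single 40 GB GPU.)
PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128 python -m vllm.entrypoints.openai.api_server --model mistralai/Mixtral-8x7B-v0.1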
I am having a similar issue with 2 x A6000 (so 96 GB VRAM in total).
I am running the vllm/vllm-openai Docker image and I am initialising it with:
Docker options: --runtime nvidia --gpus all -v ./workspace:/root/.cache/huggingface -p 8000:8000 --ipc=host
OnStart script: python3 -m vllm.entrypoints.openai.api_server --model mistralai/Mixtral-8x7B-Instruct-v0.1 --tensor-parallel-size 2 --max-model-len 8000
I tried reducing the model sequence length, but I am still unable to fit it in 96 GB; it runs fine with 128 GB. Can anyone advise what needs to be done to make it work? Do I need to use quantization? I know that vLLM has an issue with AWQ and Mistral on multiple GPUs.
Below is the log from the vLLM start:
Initializing an LLM engine with config: model='mistralai/Mixtral-8x7B-Instruct-v0.1', tokenizer='mistralai/Mixtral-8x7B-Instruct-v0.1', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8000, download_dir=None, load_format=auto, tensor_parallel_size=2, quantization=None, enforce_eager=False, seed=0)
INFO 01-05 10:44:10 llm_engine.py:275] # GPU blocks: 0, # CPU blocks: 4096
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/workspace/vllm/entrypoints/openai/api_server.py", line 737, in <module>
    engine = AsyncLLMEngine.from_engine_args(engine_args)
  File "/workspace/vllm/engine/async_llm_engine.py", line 500, in from_engine_args
    engine = cls(parallel_config.worker_use_ray,
  File "/workspace/vllm/engine/async_llm_engine.py", line 273, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/workspace/vllm/engine/async_llm_engine.py", line 318, in _init_engine
    return engine_class(*args, **kwargs)
  File "/workspace/vllm/engine/llm_engine.py", line 114, in __init__
    self._init_cache()
  File "/workspace/vllm/engine/llm_engine.py", line 279, in _init_cache
    raise ValueError("No available memory for the cache blocks. "
ValueError: No available memory for the cache blocks. Try increasing gpu_memory_utilization when initializing the engine.
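If I did go the quantization route, I assume the invocation would look roughly like the line below. The AWQ checkpoint name is just an example I have not verified, and --dtype half is there because, as far as I know, the AWQ kernels expect fp16 rather than bfloat16:
python3 -m vllm.entrypoints.openai.api_server --model TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ --quantization awq --dtype half --tensor-parallel-size 2 --max-model-len 8000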
I also encountered the same problem.
Increase gpu_memory_utilization to 0.95 or 1.
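With the OpenAI server this is a command-line flag (the default is 0.90). A sketch based on the commands above, assuming the rest of the setup stays the same:
python3 -m vllm.entrypoints.openai.api_server --model mistralai/Mixtral-8x7B-Instruct-v0.1 --tensor-parallel-size 2 --max-model-len 8000 --gpu-memory-utilization 0.95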
Also seeing this on an A100 with Mixtral.
Is anyone able to run it on 4 A10 GPUs? 4*24GB=96GB
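Untested, but I assume the command for that setup would just raise the tensor-parallel degree to match the four GPUs, something like:
python3 -m vllm.entrypoints.openai.api_server --model mistralai/Mixtral-8x7B-Instruct-v0.1 --tensor-parallel-size 4 --max-model-len 8000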
@shixianc I've tried using 2 x A100 80 GB GPUs with no luck either; when I do, I get a different error, the same one as outlined here: https://github.com/vllm-project/vllm/issues/1116.
So if I leave out the --tensor-parallel-size flag, I get the error mentioned in this issue, and if I add it, I get the GPU-count error (despite nvidia-smi seeing both GPUs).
https://github.com/vllm-project/vllm/issues/2413
This may be helpful.
Realized there was a Kubernetes config issue on my end; I'm able to run Mixtral properly now using 2 x A100 80 GB GPUs, sorry about that!
Could you please share more detail on how you fixed your problem? I have 4 x A100 GPUs with Mixtral but still encounter this issue.