vLLM Ray OOM in tensor-parallel mode

I can deploy the vLLM service on a single GPU, but when I use multiple GPUs I get a Ray OOM error. Could you please help me solve this problem? My model is yahma/llama-7b-hf, my transformers version is 4.28.0, and my CUDA version is 11.4.

2023-06-30 09:24:53,455 WARNING utils.py:593 -- Detecting docker specified CPUs. In previous versions of Ray, CPU detection in containers was incorrect. Please ensure that Ray has enough CPUs allocated. As a temporary workaround to revert to the prior behavior, set RAY_USE_MULTIPROCESSING_CPU_COUNT=1 as an env var before starting Ray. Set the env var: RAY_DISABLE_DOCKER_CPU_WARNING=1 to mute this warning. 2023-06-30 09:24:53,459 WARNING services.py:1826 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 67108864 bytes available. This will harm performance! You may be able to free up space by deleting files in /dev/shm. If you are inside a Docker container, you can increase /dev/shm size by passing '--shm-size=6.12gb' to 'docker run' (or add it to the run_options list in a Ray cluster config). Make sure to set this to more than 30% of available RAM. 2023-06-30 09:24:53,584 INFO worker.py:1636 -- Started a local Ray instance. INFO 06-30 09:24:54 llm_engine.py:59] Initializing an LLM engine with config: model='/opt/app/yahma-llama-lora', dtype=torch.float16, use_dummy_weights=False, download_dir=None, use_np_weights=False, tensor_parallel_size=4, seed=0) WARNING 06-30 09:24:54 config.py:131] Possibly too large swap space. 16.00 GiB out of the 32.00 GiB total CPU memory is allocated for the swap space. 
/opt/app/yahma-llama-lora Exception in thread ray_print_logs: Traceback (most recent call last): File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner self.run() File "/usr/lib/python3.8/threading.py", line 870, in run self._target(*self._args, **self._kwargs) File "/usr/local/lib/python3.8/dist-packages/ray/_private/worker.py", line 900, in print_logs global_worker_stdstream_dispatcher.emit(data) File "/usr/local/lib/python3.8/dist-packages/ray/_private/ray_logging.py", line 264, in emit handle(data) File "/usr/local/lib/python3.8/dist-packages/ray/_private/worker.py", line 1788, in print_to_stdstream print_worker_logs(batch, sink) File "/usr/local/lib/python3.8/dist-packages/ray/_private/worker.py", line 1950, in print_worker_logs restore_tqdm() File "/usr/local/lib/python3.8/dist-packages/ray/_private/worker.py", line 1973, in restore_tqdm tqdm_ray.instance().unhide_bars() File "/usr/local/lib/python3.8/dist-packages/ray/experimental/tqdm_ray.py", line 344, in instance _manager = _BarManager() File "/usr/local/lib/python3.8/dist-packages/ray/experimental/tqdm_ray.py", line 256, in init self.should_colorize = not ray.widgets.util.in_notebook() File "/usr/local/lib/python3.8/dist-packages/ray/widgets/util.py", line 205, in in_notebook shell = _get_ipython_shell_name() File "/usr/local/lib/python3.8/dist-packages/ray/widgets/util.py", line 194, in _get_ipython_shell_name import IPython File "/usr/local/lib/python3.8/dist-packages/IPython/init.py", line 30, in raise ImportError( ImportError: IPython 8.13+ supports Python 3.9 and above, following NEP 29. IPython 8.0-8.12 supports Python 3.8 and above, following NEP 29. When using Python 2.7, please install IPython 5.x LTS Long Term Support version. Python 3.3 and 3.4 were supported up to IPython 6.x. Python 3.5 was supported with IPython 7.0 to 7.9. Python 3.6 was supported with IPython up to 7.16. Python 3.7 was still supported with the 7.x branch.

See IPython README.rst file for more information:

https://github.com/ipython/ipython/blob/main/README.rst

Traceback (most recent call last): File "", line 1, in File "/opt/app/vllm-0.1.1/vllm/entrypoints/llm.py", line 55, in init self.llm_engine = LLMEngine.from_engine_args(engine_args) File "/opt/app/vllm-0.1.1/vllm/engine/llm_engine.py", line 151, in from_engine_args engine = cls(*engine_configs, distributed_init_method, devices, File "/opt/app/vllm-0.1.1/vllm/engine/llm_engine.py", line 102, in init self._init_cache() File "/opt/app/vllm-0.1.1/vllm/engine/llm_engine.py", line 114, in _init_cache num_blocks = self._run_workers( File "/opt/app/vllm-0.1.1/vllm/engine/llm_engine.py", line 317, in _run_workers all_outputs = ray.get(all_outputs) File "/usr/local/lib/python3.8/dist-packages/ray/_private/auto_init_hook.py", line 18, in auto_init_wrapper return fn(*args, **kwargs) File "/usr/local/lib/python3.8/dist-packages/ray/_private/client_mode_hook.py", line 103, in wrapper return func(*args, **kwargs) File "/usr/local/lib/python3.8/dist-packages/ray/_private/worker.py", line 2542, in get raise value ray.exceptions.OutOfMemoryError: Task was killed due to the node running low on memory. Memory on the node (IP: 10.30.192.36, ID: 17400c6c9eee3bc1384c172eecd4e1ecf2992cbc7f50cb27d2dc60d7) where the task (task ID: ffffffffffffffff283e91f20257d747969124a201000000, name=Worker.init, pid=26332, memory used=4.54GB) was running was 31.27GB / 32.00GB (0.977298), which exceeds the memory usage threshold of 0.95. Ray killed this worker (ID: cb6154315a0e1a33d85683935ae20cf76eecd48230c3c4b3a5563fe4) because it was the most recently scheduled task; to see more information about memory usage on this node, use ray logs raylet.out -ip 10.30.192.36. To see the logs of the worker, use ray logs worker-cb6154315a0e1a33d85683935ae20cf76eecd48230c3c4b3a5563fe4*out -ip 10.30.192.36. 
Top 10 memory users:
PID	MEM(GB)	COMMAND
26333	4.60	ray::Worker.__init__
26332	4.54	ray::Worker.__init__
26331	4.51	ray::Worker.__init__
26330	4.47	ray::Worker.__init__
25044	0.23	python
25099	0.19	/usr/local/lib/python3.8/dist-packages/ray/core/src/ray/gcs/gcs_server --log_dir=/tmp/ray/session_20...
25340	0.06	ray::IDLE
25174	0.06	/usr/bin/python /usr/local/lib/python3.8/dist-packages/ray/dashboard/dashboard.py --host=127.0.0.1 -...
25310	0.06	/usr/bin/python -u /usr/local/lib/python3.8/dist-packages/ray/dashboard/agent.py --node-ip-address=1...
25349	0.05	ray::IDLE
Refer to the documentation on how to address the out of memory issue: https://docs.ray.io/en/latest/ray-core/scheduling/ray-oom-prevention.html. Consider provisioning more memory on this node or reducing task parallelism by requesting more CPUs per task. Set max_restarts and max_task_retries to enable retry when the task crashes due to OOM. To adjust the kill threshold, set the environment variable `RAY_memory_usage_threshold` when starting Ray. To disable worker killing, set the environment variable `RAY_memory_monitor_refresh_ms` to zero.

Jun 30 '23 09:06 liulfy

Hi @liulfy, this happens because we allocate 4 GB of CPU memory per GPU. Adding swap_space=1 when initializing LLM will solve the problem.
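As a rough check of that arithmetic: the warning in the log above ("16.00 GiB out of the 32.00 GiB total CPU memory is allocated for the swap space") matches 4 GiB per GPU on 4 GPUs. A minimal sketch, assuming the swap space is reserved once per GPU worker; the helper name is hypothetical and not part of vLLM.

```python
def total_swap_gib(swap_space_gib, tensor_parallel_size):
    """CPU memory reserved for swap across all GPU workers.

    Assumption: one swap_space-sized slice is reserved per worker.
    """
    return swap_space_gib * tensor_parallel_size


# 4 GiB per GPU on 4 GPUs reserves half of a 32 GiB host.
print(total_swap_gib(4, 4))  # 16
# With swap_space=1 the reservation drops to 4 GiB in total.
print(total_swap_gib(1, 4))  # 4
```

This only covers the swap reservation; other CPU memory used while the workers start up is not included.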

Jun 30 '23 09:06 WoosukKwon

@WoosukKwon Thank you for your answer! I tried swap_space, but the problem is not solved. My code is:

from vllm import LLM
model_path = 'yahma/llama-13b-hf'
llama_model = LLM(model=model_path, tensor_parallel_size=4, swap_space=1)

My host has 32 GB of CPU memory, and I use four A100 40GB GPUs. The error message is still the same:

2023-07-03 03:27:55,908 WARNING utils.py:593 -- Detecting docker specified CPUs. In previous versions of Ray, CPU detection in containers was incorrect. Please ensure that Ray has enough CPUs allocated. As a temporary workaround to revert to the prior behavior, set RAY_USE_MULTIPROCESSING_CPU_COUNT=1 as an env var before starting Ray. Set the env var: RAY_DISABLE_DOCKER_CPU_WARNING=1 to mute this warning.
2023-07-03 03:27:55,911 WARNING services.py:1826 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 67108864 bytes available. This will harm performance! You may be able to free up space by deleting files in /dev/shm. If you are inside a Docker container, you can increase /dev/shm size by passing '--shm-size=6.08gb' to 'docker run' (or add it to the run_options list in a Ray cluster config). Make sure to set this to more than 30% of available RAM.
2023-07-03 03:27:56,045 INFO worker.py:1636 -- Started a local Ray instance.
INFO 07-03 03:27:56 llm_engine.py:59] Initializing an LLM engine with config: model='/opt/app/yahma-llama-lora', dtype=torch.float16, use_dummy_weights=False, download_dir=None, use_np_weights=False, tensor_parallel_size=4, seed=0) /opt/app/yahma-llama-lora Traceback (most recent call last): File "", line 1, in File "/opt/app/vllm-0.1.1/vllm/entrypoints/llm.py", line 55, in init self.llm_engine = LLMEngine.from_engine_args(engine_args) File "/opt/app/vllm-0.1.1/vllm/engine/llm_engine.py", line 151, in from_engine_args engine = cls(*engine_configs, distributed_init_method, devices, File "/opt/app/vllm-0.1.1/vllm/engine/llm_engine.py", line 102, in init self._init_cache() File "/opt/app/vllm-0.1.1/vllm/engine/llm_engine.py", line 114, in _init_cache num_blocks = self._run_workers( File "/opt/app/vllm-0.1.1/vllm/engine/llm_engine.py", line 317, in _run_workers all_outputs = ray.get(all_outputs) File "/usr/local/lib/python3.8/dist-packages/ray/_private/auto_init_hook.py", line 18, in auto_init_wrapper return fn(*args, **kwargs) File "/usr/local/lib/python3.8/dist-packages/ray/_private/client_mode_hook.py", line 103, in wrapper return func(*args, **kwargs) File "/usr/local/lib/python3.8/dist-packages/ray/_private/worker.py", line 2542, in get raise value ray.exceptions.OutOfMemoryError: Task was killed due to the node running low on memory. Memory on the node (IP: 10.30.192.36, ID: 91847a2262e263f96264497d39d4641c385303a97ff78e3fc6f0e721) where the task (task ID: ffffffffffffffff27a08d091fe239dc78e7cd0c01000000, name=Worker.init, pid=51664, memory used=4.45GB) was running was 31.21GB / 32.00GB (0.97518), which exceeds the memory usage threshold of 0.95. Ray killed this worker (ID: ddd4c0e44d6355f85eb5027fac7616a529d599bb4e3193b1df451167) because it was the most recently scheduled task; to see more information about memory usage on this node, use ray logs raylet.out -ip 10.30.192.36. 
To see the logs of the worker, use ray logs worker-ddd4c0e44d6355f85eb5027fac7616a529d599bb4e3193b1df451167*out -ip 10.30.192.36.
Top 10 memory users:
PID	MEM(GB)	COMMAND
51660	4.45	ray::Worker.__init__
51664	4.45	ray::Worker.__init__
51658	4.42	ray::Worker.__init__
51662	4.41	ray::Worker.__init__
45071	0.27	python
50443	0.18	/usr/local/lib/python3.8/dist-packages/ray/core/src/ray/gcs/gcs_server --log_dir=/tmp/ray/session_20...
50650	0.06	/usr/bin/python -u /usr/local/lib/python3.8/dist-packages/ray/dashboard/agent.py --node-ip-address=1...
50694	0.05	ray::IDLE
50681	0.05	ray::IDLE
50688	0.05	ray::IDLE
Refer to the documentation on how to address the out of memory issue: https://docs.ray.io/en/latest/ray-core/scheduling/ray-oom-prevention.html. Consider provisioning more memory on this node or reducing task parallelism by requesting more CPUs per task. Set max_restarts and max_task_retries to enable retry when the task crashes due to OOM. To adjust the kill threshold, set the environment variable `RAY_memory_usage_threshold` when starting Ray. To disable worker killing, set the environment variable `RAY_memory_monitor_refresh_ms` to zero.

Jul 03 '23 03:07 liulfy

I ran into the same problem.

Model:

25G	./llama-13b-lora-hf

free -h

              total        used        free      shared  buff/cache   available
Mem:           31Gi       2.1Gi        26Gi       0.0Ki       2.3Gi        28Gi
Swap:         8.0Gi       1.2Gi       6.8Gi

Initializing an LLM engine with config:

model='/data/ketadb/text-generation-webui/models/llama-13b-lora-hf/', tokenizer='hf-internal-testing/llama-tokenizer', tokenizer_mode=auto, dtype=torch.float16, use_dummy_weights=False, download_dir=None, use_np_weights=False, tensor_parallel_size=2, seed=0

This is the error:

2023-07-04 20:26:39,125	INFO worker.py:1452 -- Connecting to existing Ray cluster at address: 192.168.1.240:6379...
2023-07-04 20:26:39,141	INFO worker.py:1636 -- Connected to Ray cluster.
INFO 07-04 20:26:39 llm_engine.py:60] Initializing an LLM engine with config: model='/data/ketadb/text-generation-webui/models/llama-13b-lora-hf/', tokenizer='hf-internal-testing/llama-tokenizer', tokenizer_mode=auto, dtype=torch.float16, use_dummy_weights=False, download_dir=None, use_np_weights=False, tensor_parallel_size=2, seed=0)
INFO 07-04 20:26:39 tokenizer.py:28] For some LLaMA-based models, initializing the fast tokenizer may take a long time. To eliminate the initialization time, consider using 'hf-internal-testing/llama-tokenizer' instead of the original tokenizer.
Traceback (most recent call last):
  File "api_server.py", line 80, in <module>
    engine = AsyncLLMEngine.from_engine_args(engine_args)
  File "/home/ubuntu/wangjibo/vllm-main/vllm/engine/async_llm_engine.py", line 232, in from_engine_args
    engine = cls(engine_args.worker_use_ray,
  File "/home/ubuntu/wangjibo/vllm-main/vllm/engine/async_llm_engine.py", line 55, in __init__
    self.engine = engine_class(*args, **kwargs)
  File "/home/ubuntu/wangjibo/vllm-main/vllm/engine/llm_engine.py", line 105, in __init__
    self._init_cache()
  File "/home/ubuntu/wangjibo/vllm-main/vllm/engine/llm_engine.py", line 117, in _init_cache
    num_blocks = self._run_workers(
  File "/home/ubuntu/wangjibo/vllm-main/vllm/engine/llm_engine.py", line 334, in _run_workers
    all_outputs = ray.get(all_outputs)
  File "/home/ubuntu/miniconda3/envs/vllm/lib/python3.8/site-packages/ray/_private/auto_init_hook.py", line 18, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/vllm/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/vllm/lib/python3.8/site-packages/ray/_private/worker.py", line 2542, in get
    raise value
ray.exceptions.OutOfMemoryError: Task was killed due to the node running low on memory.
Memory on the node (IP: 192.168.1.240, ID: 3d1f1d89602340cb023e506de2a0dd5eb353e2ec29b8800cdf553655) where the task (task ID: ffffffffffffffffaece4988873caddc35d289400c000000, name=Worker.__init__, pid=3290289, memory used=13.74GB) was running was 29.74GB / 31.17GB (0.954021), which exceeds the memory usage threshold of 0.95. Ray killed this worker (ID: 078e723f2df7c75d778b26c2703d072ddf697853e5f8bdf0e0ba9efa) because it was the most recently scheduled task; to see more information about memory usage on this node, use `ray logs raylet.out -ip 192.168.1.240`. To see the logs of the worker, use `ray logs worker-078e723f2df7c75d778b26c2703d072ddf697853e5f8bdf0e0ba9efa*out -ip 192.168.1.240`.
Top 10 memory users:
PID	MEM(GB)	COMMAND
3290289	13.74	ray::Worker.__init__
3290288	13.67	ray::Worker.__init__
3290170	0.21	python api_server.py --model /data/ketadb/text-generation-webui/models/llama-13b-lora-hf/ --tokenize...
3290084	0.11	/home/ubuntu/ketad/agent/subprocess/bin/keta-agent/keta-agent -c keta-agent.yaml
3220981	0.02	ray::IDLE
3219445	0.02	/home/ubuntu/miniconda3/envs/vllm/lib/python3.8/site-packages/ray/core/src/ray/gcs/gcs_server --log_...
3220978	0.02	ray::IDLE
3220980	0.02	ray::IDLE
3220979	0.02	ray::IDLE
3220990	0.02	ray::IDLE
Refer to the documentation on how to address the out of memory issue: https://docs.ray.io/en/latest/ray-core/scheduling/ray-oom-prevention.html. Consider provisioning more memory on this node or reducing task parallelism by requesting more CPUs per task. Set max_restarts and max_task_retries to enable retry when the task crashes due to OOM. To adjust the kill threshold, set the environment variable `RAY_memory_usage_threshold` when starting Ray. To disable worker killing, set the environment variable `RAY_memory_monitor_refresh_ms` to zero.

I guess vLLM allocates more memory for the model than the model's size on disk. Is there a formula for calculating the memory size?

Jul 03 '23 22:07 jibowang

Me too. Maybe the Ray memory monitor is measuring memory usage incorrectly? I found that a lot of memory was occupied by the system buffer/cache, and Ray treats it as unavailable according to its error log.
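One way to check this hypothesis on a Linux host is to compare the used fraction with and without buffers/page cache counted as used. A minimal sketch, assuming /proc/meminfo-style input; the function names and the sample numbers are made up for illustration, and this does not reproduce how Ray itself measures memory.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: kibibytes}."""
    fields = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def used_fraction(fields, count_cache_as_used):
    """Fraction of RAM in use, with or without buffers/page cache counted."""
    free_key = "MemFree" if count_cache_as_used else "MemAvailable"
    return 1.0 - fields[free_key] / fields["MemTotal"]


# Illustrative sample only; on a real host use open("/proc/meminfo").read().
sample = (
    "MemTotal:       32000000 kB\n"
    "MemFree:         2000000 kB\n"
    "MemAvailable:   20000000 kB\n"
)
info = parse_meminfo(sample)
print("cache counted as used:", used_fraction(info, True))
print("cache counted as free:", used_fraction(info, False))
```

If the first number is above 0.95 while the second is far below it, the page cache is a plausible explanation for a premature kill.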

Jul 04 '23 03:07 CtfGo

Me too. Maybe the Ray memory monitor is measuring memory usage incorrectly? I found that a lot of memory was occupied by the system buffer/cache, and Ray treats it as unavailable according to its error log.

Disabling the Ray memory monitor with export RAY_memory_monitor_refresh_ms=0 works for me: https://docs.ray.io/en/master/ray-core/scheduling/ray-oom-prevention.html#how-do-i-disable-the-memory-monitor

related issue: https://github.com/ray-project/ray/issues/10895
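The same workaround can be applied from Python when the script starts its own local Ray instance. A minimal sketch, assuming the variable is set before Ray is started; if you connect to an existing Ray cluster, it most likely has to be set in the environment where that cluster was started instead. The 0.98 value in the commented alternative is an arbitrary example, not a recommendation.

```python
import os

# Must be set before Ray starts, i.e. before vLLM launches its Ray workers.
# Per the Ray error message, a value of zero disables worker killing.
os.environ["RAY_memory_monitor_refresh_ms"] = "0"

# Alternative (assumption: you would rather keep the monitor but loosen it):
# os.environ["RAY_memory_usage_threshold"] = "0.98"

# ...then create the engine as usual, e.g.
# from vllm import LLM
# llm = LLM(model=model_path, tensor_parallel_size=4)
print(os.environ["RAY_memory_monitor_refresh_ms"])
```

With the monitor disabled, a host that really does run out of memory is left to the operating system's own OOM handling, so this is a workaround rather than a fix.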

Jul 04 '23 03:07 CtfGo

Me too. Maybe the Ray memory monitor is measuring memory usage incorrectly? I found that a lot of memory was occupied by the system buffer/cache, and Ray treats it as unavailable according to its error log.

Disabling the Ray memory monitor with export RAY_memory_monitor_refresh_ms=0 works for me: https://docs.ray.io/en/master/ray-core/scheduling/ray-oom-prevention.html#how-do-i-disable-the-memory-monitor

related issue: ray-project/ray#10895

This does not work for me. I set NCCL_DEBUG=INFO and my log is as follows: 2023-07-04 08:55:35,247 INFO utils.py:573 -- Detected RAY_USE_MULTIPROCESSING_CPU_COUNT=1: Using multiprocessing.cpu_count() to detect the number of CPUs. This may be inconsistent when used inside docker. To correctly detect CPUs, unset the env var: RAY_USE_MULTIPROCESSING_CPU_COUNT. 2023-07-04 08:55:35,252 WARNING services.py:1826 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 67108864 bytes available. This will harm performance! You may be able to free up space by deleting files in /dev/shm. If you are inside a Docker container, you can increase /dev/shm size by passing '--shm-size=5.75gb' to 'docker run' (or add it to the run_options list in a Ray cluster config). Make sure to set this to more than 30% of available RAM. 2023-07-04 08:55:35,377 INFO worker.py:1636 -- Started a local Ray instance. INFO 07-04 08:55:37 llm_engine.py:59] Initializing an LLM engine with config: model='/opt/app/yahma-llama-lora', dtype=torch.float16, use_dummy_weights=False, download_dir=None, use_np_weights=False, tensor_parallel_size=4, seed=0) /opt/app/yahma-llama-lora Traceback (most recent call last): File "", line 1, in File "/opt/app/vllm-0.1.1/vllm/entrypoints/llm.py", line 55, in init self.llm_engine = LLMEngine.from_engine_args(engine_args) File "/opt/app/vllm-0.1.1/vllm/engine/llm_engine.py", line 151, in from_engine_args engine = cls(*engine_configs, distributed_init_method, devices, File "/opt/app/vllm-0.1.1/vllm/engine/llm_engine.py", line 102, in init self._init_cache() File "/opt/app/vllm-0.1.1/vllm/engine/llm_engine.py", line 114, in _init_cache num_blocks = self._run_workers( File "/opt/app/vllm-0.1.1/vllm/engine/llm_engine.py", line 317, in _run_workers all_outputs = ray.get(all_outputs) File "/usr/local/lib/python3.8/dist-packages/ray/_private/auto_init_hook.py", line 18, in auto_init_wrapper return fn(*args, **kwargs) File 
"/usr/local/lib/python3.8/dist-packages/ray/_private/client_mode_hook.py", line 103, in wrapper return func(*args, **kwargs) File "/usr/local/lib/python3.8/dist-packages/ray/_private/worker.py", line 2542, in get raise value ray.exceptions.RayActorError: The actor died because of an error raised in its creation task, ray::Worker.init() (pid=26149, ip=10.30.192.153, actor_id=c84d115691be21b19ce79faa01000000, repr=<vllm.worker.worker.Worker object at 0x7f989c09ffd0>) File "/opt/app/vllm-0.1.1/vllm/worker/worker.py", line 40, in init _init_distributed_environment(parallel_config, rank, File "/opt/app/vllm-0.1.1/vllm/worker/worker.py", line 302, in _init_distributed_environment torch.distributed.all_reduce(torch.zeros(1).cuda()) File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 1451, in wrapper return func(*args, **kwargs) File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 1700, in all_reduce work = default_pg.allreduce([tensor], opts) torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1207, internal error, NCCL version 2.14.3 ncclInternalError: Internal check failed. Last error: Bootstrap : no socket interface found

(Worker pid=26149) 2023-07-04 08:55:43,893 ERROR worker.py:861 -- Exception raised in creation task: The actor died because of an error raised in its creation task, ray::Worker.init() (pid=26149, ip=10.30.192.153, actor_id=c84d115691be21b19ce79faa01000000, repr=<vllm.worker.worker.Worker object at 0x7f989c09ffd0>) (Worker pid=26149) File "/opt/app/vllm-0.1.1/vllm/worker/worker.py", line 40, in init (Worker pid=26149) _init_distributed_environment(parallel_config, rank, (Worker pid=26149) File "/opt/app/vllm-0.1.1/vllm/worker/worker.py", line 302, in _init_distributed_environment (Worker pid=26149) torch.distributed.all_reduce(torch.zeros(1).cuda()) (Worker pid=26149) File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 1451, in wrapper (Worker pid=26149) return func(*args, **kwargs) (Worker pid=26149) File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 1700, in all_reduce (Worker pid=26149) work = default_pg.allreduce([tensor], opts) (Worker pid=26149) torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1207, internal error, NCCL version 2.14.3 (Worker pid=26149) ncclInternalError: Internal check failed. (Worker pid=26149) Last error: (Worker pid=26149) Bootstrap : no socket interface found (Worker pid=26150) RuntimeError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Connection reset by peer. This may indicate a possible application crash on rank 0 or a network set up issue.

Jul 04 '23 08:07 liulfy

hi, we're having the same issue. Has anyone found a solution for this yet?

Jul 10 '23 21:07 justusmattern27

Same issue here, but I suspect it has nothing to do with Ray.

Jul 13 '23 08:07 lucasjinreal

mark

Sep 01 '23 16:09 Oliver-ss

Mark, I have the same problem.

Sep 06 '23 09:09 a1164714

same problem here

Sep 08 '23 04:09 flexwang

I'm having the same issue.

Sep 12 '23 20:09 saumya-saran

same here, mark

Sep 25 '23 05:09 pfldy2850

In my humble opinion, there might be a problem when loading the model checkpoint.

https://github.com/vllm-project/vllm/blob/bbbf86565f2fb2bab0cf6675f9ebefcd449390bd/vllm/model_executor/models/llama.py#L336-L339

This loop needs some CPU memory per GPU device to load each checkpoint file.

In @liulfy's case, a 9.8 GB checkpoint file (pytorch_model-00001-of-00002.bin) is loaded by all workers at the same time.
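To make the scale concrete, here is a back-of-the-envelope upper bound, assuming every worker holds one full checkpoint file in CPU memory at the same moment. The helper name is hypothetical; the 9.8 GB file size, 4 workers, and 32 GB host come from this thread.

```python
def peak_load_gb(largest_shard_gb, num_workers):
    """Upper bound on CPU memory if all workers load the same file at once."""
    return largest_shard_gb * num_workers


host_ram_gb = 32.0
need_gb = peak_load_gb(9.8, 4)

print(round(need_gb, 1))      # 39.2
print(need_gb > host_ram_gb)  # True
```

Under this assumption the workers alone would need more memory than the host has, which is consistent with Ray killing a worker once usage passes the 0.95 threshold.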

Sep 25 '23 06:09 pfldy2850

Indeed, after re-sharding my model's checkpoint into smaller pieces, it works normally for me.
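The comment does not say how the checkpoint was re-sharded. One possible way, sketched here under the assumption that the model is a standard Hugging Face checkpoint, is to reload it and save it again with a smaller max_shard_size. The function name, the paths, and the "2GB" value are placeholders chosen for illustration.

```python
def reshard_checkpoint(src_dir, dst_dir, max_shard_size="2GB"):
    """Re-save a Hugging Face checkpoint with smaller shard files."""
    # Imported lazily so this file can be imported without these packages.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model = AutoModelForCausalLM.from_pretrained(
        src_dir,
        torch_dtype=torch.float16,
        low_cpu_mem_usage=True,
    )
    # Each saved weight file will be at most max_shard_size.
    model.save_pretrained(dst_dir, max_shard_size=max_shard_size)
    # Copy the tokenizer so dst_dir is usable on its own.
    AutoTokenizer.from_pretrained(src_dir).save_pretrained(dst_dir)


# Example call (placeholder paths):
# reshard_checkpoint("./llama-13b-hf", "./llama-13b-hf-resharded", "2GB")
```

Note that this one-off step itself needs enough CPU memory to hold the whole model once, so it may have to run on a machine with more RAM than the serving host.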

Sep 25 '23 06:09 pfldy2850

As far as I know, there is no way to partially load a large checkpoint file at the code level. (Loading a checkpoint file requires as much memory as the size of the file.)

Any ideas on how vLLM can solve these problems?

Sep 25 '23 07:09 pfldy2850

Same issue here. Has anyone fixed it yet?

Oct 10 '23 13:10 smallmocha

same here, mark

Oct 13 '23 15:10 lonngxiang

I hit the same issue and figured out how to fix it. I have already created PR #1395.

Oct 17 '23 11:10 boydfd

@boydfd This does not seem to fix the issue. In my case it does not happen when loading the model: I get the OOM after running for several days.

Dec 04 '23 12:12 smallmocha

@boydfd This does not seem to fix the issue. In my case it does not happen when loading the model: I get the OOM after running for several days.

Maybe you can share more info?

Dec 05 '23 05:12 boydfd

@boydfd This does not seem to fix the issue. In my case it does not happen when loading the model: I get the OOM after running for several days.

Maybe you can share more info?

Same issue here. I've found some info that may help:

1. It works fine with --tensor-parallel-size 1, that is, without Ray. The CPU memory usage is static.
2. With --tensor-parallel-size 2, vLLM uses Ray, and as the model runs inference the CPU memory increases slowly until OOM.
3. With --enforce-eager along with --tensor-parallel-size 2, the CPU memory increases much more slowly (about 5x slower), but it still grows until OOM.
4. The memory leak occurs whether or not vLLM runs in a container.
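To put numbers on observations like these, the resident memory of a worker process can be sampled over time. A minimal sketch, assuming a Linux host where /proc/<pid>/status exposes a VmRSS field; the function names and sample values are hypothetical and only illustrate the calculation.

```python
import re


def rss_gib(status_text):
    """Extract resident set size in GiB from /proc/<pid>/status-style text."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    if match is None:
        raise ValueError("no VmRSS field found")
    return int(match.group(1)) / 1024**2


def growth_gib_per_hour(samples):
    """Average growth between the first and last (seconds, GiB) sample."""
    (t0, m0), (t1, m1) = samples[0], samples[-1]
    return (m1 - m0) / ((t1 - t0) / 3600)


# Illustrative input; on a real host read open(f"/proc/{pid}/status").read().
print(rss_gib("Name:\tpython\nVmRSS:\t 4194304 kB\n"))    # 4.0
print(growth_gib_per_hour([(0, 4.0), (7200, 5.0)]))      # 0.5
```

Comparing the growth rate with and without --enforce-eager would give a measured version of the "about 5x slower" estimate.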

Model: llama-7b
CUDA version: 12.1

Dec 30 '23 14:12 AzureSilent

It seems that if you lower --max-model-len, it starts. For example, start with a command like:
python -m vllm.entrypoints.api_server --model /workspace/model/ --tensor-parallel-size 4 --max-model-len 6000

Jan 03 '24 08:01 chaos318

If anybody runs vLLM on Triton server: Triton will automatically run an instance of your LLM on every available GPU. So if you have 2 GPUs and run with --tensor-parallel-size 2, it will create 2 instances and split each of them across the GPUs, which may lead to OOM. Solution: specify which GPU is your "main" GPU in your config.pbtxt:
instance_group [
  {
    count: 1
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]

Jan 17 '24 03:01 Taiinguyenn139

I wrote the same answer in issue #721. Can you try this?

I had this issue when using a Docker container. I was able to work around it by mounting an empty directory at /tmp/ray. I hope this helps someone.

For example,
mkdir ./tmp_local
docker run -v ./tmp_local:/tmp/ray ...

Jan 22 '24 02:01 HAN-oQo

I encountered the same OOM error message, and I guess there is still no other solution.

model: Llama-2-7b
cuda version: 12.2
vllm version: 0.3.0
multi gpus (8)

Feb 06 '24 13:02 su-park

I resolved my case by setting enforce_eager=True, at the cost of slower generation. Thank you all.
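For reference, a minimal sketch of what that setting might look like with the Python API, assuming a vLLM version that accepts enforce_eager (the 0.3.0 setup above does). The helper name and the model path are placeholders, and the actual engine construction is left commented out because it needs the GPUs.

```python
def build_engine_kwargs(model, num_gpus):
    """Keyword arguments for vllm.LLM with eager execution enforced."""
    return {
        "model": model,
        "tensor_parallel_size": num_gpus,
        # Run the model eagerly instead of capturing CUDA graphs.
        "enforce_eager": True,
    }


kwargs = build_engine_kwargs("/path/to/Llama-2-7b", 8)
print(kwargs["enforce_eager"], kwargs["tensor_parallel_size"])  # True 8

# from vllm import LLM
# llm = LLM(**kwargs)
```

The trade-off reported in this thread is slower generation in exchange for lower memory use.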

Feb 07 '24 22:02 su-park