vllm
ray OOM in tensor parallel
In my case, I can deploy the vLLM service on a single GPU, but when I use multiple GPUs, I hit a Ray OOM error. Could you please help solve this problem? My model is yahma/llama-7b-hf, my transformers version is 4.28.0, and my CUDA version is 11.4.
2023-06-30 09:24:53,455 WARNING utils.py:593 -- Detecting docker specified CPUs. In previous versions of Ray, CPU detection in containers was incorrect. Please ensure that Ray has enough CPUs allocated. As a temporary workaround to revert to the prior behavior, set RAY_USE_MULTIPROCESSING_CPU_COUNT=1
as an env var before starting Ray. Set the env var: RAY_DISABLE_DOCKER_CPU_WARNING=1 to mute this warning.
2023-06-30 09:24:53,459 WARNING services.py:1826 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 67108864 bytes available. This will harm performance! You may be able to free up space by deleting files in /dev/shm. If you are inside a Docker container, you can increase /dev/shm size by passing '--shm-size=6.12gb' to 'docker run' (or add it to the run_options list in a Ray cluster config). Make sure to set this to more than 30% of available RAM.
2023-06-30 09:24:53,584 INFO worker.py:1636 -- Started a local Ray instance.
INFO 06-30 09:24:54 llm_engine.py:59] Initializing an LLM engine with config: model='/opt/app/yahma-llama-lora', dtype=torch.float16, use_dummy_weights=False, download_dir=None, use_np_weights=False, tensor_parallel_size=4, seed=0)
WARNING 06-30 09:24:54 config.py:131] Possibly too large swap space. 16.00 GiB out of the 32.00 GiB total CPU memory is allocated for the swap space.
/opt/app/yahma-llama-lora
Exception in thread ray_print_logs:
Traceback (most recent call last):
File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
self.run()
File "/usr/lib/python3.8/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.8/dist-packages/ray/_private/worker.py", line 900, in print_logs
global_worker_stdstream_dispatcher.emit(data)
File "/usr/local/lib/python3.8/dist-packages/ray/_private/ray_logging.py", line 264, in emit
handle(data)
File "/usr/local/lib/python3.8/dist-packages/ray/_private/worker.py", line 1788, in print_to_stdstream
print_worker_logs(batch, sink)
File "/usr/local/lib/python3.8/dist-packages/ray/_private/worker.py", line 1950, in print_worker_logs
restore_tqdm()
File "/usr/local/lib/python3.8/dist-packages/ray/_private/worker.py", line 1973, in restore_tqdm
tqdm_ray.instance().unhide_bars()
File "/usr/local/lib/python3.8/dist-packages/ray/experimental/tqdm_ray.py", line 344, in instance
_manager = _BarManager()
File "/usr/local/lib/python3.8/dist-packages/ray/experimental/tqdm_ray.py", line 256, in __init__
self.should_colorize = not ray.widgets.util.in_notebook()
File "/usr/local/lib/python3.8/dist-packages/ray/widgets/util.py", line 205, in in_notebook
shell = _get_ipython_shell_name()
File "/usr/local/lib/python3.8/dist-packages/ray/widgets/util.py", line 194, in _get_ipython_shell_name
import IPython
File "/usr/local/lib/python3.8/dist-packages/IPython/init.py", line 30, in
See IPython README.rst
file for more information:
https://github.com/ipython/ipython/blob/main/README.rst
Traceback (most recent call last):
File "ray logs raylet.out -ip 10.30.192.36
. To see the logs of the worker, use ray logs worker-cb6154315a0e1a33d85683935ae20cf76eecd48230c3c4b3a5563fe4*out -ip 10.30.192.36. Top 10 memory users: PID MEM(GB) COMMAND 26333 4.60 ray::Worker.__init__ 26332 4.54 ray::Worker.__init__ 26331 4.51 ray::Worker.__init__ 26330 4.47 ray::Worker.__init__ 25044 0.23 python 25099 0.19 /usr/local/lib/python3.8/dist-packages/ray/core/src/ray/gcs/gcs_server --log_dir=/tmp/ray/session_20... 25340 0.06 ray::IDLE 25174 0.06 /usr/bin/python /usr/local/lib/python3.8/dist-packages/ray/dashboard/dashboard.py --host=127.0.0.1 -... 25310 0.06 /usr/bin/python -u /usr/local/lib/python3.8/dist-packages/ray/dashboard/agent.py --node-ip-address=1... 25349 0.05 ray::IDLE Refer to the documentation on how to address the out of memory issue: https://docs.ray.io/en/latest/ray-core/scheduling/ray-oom-prevention.html. Consider provisioning more memory on this node or reducing task parallelism by requesting more CPUs per task. Set max_restarts and max_task_retries to enable retry when the task crashes due to OOM. To adjust the kill threshold, set the environment variable
`RAY_memory_usage_threshold` when starting Ray. To disable worker killing, set the environment variable `RAY_memory_monitor_refresh_ms` to zero.
Hi @liulfy, it's because we allocate 4 GB of CPU memory per GPU. Adding swap_space=1 when initializing LLM will solve the problem.
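The per-GPU allocation adds up across tensor-parallel workers, which matches the "16.00 GiB out of the 32.00 GiB" warning in the log above (4 GiB per GPU times 4 GPUs). A minimal sketch of that arithmetic follows; the helper name is made up for illustration and is not part of vLLM.

```python
def total_swap_gib(swap_space_gib: float, tensor_parallel_size: int) -> float:
    """Total CPU swap space reserved, assuming one allocation per GPU worker."""
    return swap_space_gib * tensor_parallel_size


# Default from this thread: 4 GiB per GPU with tensor_parallel_size=4.
default_total = total_swap_gib(4, 4)
# Suggested setting: swap_space=1 with the same four GPUs.
reduced_total = total_swap_gib(1, 4)

print(f"default: {default_total} GiB, with swap_space=1: {reduced_total} GiB")
```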
@WoosukKwon Thank you for answering my question! I tried swap_space, but the problem is not solved. My code is here:
from vllm import LLM
model_path = 'yahma/llama-13b-hf'
llama_model = LLM(model=model_path, tensor_parallel_size=4, swap_space=1)
My machine has 32 GB of CPU memory, and I use four A100 40GB GPUs.
The error message is still the same:
2023-07-03 03:27:55,908 WARNING utils.py:593 -- Detecting docker specified CPUs. In previous versions of Ray, CPU detection in containers was incorrect. Please ensure that Ray has enough CPUs allocated. As a temporary workaround to revert to the prior behavior, set RAY_USE_MULTIPROCESSING_CPU_COUNT=1
as an env var before starting Ray. Set the env var: RAY_DISABLE_DOCKER_CPU_WARNING=1 to mute this warning.
2023-07-03 03:27:55,911 WARNING services.py:1826 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 67108864 bytes available. This will harm performance! You may be able to free up space by deleting files in /dev/shm. If you are inside a Docker container, you can increase /dev/shm size by passing '--shm-size=6.08gb' to 'docker run' (or add it to the run_options list in a Ray cluster config). Make sure to set this to more than 30% of available RAM.
2023-07-03 03:27:56,045 INFO worker.py:1636 -- Started a local Ray instance.
INFO 07-03 03:27:56 llm_engine.py:59] Initializing an LLM engine with config: model='/opt/app/yahma-llama-lora', dtype=torch.float16, use_dummy_weights=False, download_dir=None, use_np_weights=False, tensor_parallel_size=4, seed=0)
/opt/app/yahma-llama-lora
Traceback (most recent call last):
File "ray logs raylet.out -ip 10.30.192.36
. To see the logs of the worker, use ray logs worker-ddd4c0e44d6355f85eb5027fac7616a529d599bb4e3193b1df451167*out -ip 10.30.192.36. Top 10 memory users: PID MEM(GB) COMMAND 51660 4.45 ray::Worker.__init__ 51664 4.45 ray::Worker.__init__ 51658 4.42 ray::Worker.__init__ 51662 4.41 ray::Worker.__init__ 45071 0.27 python 50443 0.18 /usr/local/lib/python3.8/dist-packages/ray/core/src/ray/gcs/gcs_server --log_dir=/tmp/ray/session_20... 50650 0.06 /usr/bin/python -u /usr/local/lib/python3.8/dist-packages/ray/dashboard/agent.py --node-ip-address=1... 50694 0.05 ray::IDLE 50681 0.05 ray::IDLE 50688 0.05 ray::IDLE Refer to the documentation on how to address the out of memory issue: https://docs.ray.io/en/latest/ray-core/scheduling/ray-oom-prevention.html. Consider provisioning more memory on this node or reducing task parallelism by requesting more CPUs per task. Set max_restarts and max_task_retries to enable retry when the task crashes due to OOM. To adjust the kill threshold, set the environment variable
`RAY_memory_usage_threshold` when starting Ray. To disable worker killing, set the environment variable `RAY_memory_monitor_refresh_ms` to zero.
I hit the same problem.
model:
25G ./llama-13b-lora-hf
free -h
              total        used        free      shared  buff/cache   available
Mem:           31Gi       2.1Gi        26Gi       0.0Ki       2.3Gi        28Gi
Swap:         8.0Gi       1.2Gi       6.8Gi
Initializing an LLM engine with config:
model='/data/ketadb/text-generation-webui/models/llama-13b-lora-hf/', tokenizer='hf-internal-testing/llama-tokenizer', tokenizer_mode=auto, dtype=torch.float16, use_dummy_weights=False, download_dir=None, use_np_weights=False, tensor_parallel_size=2, seed=0
This is the error:
2023-07-04 20:26:39,125 INFO worker.py:1452 -- Connecting to existing Ray cluster at address: 192.168.1.240:6379...
2023-07-04 20:26:39,141 INFO worker.py:1636 -- Connected to Ray cluster.
INFO 07-04 20:26:39 llm_engine.py:60] Initializing an LLM engine with config: model='/data/ketadb/text-generation-webui/models/llama-13b-lora-hf/', tokenizer='hf-internal-testing/llama-tokenizer', tokenizer_mode=auto, dtype=torch.float16, use_dummy_weights=False, download_dir=None, use_np_weights=False, tensor_parallel_size=2, seed=0)
INFO 07-04 20:26:39 tokenizer.py:28] For some LLaMA-based models, initializing the fast tokenizer may take a long time. To eliminate the initialization time, consider using 'hf-internal-testing/llama-tokenizer' instead of the original tokenizer.
Traceback (most recent call last):
File "api_server.py", line 80, in <module>
engine = AsyncLLMEngine.from_engine_args(engine_args)
File "/home/ubuntu/wangjibo/vllm-main/vllm/engine/async_llm_engine.py", line 232, in from_engine_args
engine = cls(engine_args.worker_use_ray,
File "/home/ubuntu/wangjibo/vllm-main/vllm/engine/async_llm_engine.py", line 55, in __init__
self.engine = engine_class(*args, **kwargs)
File "/home/ubuntu/wangjibo/vllm-main/vllm/engine/llm_engine.py", line 105, in __init__
self._init_cache()
File "/home/ubuntu/wangjibo/vllm-main/vllm/engine/llm_engine.py", line 117, in _init_cache
num_blocks = self._run_workers(
File "/home/ubuntu/wangjibo/vllm-main/vllm/engine/llm_engine.py", line 334, in _run_workers
all_outputs = ray.get(all_outputs)
File "/home/ubuntu/miniconda3/envs/vllm/lib/python3.8/site-packages/ray/_private/auto_init_hook.py", line 18, in auto_init_wrapper
return fn(*args, **kwargs)
File "/home/ubuntu/miniconda3/envs/vllm/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
File "/home/ubuntu/miniconda3/envs/vllm/lib/python3.8/site-packages/ray/_private/worker.py", line 2542, in get
raise value
ray.exceptions.OutOfMemoryError: Task was killed due to the node running low on memory.
Memory on the node (IP: 192.168.1.240, ID: 3d1f1d89602340cb023e506de2a0dd5eb353e2ec29b8800cdf553655) where the task (task ID: ffffffffffffffffaece4988873caddc35d289400c000000, name=Worker.__init__, pid=3290289, memory used=13.74GB) was running was 29.74GB / 31.17GB (0.954021), which exceeds the memory usage threshold of 0.95. Ray killed this worker (ID: 078e723f2df7c75d778b26c2703d072ddf697853e5f8bdf0e0ba9efa) because it was the most recently scheduled task; to see more information about memory usage on this node, use `ray logs raylet.out -ip 192.168.1.240`. To see the logs of the worker, use `ray logs worker-078e723f2df7c75d778b26c2703d072ddf697853e5f8bdf0e0ba9efa*out -ip 192.168.1.240. Top 10 memory users:
PID MEM(GB) COMMAND
3290289 13.74 ray::Worker.__init__
3290288 13.67 ray::Worker.__init__
3290170 0.21 python api_server.py --model /data/ketadb/text-generation-webui/models/llama-13b-lora-hf/ --tokenize...
3290084 0.11 /home/ubuntu/ketad/agent/subprocess/bin/keta-agent/keta-agent -c keta-agent.yaml
3220981 0.02 ray::IDLE
3219445 0.02 /home/ubuntu/miniconda3/envs/vllm/lib/python3.8/site-packages/ray/core/src/ray/gcs/gcs_server --log_...
3220978 0.02 ray::IDLE
3220980 0.02 ray::IDLE
3220979 0.02 ray::IDLE
3220990 0.02 ray::IDLE
Refer to the documentation on how to address the out of memory issue: https://docs.ray.io/en/latest/ray-core/scheduling/ray-oom-prevention.html. Consider provisioning more memory on this node or reducing task parallelism by requesting more CPUs per task. Set max_restarts and max_task_retries to enable retry when the task crashes due to OOM. To adjust the kill threshold, set the environment variable `RAY_memory_usage_threshold` when starting Ray. To disable worker killing, set the environment variable `RAY_memory_monitor_refresh_ms` to zero.
I guess vLLM allocates more memory for the model than its size on disk. Is there a formula for calculating the memory size?
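No formula is given in this thread, but the numbers in the error above can at least be checked by hand. A small sketch using only values copied from that log (29.74 GB used of 31.17 GB, a threshold of 0.95, and two workers at 13.74 GB and 13.67 GB):

```python
# Values copied from the Ray OutOfMemoryError message above.
used_gb = 29.74
total_gb = 31.17
threshold = 0.95
worker_rss_gb = [13.74, 13.67]

usage_fraction = used_gb / total_gb
workers_gb = sum(worker_rss_gb)

print(f"node usage: {usage_fraction:.4f} (threshold {threshold})")
print(f"two workers together: {workers_gb:.2f} GB")
print("over threshold:", usage_fraction > threshold)
```

The small difference from the 0.954021 in the log comes from the rounded GB values printed in the message.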
Me too. Maybe the Ray memory monitor is measuring memory usage incorrectly? I found that a lot of memory was occupied by the system buffer/cache, and according to its error log Ray regards that memory as unavailable.
Disabling the Ray memory monitor with
export RAY_memory_monitor_refresh_ms=0
worked for me: https://docs.ray.io/en/master/ray-core/scheduling/ray-oom-prevention.html#how-do-i-disable-the-memory-monitor
Related issue: https://github.com/ray-project/ray/issues/10895
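The same setting can be applied from Python instead of the shell, as long as it happens before Ray starts. A minimal sketch, assuming the script launches a local Ray instance itself (as in the first logs above) rather than connecting to an already running cluster; in that second case the variable would have to be set where the cluster is started.

```python
import os

# Must be set before Ray is started, so do it before creating the vLLM engine.
os.environ["RAY_memory_monitor_refresh_ms"] = "0"

# The engine would be created after this point, for example:
# from vllm import LLM
# llm = LLM(model="yahma/llama-7b-hf", tensor_parallel_size=4)

print(os.environ["RAY_memory_monitor_refresh_ms"])
```

Note that this only stops Ray from killing workers; it does not reduce the memory they use.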
This does not work for me. I set NCCL_DEBUG=INFO and my log is as follows:
2023-07-04 08:55:35,247 INFO utils.py:573 -- Detected RAY_USE_MULTIPROCESSING_CPU_COUNT=1: Using multiprocessing.cpu_count() to detect the number of CPUs. This may be inconsistent when used inside docker. To correctly detect CPUs, unset the env var: RAY_USE_MULTIPROCESSING_CPU_COUNT.
2023-07-04 08:55:35,252 WARNING services.py:1826 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 67108864 bytes available. This will harm performance! You may be able to free up space by deleting files in /dev/shm. If you are inside a Docker container, you can increase /dev/shm size by passing '--shm-size=5.75gb' to 'docker run' (or add it to the run_options list in a Ray cluster config). Make sure to set this to more than 30% of available RAM.
2023-07-04 08:55:35,377 INFO worker.py:1636 -- Started a local Ray instance.
INFO 07-04 08:55:37 llm_engine.py:59] Initializing an LLM engine with config: model='/opt/app/yahma-llama-lora', dtype=torch.float16, use_dummy_weights=False, download_dir=None, use_np_weights=False, tensor_parallel_size=4, seed=0)
/opt/app/yahma-llama-lora
Traceback (most recent call last):
File "
(Worker pid=26149) 2023-07-04 08:55:43,893 ERROR worker.py:861 -- Exception raised in creation task: The actor died because of an error raised in its creation task, ray::Worker.__init__() (pid=26149, ip=10.30.192.153, actor_id=c84d115691be21b19ce79faa01000000, repr=<vllm.worker.worker.Worker object at 0x7f989c09ffd0>)
(Worker pid=26149) File "/opt/app/vllm-0.1.1/vllm/worker/worker.py", line 40, in __init__
(Worker pid=26149) _init_distributed_environment(parallel_config, rank,
(Worker pid=26149) File "/opt/app/vllm-0.1.1/vllm/worker/worker.py", line 302, in _init_distributed_environment
(Worker pid=26149) torch.distributed.all_reduce(torch.zeros(1).cuda())
(Worker pid=26149) File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 1451, in wrapper
(Worker pid=26149) return func(*args, **kwargs)
(Worker pid=26149) File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 1700, in all_reduce
(Worker pid=26149) work = default_pg.allreduce([tensor], opts)
(Worker pid=26149) torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1207, internal error, NCCL version 2.14.3
(Worker pid=26149) ncclInternalError: Internal check failed.
(Worker pid=26149) Last error:
(Worker pid=26149) Bootstrap : no socket interface found
(Worker pid=26150) RuntimeError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Connection reset by peer. This may indicate a possible application crash on rank 0 or a network set up issue.
Hi, we're having the same issue. Has anyone found a solution for this yet?
Same issue here, but I suspect it has nothing to do with Ray.
mark
Mark, I have the same problem.
same problem here
I'm having the same issue.
same here, mark
In my humble opinion, there might be a problem when loading the model checkpoint.
https://github.com/vllm-project/vllm/blob/bbbf86565f2fb2bab0cf6675f9ebefcd449390bd/vllm/model_executor/models/llama.py#L336-L339
This loop needs some CPU memory per GPU device to load a checkpoint file.
In @liulfy's case, a 9.8 GB checkpoint file (pytorch_model-00001-of-00002.bin) is loaded on all workers at the same time.
Indeed, after sharding my model's checkpoint into smaller pieces, it works normally for me.
As far as I know, there is no way to partially load a large checkpoint file at the code level. (Loading a checkpoint file requires as much memory as the size of the file.)
Any ideas on how vLLM can solve this problem?
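To make the reasoning above concrete: if every worker loads the same checkpoint file at once, peak CPU memory is roughly the file size times the number of workers. The sketch below is a rough estimate under that assumption only, with hypothetical helper names. The second function shows one way to re-save a model in smaller shards using the max_shard_size argument of Hugging Face save_pretrained; the 2GB value and the paths are placeholders, and re-saving itself needs enough CPU memory to hold the whole model once.

```python
def peak_load_memory_gb(largest_shard_gb: float, num_workers: int) -> float:
    """Rough upper bound: every worker holds one full shard in CPU memory at once."""
    return largest_shard_gb * num_workers


def reshard_checkpoint(src: str, dst: str, max_shard_size: str = "2GB") -> None:
    """Re-save a Hugging Face model so that no checkpoint file exceeds max_shard_size."""
    # Imported lazily so the estimate above works without these packages installed.
    import torch
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained(src, torch_dtype=torch.float16)
    # Only the weights and config are written here; tokenizer files must be copied separately.
    model.save_pretrained(dst, max_shard_size=max_shard_size)


# Numbers from this thread: a 9.8 GB shard, 4 workers, 32 GB of CPU memory.
estimate = peak_load_memory_gb(9.8, 4)
print(f"estimated peak: {estimate:.1f} GB vs. 32 GB available")
```

With 2 GB shards and the same four workers, the same estimate drops to about 8 GB.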
Same issue here. Has anyone fixed it yet?
same here, mark
I hit the same issue and figured out how to fix it. I've already created PR #1395.
@boydfd It seems this did not fix the issue. It does not happen when loading the model; I get OOM after running for several days.
Maybe you can share more info?
Same issue here. I've found some info that may help:
1. It goes well with --tensor-parallel-size 1, that is, without Ray. The CPU memory usage is static.
2. With --tensor-parallel-size 2, vLLM uses Ray, and as the model serves inference requests, CPU memory increases slowly until OOM.
3. With --enforce-eager along with --tensor-parallel-size 2, CPU memory increases much more slowly (nearly 5x slower), but it still increases until OOM.
4. This memory leak occurs whether or not vLLM runs in a container.
Model: llama-7b, CUDA version: 12.1
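One way to confirm a slow increase like the one in point 2 is to sample the resident memory of the worker processes over time. A minimal, Linux-only sketch that reads VmRSS from /proc follows; the function names are made up for illustration, and the PIDs would come from something like the "Top 10 memory users" table in the Ray error.

```python
import os
import re
from typing import Optional


def parse_vm_rss_kb(status_text: str) -> Optional[int]:
    """Extract the VmRSS value (in kB) from the text of /proc/<pid>/status."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def rss_gib(pid: int) -> Optional[float]:
    """Resident set size of a process in GiB, or None if it cannot be read."""
    try:
        with open(f"/proc/{pid}/status") as f:
            rss_kb = parse_vm_rss_kb(f.read())
    except OSError:
        return None
    return None if rss_kb is None else rss_kb / 1024 ** 2


# Example: sample the current process once. In practice, call rss_gib() for each
# worker PID every few minutes and watch whether the value keeps growing.
print(rss_gib(os.getpid()))
```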
It seems that if you turn down --max-model-len, it will start. For example, start with a command like:
python -m vllm.entrypoints.api_server --model /workspace/model/ --tensor-parallel-size 4 --max-model-len 6000
If anybody runs vLLM on Triton server: Triton will automatically run your LLM instance on every available GPU. So if you have 2 GPUs and you run with --tensor-parallel-size 2, it will create 2 instances and split those 2 instances across the GPUs, which may lead to OOM. Solution: specify which GPU is your "main" GPU in your config.pbtxt:
instance_group [
  {
    count: 1
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]
I wrote the same answer in issue #721. Can you try this?
I had this issue when using a Docker container. I was able to work around it by mounting an empty directory to /tmp/ray. I hope this solution helps someone.
For example,
mkdir ./tmp_local
docker run -v ./tmp_local:/tmp/ray ...
I encountered the same OOM error message, and I guess there is still no other solution.
- model: Llama-2-7b
- cuda version: 12.2
- vllm version: 0.3.0
- multi gpus (8)
I resolved my case with enforce_eager=True, at the cost of slower generation.
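For reference, a sketch of how that setting might be passed through the Python API. The GPU count mirrors the report above, the model name is a placeholder (the report only says "Llama-2-7b"), and the import is deferred so the snippet only touches vLLM when build_engine() is actually called.

```python
engine_kwargs = {
    "model": "meta-llama/Llama-2-7b-hf",  # placeholder; the thread only says "Llama-2-7b"
    "tensor_parallel_size": 8,
    "enforce_eager": True,  # always use eager-mode PyTorch instead of CUDA graphs
}


def build_engine():
    """Create the engine; requires vLLM and the GPUs to be available."""
    from vllm import LLM

    return LLM(**engine_kwargs)


print(engine_kwargs)
```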
Thank you all.