
Out Of Memory

Open ioir123ju opened this issue 3 months ago • 14 comments

System Info

----------Python Info----------
Version       : 3.10.12
Compiler      : GCC 11.4.0
Build         : ('main', 'Jul 29 2024 16:56:48')
Arch          : ('64bit', 'ELF')
------------Pip Info-----------
Version       : 25.2
Directory     : /usr/local/lib/python3.10/dist-packages/pip
vllm          : not found.
sglang        : 0.4.10.post2
ray           : 2.47.1
torch         : 2.7.1
----------verl Info-----------
Version       : 0.5.0.dev
Directory     : /data2/code/verl/verl
Commit Hash   : 2d6c6dbb39bf846d4ebf98c89fc5b4f49c37dd3d
----------Platform Info----------
Platform      : Linux-5.15.0-153-generic-x86_64-with-glibc2.35
system        : Linux
node          : zktitan
release       : 5.15.0-153-generic
version       : #163-Ubuntu SMP Thu Aug 7 16:37:18 UTC 2025
----------Environment----------
CUDA Runtime  : 12.6
CUDA Compiler : Cuda compilation tools, release 12.6, V12.6.20
----------System Info----------
CPU Memory    : 251.54 GB
GPU Count     : 8
GPU 1 Type    : NVIDIA RTX A6000
GPU 1 Memory  : 47.99 GB
GPU 2 Type    : NVIDIA RTX A6000
GPU 2 Memory  : 47.99 GB
GPU 3 Type    : NVIDIA RTX A6000
GPU 3 Memory  : 47.99 GB
GPU 4 Type    : NVIDIA RTX A6000
GPU 4 Memory  : 47.99 GB
GPU 5 Type    : NVIDIA RTX A6000
GPU 5 Memory  : 47.99 GB
GPU 6 Type    : NVIDIA RTX A6000
GPU 6 Memory  : 47.99 GB
GPU 7 Type    : NVIDIA RTX A6000
GPU 7 Memory  : 47.99 GB
GPU 8 Type    : NVIDIA RTX A6000
GPU 8 Memory  : 47.99 GB

Information

  • [ ] The official example scripts
  • [ ] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [ ] My own task or dataset (give details below)

Reproduction

#!/bin/bash
set -x
ulimit -n 65535

python3 -m verl.trainer.main_ppo \
    algorithm.adv_estimator=rloo \
    data.train_files=./data/train.parquet \
    data.val_files=./data/train.parquet \
    data.train_batch_size=8 \
    data.max_prompt_length=7168 \
    data.max_response_length=1024 \
    data.filter_overlong_prompts=False \
    data.truncation='error' \
    data.image_key=images \
    custom_reward_function.path=./recipe/infigui-g1/reward_fn.py \
    custom_reward_function.name=aer_gui_reward_function \
    actor_rollout_ref.model.path=Qwen/Qwen2.5-VL-3B-Instruct \
    actor_rollout_ref.model.enable_activation_offload=True \
    actor_rollout_ref.model.enable_gradient_checkpointing=True \
    actor_rollout_ref.model.use_remove_padding=True \
    actor_rollout_ref.actor.use_dynamic_bsz=False \
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=1 \
    actor_rollout_ref.actor.optim.lr=1e-6 \
    actor_rollout_ref.actor.optim.lr_warmup_steps=0 \
    actor_rollout_ref.actor.ppo_mini_batch_size=8 \
    actor_rollout_ref.actor.clip_ratio_high=0.4 \
    actor_rollout_ref.actor.use_kl_loss=False \
    actor_rollout_ref.actor.kl_loss_coef=0.001 \
    actor_rollout_ref.actor.kl_loss_type=low_var_kl \
    actor_rollout_ref.actor.entropy_coeff=0 \
    actor_rollout_ref.actor.fsdp_config.param_offload=False \
    actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
    actor_rollout_ref.rollout.log_prob_use_dynamic_bsz=False \
    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=2 \
    actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
    actor_rollout_ref.rollout.name=sglang \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.7 \
    actor_rollout_ref.rollout.max_num_batched_tokens=8192 \
    actor_rollout_ref.rollout.enable_chunked_prefill=False \
    actor_rollout_ref.rollout.enforce_eager=False \
    actor_rollout_ref.rollout.free_cache_engine=True \
    actor_rollout_ref.rollout.n=8 \
    actor_rollout_ref.rollout.temperature=1.0 \
    actor_rollout_ref.ref.log_prob_use_dynamic_bsz=False \
    actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=2 \
    actor_rollout_ref.ref.fsdp_config.param_offload=True \
    algorithm.use_kl_in_reward=False \
    trainer.logger=['console','swanlab'] \
    trainer.project_name='infigui-g1' \
    trainer.experiment_name='3b' \
    trainer.n_gpus_per_node=8 \
    trainer.nnodes=1 \
    trainer.save_freq=16 \
    trainer.test_freq=16 \
    trainer.total_epochs=6

Expected behavior

Error:

ray.exceptions.OutOfMemoryError: Task was killed due to the node running low on memory.

Memory on the node (IP: 172.18.32.76, ID: 9ef5b7c727bb9d98fb88a87a58d5fcb9049b7d024402c2476dc1697d) where the task (task ID: ffffffffffffffff391ced84aaa8fb3afcc97cf201000000, name=TaskRunner.__init__, pid=286630, memory used=0.37GB) was running was 250.04GB / 251.54GB (0.99404), which exceeds the memory usage threshold of 0.95. Ray killed this worker (ID: 5717261e0cbc6568e04888a62d2efba6589f026e5c378cd5120cd76a) because it was the most recently scheduled task; to see more information about memory usage on this node, use ray logs raylet.out -ip 172.18.32.76. To see the logs of the worker, use ray logs worker-5717261e0cbc6568e04888a62d2efba6589f026e5c378cd5120cd76a*out -ip 172.18.32.76.

Top 10 memory users:
PID     MEM(GB)  COMMAND
230494  229.84   ray::TaskRunner.run
279156  0.55     python3 -m verl.trainer.main_ppo algorithm.adv_estimator=rloo data.train_files=./data/train.parquet ...
286630  0.37     ray::IDLE
2650    0.36     /root/.vscode-server/bin/6f17636121051a53c88d3e605c491d22af2ba755/node --dns-result-order=ipv4first ...
279795  0.32     /usr/local/lib/python3.10/dist-packages/ray/core/src/ray/gcs/gcs_server --log_dir=/tmp/ray/session_2...
280775  0.09     /usr/bin/python3 -u /usr/local/lib/python3.10/dist-packages/ray/dashboard/agent.py --node-ip-address...
279931  0.07     /usr/bin/python3 /usr/local/lib/python3.10/dist-packages/ray/dashboard/dashboard.py --host=127.0.0.1...
280042  0.07     ray-dashboard-ReportHead-0 (/usr/bin/python3 -c from multiprocessing.spawn import spawn_main; spawn_...
280044  0.07     ray-dashboard-StateHead-0 (/usr/bin/python3 -c from multiprocessing.spawn import spawn_main; spawn_m...
280039  0.07     ray-dashboard-EventHead-0 (/usr/bin/python3 -c from multiprocessing.spawn import spawn_main; spawn_m...

Refer to the documentation on how to address the out of memory issue: https://docs.ray.io/en/latest/ray-core/scheduling/ray-oom-prevention.html. Consider provisioning more memory on this node or reducing task parallelism by requesting more CPUs per task. Set max_restarts and max_task_retries to enable retry when the task crashes due to OOM. To adjust the kill threshold, set the environment variable RAY_memory_usage_threshold when starting Ray. To disable worker killing, set the environment variable RAY_memory_monitor_refresh_ms to zero.

ioir123ju avatar Sep 11 '25 10:09 ioir123ju
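For anyone who just needs to keep Ray from killing the trainer while the underlying memory growth is investigated, the error message itself names the two knobs: RAY_memory_usage_threshold and RAY_memory_monitor_refresh_ms. Below is a minimal sketch, assuming Ray is launched locally from the driver process via ray.init() so that the raylet it starts inherits these variables; on a pre-started cluster they would instead need to be set in the environment of ray start. The 0.98 value is only an example.

import os

# Assumption: 0.98 is an example value; Ray's default kill threshold is 0.95.
os.environ["RAY_memory_usage_threshold"] = "0.98"
# Per the Ray OOM-prevention docs linked in the error, a refresh interval of 0
# disables the memory monitor entirely (no more worker killing).
os.environ["RAY_memory_monitor_refresh_ms"] = "0"

import ray

ray.init()  # the raylet launched here inherits the variables set above

Note that this only silences Ray's memory monitor; the 229.84 GB held by ray::TaskRunner.run in the listing above still has to come down, otherwise the node will eventually hit the OS OOM killer instead.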

I have the same issue.

Wangningsen avatar Sep 12 '25 17:09 Wangningsen

Is there any update on this? Is someone looking into it?

koceja avatar Sep 22 '25 22:09 koceja

It seems like every process involved in FSDP (node count × GPUs per node) loads the full model into CPU memory, so this kind of usage is to be expected.

Can I set low_cpu_mem_usage=True in verl's model loader?

junhyeok-motech avatar Sep 23 '25 02:09 junhyeok-motech
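For context on the low_cpu_mem_usage question: in plain Hugging Face transformers, this flag makes from_pretrained build the model skeleton on the meta device and materialize weights as the checkpoint shards are read, so peak CPU RAM per process stays at roughly one copy of the weights instead of two. The snippet below is a sketch of the flag in isolation, not verl's loader; whether verl's FSDP workers expose it is not confirmed in this thread, and the class name assumes a transformers version that ships Qwen2.5-VL support.

from transformers import Qwen2_5_VLForConditionalGeneration

# Sketch only: plain transformers loading of the model from the reproduction
# script, with low_cpu_mem_usage=True so initialization does not hold a second
# full copy of the weights in CPU RAM.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct",
    low_cpu_mem_usage=True,
)

Even with this, each rank that loads the checkpoint independently still contributes its own copy to host RAM before sharding, which is the multiplication the comment above is pointing at.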

Same issue here as well..

Viagounet avatar Oct 10 '25 09:10 Viagounet

Same issue

zhenghaoxu-gatech avatar Oct 13 '25 03:10 zhenghaoxu-gatech

Same issue

glowwormX avatar Oct 20 '25 08:10 glowwormX

Same issue, has anyone solved it?

EazyGOO avatar Oct 24 '25 22:10 EazyGOO

same issue

danshi777 avatar Oct 30 '25 12:10 danshi777

same issue

Ariya12138 avatar Nov 07 '25 14:11 Ariya12138

same issue

junming-yang avatar Nov 07 '25 14:11 junming-yang

same issue

changmenseng avatar Nov 12 '25 04:11 changmenseng

Same issue. There seems to be a memory leak: training goes well until a sudden huge spike, which doesn't even coincide with saving a checkpoint.

gm-kns avatar Nov 17 '25 00:11 gm-kns

same

Yumaokk avatar Nov 18 '25 11:11 Yumaokk

Same here :( Any updates on this?

arijit-kensho1 avatar Nov 24 '25 21:11 arijit-kensho1