Out Of Memory
System Info
----------Python Info----------
Version : 3.10.12
Compiler : GCC 11.4.0
Build : ('main', 'Jul 29 2024 16:56:48')
Arch : ('64bit', 'ELF')
------------Pip Info-----------
Version : 25.2
Directory : /usr/local/lib/python3.10/dist-packages/pip
vllm : not found.
sglang : 0.4.10.post2
ray : 2.47.1
torch : 2.7.1
----------verl Info-----------
Version : 0.5.0.dev
Directory : /data2/code/verl/verl
Commit Hash : 2d6c6dbb39bf846d4ebf98c89fc5b4f49c37dd3d
----------Platform Info----------
Platform : Linux-5.15.0-153-generic-x86_64-with-glibc2.35
system : Linux
node : zktitan
release : 5.15.0-153-generic
version : #163-Ubuntu SMP Thu Aug 7 16:37:18 UTC 2025
----------Environment----------
CUDA Runtime : 12.6
CUDA Compiler : Cuda compilation tools, release 12.6, V12.6.20
----------System Info----------
CPU Memory : 251.54 GB
GPU Count : 8
GPU 1 Type : NVIDIA RTX A6000
GPU 1 Memory : 47.99 GB
GPU 2 Type : NVIDIA RTX A6000
GPU 2 Memory : 47.99 GB
GPU 3 Type : NVIDIA RTX A6000
GPU 3 Memory : 47.99 GB
GPU 4 Type : NVIDIA RTX A6000
GPU 4 Memory : 47.99 GB
GPU 5 Type : NVIDIA RTX A6000
GPU 5 Memory : 47.99 GB
GPU 6 Type : NVIDIA RTX A6000
GPU 6 Memory : 47.99 GB
GPU 7 Type : NVIDIA RTX A6000
GPU 7 Memory : 47.99 GB
GPU 8 Type : NVIDIA RTX A6000
GPU 8 Memory : 47.99 GB
Information
- [ ] The official example scripts
- [ ] My own modified scripts
Tasks
- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
#!/bin/bash
set -x
ulimit -n 65535

python3 -m verl.trainer.main_ppo \
    algorithm.adv_estimator=rloo \
    data.train_files=./data/train.parquet \
    data.val_files=./data/train.parquet \
    data.train_batch_size=8 \
    data.max_prompt_length=7168 \
    data.max_response_length=1024 \
    data.filter_overlong_prompts=False \
    data.truncation='error' \
    data.image_key=images \
    custom_reward_function.path=./recipe/infigui-g1/reward_fn.py \
    custom_reward_function.name=aer_gui_reward_function \
    actor_rollout_ref.model.path=Qwen/Qwen2.5-VL-3B-Instruct \
    actor_rollout_ref.model.enable_activation_offload=True \
    actor_rollout_ref.model.enable_gradient_checkpointing=True \
    actor_rollout_ref.model.use_remove_padding=True \
    actor_rollout_ref.actor.use_dynamic_bsz=False \
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=1 \
    actor_rollout_ref.actor.optim.lr=1e-6 \
    actor_rollout_ref.actor.optim.lr_warmup_steps=0 \
    actor_rollout_ref.actor.ppo_mini_batch_size=8 \
    actor_rollout_ref.actor.clip_ratio_high=0.4 \
    actor_rollout_ref.actor.use_kl_loss=False \
    actor_rollout_ref.actor.kl_loss_coef=0.001 \
    actor_rollout_ref.actor.kl_loss_type=low_var_kl \
    actor_rollout_ref.actor.entropy_coeff=0 \
    actor_rollout_ref.actor.fsdp_config.param_offload=False \
    actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
    actor_rollout_ref.rollout.log_prob_use_dynamic_bsz=False \
    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=2 \
    actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
    actor_rollout_ref.rollout.name=sglang \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.7 \
    actor_rollout_ref.rollout.max_num_batched_tokens=8192 \
    actor_rollout_ref.rollout.enable_chunked_prefill=False \
    actor_rollout_ref.rollout.enforce_eager=False \
    actor_rollout_ref.rollout.free_cache_engine=True \
    actor_rollout_ref.rollout.n=8 \
    actor_rollout_ref.rollout.temperature=1.0 \
    actor_rollout_ref.ref.log_prob_use_dynamic_bsz=False \
    actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=2 \
    actor_rollout_ref.ref.fsdp_config.param_offload=True \
    algorithm.use_kl_in_reward=False \
    trainer.logger=['console','swanlab'] \
    trainer.project_name='infigui-g1' \
    trainer.experiment_name='3b' \
    trainer.n_gpus_per_node=8 \
    trainer.nnodes=1 \
    trainer.save_freq=16 \
    trainer.test_freq=16 \
    trainer.total_epochs=6
Expected behavior
Training should run to completion without exhausting host memory. Instead, it crashes with the following Ray out-of-memory error:
ray.exceptions.OutOfMemoryError: Task was killed due to the node running low on memory. Memory on the node (IP: 172.18.32.76, ID: 9ef5b7c727bb9d98fb88a87a58d5fcb9049b7d024402c2476dc1697d) where the task (task ID: ffffffffffffffff391ced84aaa8fb3afcc97cf201000000, name=TaskRunner.__init__, pid=286630, memory used=0.37GB) was running was 250.04GB / 251.54GB (0.99404), which exceeds the memory usage threshold of 0.95. Ray killed this worker (ID: 5717261e0cbc6568e04888a62d2efba6589f026e5c378cd5120cd76a) because it was the most recently scheduled task; to see more information about memory usage on this node, use ray logs raylet.out -ip 172.18.32.76. To see the logs of the worker, use ray logs worker-5717261e0cbc6568e04888a62d2efba6589f026e5c378cd5120cd76a*out -ip 172.18.32.76. Top 10 memory users:
PID MEM(GB) COMMAND
230494 229.84 ray::TaskRunner.run
279156 0.55 python3 -m verl.trainer.main_ppo algorithm.adv_estimator=rloo data.train_files=./data/train.parquet ...
286630 0.37 ray::IDLE
2650 0.36 /root/.vscode-server/bin/6f17636121051a53c88d3e605c491d22af2ba755/node --dns-result-order=ipv4first ...
279795 0.32 /usr/local/lib/python3.10/dist-packages/ray/core/src/ray/gcs/gcs_server --log_dir=/tmp/ray/session_2...
280775 0.09 /usr/bin/python3 -u /usr/local/lib/python3.10/dist-packages/ray/dashboard/agent.py --node-ip-address...
279931 0.07 /usr/bin/python3 /usr/local/lib/python3.10/dist-packages/ray/dashboard/dashboard.py --host=127.0.0.1...
280042 0.07 ray-dashboard-ReportHead-0 (/usr/bin/python3 -c from multiprocessing.spawn import spawn_main; spawn_...
280044 0.07 ray-dashboard-StateHead-0 (/usr/bin/python3 -c from multiprocessing.spawn import spawn_main; spawn_m...
280039 0.07 ray-dashboard-EventHead-0 (/usr/bin/python3 -c from multiprocessing.spawn import spawn_main; spawn_m...
Refer to the documentation on how to address the out of memory issue: https://docs.ray.io/en/latest/ray-core/scheduling/ray-oom-prevention.html. Consider provisioning more memory on this node or reducing task parallelism by requesting more CPUs per task. Set max_restarts and max_task_retries to enable retry when the task crashes due to OOM. To adjust the kill threshold, set the environment variable RAY_memory_usage_threshold when starting Ray. To disable worker killing, set the environment variable RAY_memory_monitor_refresh_ms to zero.
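For reference, the message above points at Ray's memory monitor. A minimal sketch of adjusting it from the driver, for a single-node setup where the driver itself starts Ray; the 0.98 threshold is only an illustrative value, and with `ray start` the variables would instead be exported in that shell:

```python
# Sketch: tune Ray's memory monitor as suggested by the error message above.
# These variables are read when the Ray cluster starts, so they must be set
# before ray.init() (or exported in the shell that runs `ray start`).
import os
import ray

os.environ["RAY_memory_usage_threshold"] = "0.98"    # default is 0.95; value here is illustrative
# os.environ["RAY_memory_monitor_refresh_ms"] = "0"  # disables worker killing entirely

ray.init()
```

Note that this only changes when Ray kills workers; it does not shrink the ~230 GB held by the `ray::TaskRunner.run` process in the listing above.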
I have the same issue.
Is there any update on this? Is someone looking into this?
It seems that every process involved in FSDP (node count × GPUs per node) loads the full model into host memory, so this growth is expected.
Can `low_cpu_mem_usage=True` be used in verl's model loader?
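For context, `low_cpu_mem_usage=True` is a standard Hugging Face `from_pretrained` argument: weights are materialized directly from the checkpoint shards instead of on top of a randomly initialized copy, so each process avoids a second full copy in host RAM during loading. A minimal sketch outside verl, with the model path taken from the script above; the dtype and the `AutoModel` class are assumptions (the exact class depends on the transformers version):

```python
# Sketch only (plain Hugging Face, not verl's actual loader).
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct",  # model path from the reproduction script
    torch_dtype=torch.bfloat16,     # assumption; use whatever the trainer uses
    low_cpu_mem_usage=True,         # avoid the extra full-size host copy
)
```

Even with this flag, each of the 8 FSDP ranks on a node would still load its own full copy unless the loader materializes weights only on rank 0 and broadcasts them.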
Same issue here as well...
Same issue
Same issue
same issue, anyone solve it?
same issue
same issue
same issue
same issue
Same issue. There seems to be a memory leak: training goes fine until a sudden huge spike in host memory, and not even at a point where a checkpoint needs to be saved.
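Not a fix, but one way to narrow down when the spike happens is to log host memory once per training step. A minimal sketch assuming `psutil` is available; where to hook it into the trainer loop is left open:

```python
# Sketch: print host (CPU) memory usage so the spike can be tied to a step.
import psutil

def log_host_memory(step: int) -> None:
    vm = psutil.virtual_memory()
    print(f"[step {step}] host memory: {vm.used / 1e9:.1f} GB used "
          f"({vm.percent:.1f}% of {vm.total / 1e9:.1f} GB)")
```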
same
Same here :( Any updates on this?