Unexpected OOM and the process seems to be forked
System Info
2 x V100 32G
Information
- [ ] The official example scripts
- [x] My own modified scripts
Tasks
- [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [x] My own task or dataset (give details below)
Reproduction
My script is as follows:
ENGINE=${1:-vllm}
python3 -m verl.trainer.main_ppo \
algorithm.adv_estimator=grpo \
ray_kwargs.ray_init.num_cpus=8 \
data.train_files=/public/home/dzhang/pyProject/zhzhang/verl/data/gmner_mner/train.parquet \
data.val_files=/public/home/dzhang/pyProject/zhzhang/verl/data/gmner_mner/val.parquet \
data.train_batch_size=1 \
data.max_prompt_length=1024 \
data.max_response_length=2048 \
data.filter_overlong_prompts=True \
data.truncation='right' \
actor_rollout_ref.model.path=/public/home/dzhang/pyProject/zhzhang/Qwen2.5-VL-3B-Instruct \
actor_rollout_ref.actor.optim.lr=3e-6 \
actor_rollout_ref.actor.strategy="fsdp" \
actor_rollout_ref.model.use_remove_padding=True \
actor_rollout_ref.actor.ppo_mini_batch_size=1 \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=1 \
actor_rollout_ref.model.lora_rank=8 \
actor_rollout_ref.model.lora_alpha=16 \
actor_rollout_ref.model.target_modules=all-linear \
actor_rollout_ref.model.exclude_modules='.*visual.*' \
actor_rollout_ref.actor.use_kl_loss=True \
actor_rollout_ref.actor.kl_loss_coef=0.01 \
actor_rollout_ref.actor.kl_loss_type=low_var_kl \
actor_rollout_ref.actor.entropy_coeff=0 \
actor_rollout_ref.model.enable_gradient_checkpointing=True \
actor_rollout_ref.actor.fsdp_config.param_offload=True \
actor_rollout_ref.actor.fsdp_config.optimizer_offload=True \
actor_rollout_ref.actor.fsdp_config.fsdp_size=2 \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=1 \
actor_rollout_ref.rollout.name=$ENGINE \
+actor_rollout_ref.rollout.engine_kwargs.vllm.disable_mm_preprocessor_cache=True \
actor_rollout_ref.rollout.gpu_memory_utilization=0.6 \
actor_rollout_ref.rollout.enable_chunked_prefill=False \
actor_rollout_ref.rollout.enforce_eager=True \
actor_rollout_ref.rollout.free_cache_engine=False \
actor_rollout_ref.rollout.n=2 \
actor_rollout_ref.rollout.pipeline_model_parallel_size=1 \
actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
actor_rollout_ref.rollout.data_parallel_size=1 \
actor_rollout_ref.rollout.dtype=float16 \
actor_rollout_ref.rollout.max_num_batched_tokens=32768 \
actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=1 \
actor_rollout_ref.ref.fsdp_config.param_offload=True \
actor_rollout_ref.ref.fsdp_config.model_dtype=float16 \
actor_rollout_ref.actor.fsdp_config.model_dtype=float16 \
algorithm.use_kl_in_reward=False \
trainer.critic_warmup=0 \
trainer.logger='["console"]' \
trainer.project_name='test' \
trainer.experiment_name='test' \
trainer.n_gpus_per_node=2 \
trainer.nnodes=1 \
trainer.save_freq=10 \
trainer.test_freq=5 \
trainer.total_epochs=5 $@
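For reference, the script takes the rollout engine as an optional first argument (defaulting to vllm) and forwards its command-line arguments to Hydra via $@. Assuming it is saved as run_grpo.sh (a hypothetical filename), a minimal launch is:
bash run_grpo.sh    # ENGINE defaults to vllm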
There seem to be two processes running on the two V100s, and their memory usage is identical, as shown in the output below:
(WorkerDict pid=37110) `torch_dtype` is deprecated! Use `dtype` instead!
Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s]
Loading checkpoint shards: 50%|█████ | 1/2 [00:05<00:05, 5.17s/it]
(WorkerDict pid=37108) The image processor of type `Qwen2VLImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`. Note that this behavior will be extended to all models in a future release.
(WorkerDict pid=37108) `torch_dtype` is deprecated! Use `dtype` instead!
Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s]
(WorkerDict pid=37108) Monkey patch Qwen2_5_VLForConditionalGeneration model forward
(WorkerDict pid=37108) Monkey patch Qwen2_5_VLForConditionalGeneration attention layer
(WorkerDict pid=37108) Monkey patch _flash_attention_forward in transformers.integrations.flash_attention
(WorkerDict pid=37108) Skipping monkey patch for Qwen2_5_VLForConditionalGeneration as use_fused_kernels is False or fused_kernels_backend is torch
Loading checkpoint shards: 100%|██████████| 2/2 [00:09<00:00, 4.66s/it]
(WorkerDict pid=37108) Applying LoRA to actor module
(WorkerDict pid=37108) PeftModelForCausalLM contains 3.76B parameters
(WorkerDict pid=37108) wrap_policy: functools.partial(<function _or_policy at 0x7f11b021bbe0>, policies=[functools.partial(<function lambda_auto_wrap_policy at 0x7f11b021b6d0>, lambda_fn=<function get_fsdp_wrap_policy.<locals>.lambda_policy_fn at 0x7eece842eef0>), functools.partial(<function transformer_auto_wrap_policy at 0x7f11b021bac0>, transformer_layer_cls={<class 'transformers.models.qwen2_5_vl.modeling_qwen2_5_vl.Qwen2_5_VLDecoderLayer'>, <class 'transformers.models.qwen2_5_vl.modeling_qwen2_5_vl.Qwen2_5_VLVisionBlock'>})])
(WorkerDict pid=37108) NCCL version 2.21.5+cuda12.4
(WorkerDict pid=37108) Total steps: 28350, num_warmup_steps: 0
(WorkerDict pid=37110) Monkey patch Qwen2_5_VLForConditionalGeneration model forward
(WorkerDict pid=37110) Monkey patch Qwen2_5_VLForConditionalGeneration attention layer
(WorkerDict pid=37110) Monkey patch _flash_attention_forward in transformers.integrations.flash_attention
(WorkerDict pid=37110) Skipping monkey patch for Qwen2_5_VLForConditionalGeneration as use_fused_kernels is False or fused_kernels_backend is torch
(WorkerDict pid=37110) Applying LoRA to actor module
(WorkerDict pid=37108) Actor use_remove_padding=True
(WorkerDict pid=37108) Actor use_fused_kernels=False
(WorkerDict pid=37108) /public/home/dzhang/pyProject/zhzhang/verl/verl/utils/profiler/config.py:49: UserWarning: Torch profiler tool config is not fully supported now.
(WorkerDict pid=37108) warnings.warn("Torch profiler tool config is not fully supported now.", stacklevel=1)
Loading checkpoint shards: 50%|█████ | 1/2 [00:05<00:05, 5.73s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:09<00:00, 4.94s/it]
Memory usage for cards 0 and 1:
timestamp, index, utilization.gpu [%], utilization.memory [%], memory.used [MiB], memory.total [MiB]
2025/10/21 18:31:30.304, 0, 11 %, 0 %, 6637 MiB, 32768 MiB
2025/10/21 18:31:30.307, 1, 11 %, 0 %, 6637 MiB, 32768 MiB
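The snapshot above matches the CSV layout of an nvidia-smi GPU query; a polling loop along these lines (an assumption, since the exact command used is not shown) reproduces it:
nvidia-smi --query-gpu=timestamp,index,utilization.gpu,utilization.memory,memory.used,memory.total --format=csv -l 1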
During rollout I always hit CUDA Out Of Memory:
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 44.00 MiB. GPU 0 has a total capacity of 31.74 GiB of which 21.38 MiB is free. Including non-PyTorch memory, this process has 31.69 GiB memory in use. Of the allocated memory 31.07 GiB is allocated by PyTorch, and 149.14 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management
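As a side note, the error message itself suggests enabling expandable segments in the CUDA caching allocator. This only mitigates fragmentation rather than adding capacity, so it is a sketch of a partial workaround, not a fix:
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
# relaunch the training script in the same shell afterwards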
I also tried setting TP=2, but it didn't help. Setting pipeline_model_parallel_size to 2 causes a different error.
Expected behavior
I would expect 2 x 32 GB of memory to be enough for training a 3.4B model. How can I run the GRPO task properly?
Setting free_cache_engine=True may help.
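In the launch script above, that corresponds to flipping the existing flag so the rollout engine releases its KV cache between generation phases (a sketch of the single changed override):
actor_rollout_ref.rollout.free_cache_engine=True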
I will give it a try, thanks.
Previously the TP / DP config did not seem to take effect. After updating the verl code to the release/v0.6.1 branch, DP=2 works as expected. The max VRAM used with my new settings is 36518 MiB on 2 x A100, with:
data.train_batch_size=4 \
data.max_prompt_length=1024 \
data.max_response_length=512 \
actor_rollout_ref.model.path=Qwen2.5-VL-3B-Instruct \
actor_rollout_ref.model.lora_rank=8 \
actor_rollout_ref.model.lora_alpha=16 \
actor_rollout_ref.actor.strategy="fsdp2" \
actor_rollout_ref.actor.fsdp_config.param_offload=False \
actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
actor_rollout_ref.actor.ppo_mini_batch_size=4 \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=2 \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=2 \
actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=2 \
actor_rollout_ref.ref.fsdp_config.param_offload=True \
I will try TP=2 later.
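For that TP=2 attempt, the corresponding change would presumably be setting the tensor-parallel option already present in the script above to 2 (an assumption about the value only):
actor_rollout_ref.rollout.tensor_model_parallel_size=2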