
system oom with qwen 235b

Open cmunley1 opened this issue 1 month ago • 9 comments

Describe the bug

CPU memory usage steadily increases until OOM with Qwen3-235B-A22B. The OOM occurs at the end of the chart below. This is customer-reported and we do not have a full reproducer yet, but the RL environment is likely not the culprit.

[Image: chart of host (CPU) memory usage climbing steadily until the OOM]
--config examples/configs/grpo_math_qwen30ba3b_megatron.yaml \
  policy.model_name=Qwen/Qwen3-235B-A22B \
  cluster.gpus_per_node=8 \
  policy.megatron_cfg.tensor_model_parallel_size=4 \
  policy.megatron_cfg.expert_tensor_parallel_size=1 \
  policy.megatron_cfg.pipeline_model_parallel_size=16 \
  policy.megatron_cfg.expert_model_parallel_size=4 \
  policy.megatron_cfg.context_parallel_size=2 \
  policy.megatron_cfg.sequence_parallel=True \
  policy.generation.vllm_cfg.tensor_parallel_size=16 \
  policy.generation.vllm_cfg.pipeline_parallel_size=1 \
  cluster.num_nodes=32 \
  policy.megatron_cfg.num_layers_in_first_pipeline_stage=5 \
  policy.megatron_cfg.num_layers_in_last_pipeline_stage=5 \
  policy.max_total_sequence_length=8192 \
  policy.train_global_batch_size=512 \
  grpo.num_generations_per_prompt=16 \
  grpo.num_prompts_per_step=32 \
  policy.generation.vllm_cfg.enforce_eager=True
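For context, a back-of-the-envelope check of what this parallelism layout implies for the 256-GPU job (an illustrative sketch only; the exact mapping is decided internally by NeMo-RL/Megatron):

# Rough sanity check of the parallelism layout implied by the flags above.
# Illustrative only; not how NeMo-RL computes its layout internally.
gpus = 32 * 8                           # cluster.num_nodes * cluster.gpus_per_node = 256
tp, pp, cp = 4, 16, 2                   # training tensor / pipeline / context parallel sizes
model_parallel_gpus = tp * pp * cp      # 128 GPUs per training model replica
train_dp = gpus // model_parallel_gpus  # 2 data-parallel replicas for training

vllm_tp = 16                            # generation.vllm_cfg.tensor_parallel_size
gen_engines = gpus // vllm_tp           # 16 vLLM engines, if generation spans all GPUs

samples_per_step = 32 * 16              # num_prompts_per_step * num_generations_per_prompt
assert samples_per_step == 512          # matches policy.train_global_batch_size

print(model_parallel_gpus, train_dp, gen_engines, samples_per_step)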

cmunley1 avatar Oct 28 '25 18:10 cmunley1

@ZhiyuLi-Nvidia is this something you can review?

euronymous-aithal avatar Oct 29 '25 06:10 euronymous-aithal

Sure.

ZhiyuLi-Nvidia avatar Oct 29 '25 06:10 ZhiyuLi-Nvidia

@cmunley1 which version or branch were you using? There's a recent fix for a memory leak related to YARN: https://github.com/NVIDIA-NeMo/RL/pull/1163

ZhiyuLi-Nvidia avatar Oct 29 '25 06:10 ZhiyuLi-Nvidia

Shared the branch, and we will test the PR above. Thanks @ZhiyuLi-Nvidia

cmunley1 avatar Oct 29 '25 18:10 cmunley1

@ZhiyuLi-Nvidia I think this is a CPU memory leak, not GPU memory. The leak seems to recur roughly every ~300 steps, which is hard to debug with the limited info we have; hopefully the user can provide more detail, e.g. whether they know what causes the memory usage to jump at every ~300-step interval.

guyueh1 avatar Nov 06 '25 18:11 guyueh1

Thank you @guyueh1

I tried with an updated mcore version and haven't seen any CPU memory leak in my reproduction; I shared that with @cmunley1 on Monday:

I have bumped up the mcore version with a memory leak fix; could you give it a try?

  • branch: https://github.com/bxyu-nvidia/NeMo-RL-private/tree/zhiyul/aviary-rl-bump-up-mcore-w-fix
  • commit: https://github.com/bxyu-nvidia/NeMo-RL-private/commit/14150b18fe9af7f6addc0d8be6282d303a6ed0e5
  • exp link: https://wandb.ai/nvidia/grpo-dev-zhiyul/runs/gxkd9ofz

System memory looks quite flat:

[Image: system (host) memory usage over the run, roughly flat]

@cmunley1 let me know if it is helpful or not.

In a separate chat, there was also an OOM simply because of huge metric variables; @bxyu-nvidia solved it by cleaning up those variables.
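For anyone hitting the same pattern: that kind of leak is usually per-step metrics (sometimes still referencing tensors or large arrays) being accumulated in a container that is never cleared. A minimal, hypothetical sketch of the cleanup, not the actual change @bxyu-nvidia made:

# Hypothetical illustration of the "huge metric variables" pattern and its cleanup;
# the names below are made up and this is not the actual fix.
import gc
import wandb

accumulated_metrics = []  # grows without bound if per-step results are appended and kept

def log_step_metrics(step, metrics):
    # Log scalar values only; avoid retaining references to large tensors/arrays.
    scalars = {k: float(v) for k, v in metrics.items()}
    wandb.log(scalars, step=step)
    # Drop accumulated references so host memory can actually be reclaimed.
    accumulated_metrics.clear()
    gc.collect()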

ZhiyuLi-Nvidia avatar Nov 06 '25 19:11 ZhiyuLi-Nvidia

Thanks both. We are testing this.

cmunley1 avatar Nov 06 '25 19:11 cmunley1

Unable to reproduce this customer issue so far.

[Image: memory usage chart from the reproduction attempt]

cmunley1 avatar Nov 11 '25 16:11 cmunley1

I have tried as well, but the CPU memory increase seems almost negligible:

  • branch: https://github.com/NVIDIA-NeMo/RL/compare/main...zhiyul/oom_repro_w_cpu_profiler
  • change on top of the guide: https://github.com/NVIDIA-NeMo/RL/compare/f67ccd9e9cf7e2c1b30c23b6cb2c305bf1dfff36...zhiyul/oom_repro_w_cpu_profiler

What's new:

  • added a profiler feature to track CPU memory at each step (a minimal sketch of this kind of tracking follows after this list), which I also hope to use in the customer's env
  • some adjustments to ensure compatibility of the code and env setup
    • switched run_time from PY_EXECUTABLES.SYSTEM to PY_EXECUTABLES.ETHER0 so that we can still use the container as well as the uv env setup (instead of the no_container setup in the guide)
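For context, per-step CPU memory tracking of this kind can be as simple as sampling the process RSS and logging it; a minimal sketch assuming psutil and wandb are available, not the profiler actually implemented in the branch above:

# Minimal sketch of per-step host (CPU) memory tracking; not the branch's profiler.
import os
import psutil
import wandb

_proc = psutil.Process(os.getpid())

def log_cpu_memory(step: int):
    rss_mb = _proc.memory_info().rss / 1024**2            # this process's resident memory
    sys_used_mb = psutil.virtual_memory().used / 1024**2  # node-wide used memory
    wandb.log({"cpu/rss_mb": rss_mb, "cpu/system_used_mb": sys_used_mb}, step=step)

# intended to be called once per training step, e.g. right after the policy update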

In my experiment, each step adds on the order of tens of MB:

  • https://wandb.ai/nvidia/grpo-dev-zhiyul/runs/8uotje8i

[Image: per-step CPU memory from the profiler run]

We can improve this further by removing some intermediate generation data (a rough sketch of the idea follows after the links):

  • branch: https://github.com/NVIDIA-NeMo/RL/compare/zhiyul/oom_repro_w_cpu_profiler...zhiyul/oom_repro_w_cpu_profiler_optional_rm_data
  • https://wandb.ai/nvidia/grpo-dev-zhiyul/runs/e6ygl9xc

[Image: per-step CPU memory with intermediate generation data removed]
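The general idea behind that optional branch, as a hedged sketch rather than the actual diff (generate(), build_training_data() and train() below are hypothetical names), is to drop references to large per-step generation buffers once the training step has consumed them, so Python can actually free the host memory:

# Illustrative sketch only; the real change lives in the branch compare linked above.
import gc

def build_training_data(rollouts):
    # Placeholder: in the real pipeline this converts rollouts into training tensors.
    return rollouts

def grpo_step(batch, policy, generator):
    rollouts = generator.generate(batch)        # large host-side buffers (tokens, logprobs, ...)
    train_data = build_training_data(rollouts)  # keep only what the policy update needs
    metrics = policy.train(train_data)

    # Explicitly drop intermediate generation data before the next step so the
    # driver process does not keep accumulating host memory.
    del rollouts, train_data
    gc.collect()
    return metrics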

ZhiyuLi-Nvidia avatar Nov 24 '25 21:11 ZhiyuLi-Nvidia

The latest issue using zhiyul/oom_repro_w_cpu_profiler_optional_rm_data: training ran for 32 steps, then crashed with:

(MegatronPolicyWorker[rank=34] pid=1840232, ip=10.5.33.3) [2025-11-25 15:20:28,463 E 1840232 1840232] logging.cc:118: Unhandled exception: N3c105ErrorE. what(): could not unlink the shared memory file /torch_1840232_807965985_32

We suggested adding RemoveIPC=no to /etc/systemd/logind.conf based on http://gpu-comms-head/nccl/doc/html/troubleshooting.html#systemd (see the config sketch after the env vars below), or trying these env vars:

export NCCL_DEBUG=INFO       # verbose NCCL logging to help localize the failure
export NCCL_SHM_DISABLE=1    # disable NCCL's shared-memory transport
export NCCL_PROTO=simple     # restrict NCCL to the Simple protocol
export NCCL_NVLS_ENABLE=0    # disable NVLink SHARP (NVLS) collectives
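For reference, the RemoveIPC suggestion boils down to the following logind setting (a sketch assuming a standard systemd host; systemd-logind must be restarted afterwards), which stops logind from cleaning up a user's IPC objects and shared-memory segments when it considers the user logged out:

# /etc/systemd/logind.conf  (then: systemctl restart systemd-logind)
[Login]
RemoveIPC=no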

We also suggested running in containers with pyxis/enroot.

The error has not been seen on single-node runs.

This is blocking training runs.

cmunley1 avatar Dec 02 '25 22:12 cmunley1