System OOM with Qwen 235B
Describe the bug
CPU memory usage steadily increases until the node OOMs. Model: Qwen3-235B-A22B. The OOM happens at the end of this chart. Customer-reported; we do not have a full reproducer yet, but the RL environment is likely not the culprit.
Launch command:
--config examples/configs/grpo_math_qwen30ba3b_megatron.yaml policy.model_name=Qwen/Qwen3-235B-A22B cluster.gpus_per_node=8 policy.megatron_cfg.tensor_model_parallel_size=4 policy.megatron_cfg.expert_tensor_parallel_size=1 policy.megatron_cfg.pipeline_model_parallel_size=16 policy.megatron_cfg.expert_model_parallel_size=4 policy.megatron_cfg.context_parallel_size=2 policy.megatron_cfg.sequence_parallel=True policy.generation.vllm_cfg.tensor_parallel_size=16 policy.generation.vllm_cfg.pipeline_parallel_size=1 cluster.num_nodes=32 policy.megatron_cfg.num_layers_in_first_pipeline_stage=5 policy.megatron_cfg.num_layers_in_last_pipeline_stage=5 policy.max_total_sequence_length=8192 policy.train_global_batch_size=512 grpo.num_generations_per_prompt=16 grpo.num_prompts_per_step=32 policy.generation.vllm_cfg.enforce_eager=True
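For reference, a minimal watcher along these lines could confirm the host-memory trend from the node itself (hypothetical helper, not part of the repro; assumes psutil is installed and that the workers show up as Ray processes):

```python
# watch_rss.py -- hypothetical helper, not part of the repro.
# Sums the RSS of all processes whose name/cmdline mentions "ray"
# and prints it once a minute, so the steady climb is visible
# without W&B. Assumes psutil is installed.
import time

import psutil


def total_rss_gib(name_filter: str = "ray") -> float:
    total = 0
    for proc in psutil.process_iter(["name", "cmdline", "memory_info"]):
        try:
            cmdline = " ".join(proc.info["cmdline"] or [])
            if name_filter in (proc.info["name"] or "") or name_filter in cmdline:
                total += proc.info["memory_info"].rss
        except (psutil.NoSuchProcess, psutil.AccessDenied, psutil.ZombieProcess):
            continue
    return total / 1024**3


if __name__ == "__main__":
    while True:
        print(f"{time.strftime('%H:%M:%S')} total ray RSS: {total_rss_gib():.1f} GiB", flush=True)
        time.sleep(60)
```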
@ZhiyuLi-Nvidia is this something you can review?
@cmunley1 which version or branch were you using? There's a recent fix for a memory leak relevant to YARN: https://github.com/NVIDIA-NeMo/RL/pull/1163
Shared the branch, and we will test the PR above. Thanks @ZhiyuLi-Nvidia
@ZhiyuLi-Nvidia I think this error is a CPU memory leak, not GPU memory.
The memory leak seems to happen repeatedly, roughly every ~300 steps. It is hard to debug with this limited info; hopefully the user can provide more information, e.g. whether they know what causes the memory usage increase at every ~300 steps.
Thank you @guyueh1
I tried with the updated mcore version, haven't seen any CPU memory leak in my reproduction, and shared that with @cmunley1 on Monday:
I have bumped the mcore version to pick up a memory-leak fix; could you give it a try?
- branch: https://github.com/bxyu-nvidia/NeMo-RL-private/tree/zhiyul/aviary-rl-bump-up-mcore-w-fix
- commit: https://github.com/bxyu-nvidia/NeMo-RL-private/commit/14150b18fe9af7f6addc0d8be6282d303a6ed0e5
- exp link: https://wandb.ai/nvidia/grpo-dev-zhiyul/runs/gxkd9ofz

System memory looks quite flat in that run.
@cmunley1 let me know if it is helpful or not.
In a separate chat, there was a memory OOM simply because of huge metric variables; @bxyu-nvidia solved it by cleaning up those variables.
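For anyone hitting the same thing, the pattern is roughly the following (illustrative only; names are made up, not @bxyu-nvidia's actual change): log metrics as plain scalars and drop the large per-step containers once they have been logged, instead of accumulating them across the run.

```python
# Illustrative only -- not the actual fix. The idea: log metrics as plain
# scalars and clear the large per-step containers afterwards, instead of
# keeping big tensors/lists alive for the whole run.
def log_and_release_metrics(step: int, metrics: dict, logger) -> None:
    # Convert anything tensor-like to a plain float before it is stored
    # anywhere long-lived (W&B history, in-memory aggregates, ...).
    scalars = {k: float(v) for k, v in metrics.items()}
    logger.log(scalars, step=step)
    # Drop references to the raw per-sample data so the host memory
    # backing it can be freed.
    metrics.clear()
```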
Thanks both. We are testing this.
Unable to reproduce this customer issue so far.
I have tried, but the CPU memory increase is almost negligible:
- branch: https://github.com/NVIDIA-NeMo/RL/compare/main...zhiyul/oom_repro_w_cpu_profiler
- changes on top of the guide: https://github.com/NVIDIA-NeMo/RL/compare/f67ccd9e9cf7e2c1b30c23b6cb2c305bf1dfff36...zhiyul/oom_repro_w_cpu_profiler

What's new:
- added a profiler feature to track CPU memory at each step (sketch below), which I also hope to enable in the customer's env
- some adjustments to ensure compatibility of the code and env setup
- switched the runtime from PY_EXECUTABLES.SYSTEM to PY_EXECUTABLES.ETHER0 so that we can still use the container as well as the uv env setup (instead of the no_container setup in the guide)
My exp: each step shows only a two-digit-MB increase:
- https://wandb.ai/nvidia/grpo-dev-zhiyul/runs/8uotje8i
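For context, the profiler in that branch tracks host memory per step; a minimal sketch of the idea (psutil-based; function and metric names are mine, not the branch's actual code):

```python
# Minimal sketch of per-step CPU memory tracking (illustrative, not the
# branch's actual profiler). Call once per GRPO step and log the result
# alongside the other W&B metrics; in the real setup this would run inside
# each worker process, not just the driver.
import os

import psutil

_last_rss_mib = None


def report_cpu_memory(step: int) -> dict:
    """Return current RSS and the delta since the previous step, in MiB."""
    global _last_rss_mib
    rss_mib = psutil.Process(os.getpid()).memory_info().rss / 1024**2
    delta_mib = 0.0 if _last_rss_mib is None else rss_mib - _last_rss_mib
    _last_rss_mib = rss_mib
    return {
        "profiler/cpu_rss_mib": rss_mib,
        "profiler/cpu_rss_delta_mib": delta_mib,  # ~2-digit MB per step in my run
    }
```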
We can improve this further if we remove some intermediate generation data (sketched below):
- branch: https://github.com/NVIDIA-NeMo/RL/compare/zhiyul/oom_repro_w_cpu_profiler...zhiyul/oom_repro_w_cpu_profiler_optional_rm_data
- https://wandb.ai/nvidia/grpo-dev-zhiyul/runs/e6ygl9xc
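The optional_rm_data change is essentially about dropping references to the large generation intermediates once they have been consumed; a rough sketch of the pattern (hypothetical names, not the branch's actual code):

```python
# Rough sketch of the "remove intermediate generation data" idea
# (hypothetical names; the real change lives in the branch above).
import gc


def grpo_step(batch, generate_fn, train_fn):
    rollouts = generate_fn(batch)            # large: token ids, logprobs, ...
    metrics = train_fn(batch, rollouts)
    # Drop the big intermediates before the next step so the host memory
    # backing them can actually be reclaimed instead of accumulating.
    del rollouts
    gc.collect()
    return metrics
```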
The latest issue when using zhiyul/oom_repro_w_cpu_profiler_optional_rm_data: training ran for 32 steps and then crashed with:
(MegatronPolicyWorker[rank=34] pid=1840232, ip=10.5.33.3) [2025-11-25 15:20:28,463 E 1840232 1840232] logging.cc:118: Unhandled exception: N3c105ErrorE. what(): could not unlink the shared memory file /torch_1840232_807965985_32
We suggested adding RemoveIPC=no to /etc/systemd/logind.conf based on http://gpu-comms-head/nccl/doc/html/troubleshooting.html#systemd, or trying these env vars:
export NCCL_DEBUG=INFO
export NCCL_SHM_DISABLE=1
export NCCL_PROTO=simple
export NCCL_NVLS_ENABLE=0
We also suggested using containers with pyxis/enroot.
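Since the crash is about unlinking a /torch_* shared-memory file, a quick (hypothetical) check for stale segments in /dev/shm between runs can help confirm whether segments are being removed out from under the workers:

```python
# Hypothetical helper: list PyTorch shared-memory segments in /dev/shm.
# Segments disappearing mid-run (e.g. systemd with RemoveIPC=yes) or stale
# segments from killed workers are both consistent with the unlink error.
import os

SHM_DIR = "/dev/shm"

for name in sorted(os.listdir(SHM_DIR)):
    if name.startswith("torch_"):
        path = os.path.join(SHM_DIR, name)
        print(f"{path}\t{os.path.getsize(path) / 1024**2:.1f} MiB")
```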
The error has not been seen on single-node runs.
This is blocking training runs.