System OOM with Qwen 235B
Describe the bug
CPU memory usage steadily increases until the node OOMs. Model: Qwen3-235B-A22B. The OOM happens at the end of this chart. Customer-reported; we do not have a full reproducer yet, but the RL environment is likely not the culprit.
Launch command:
--config examples/configs/grpo_math_qwen30ba3b_megatron.yaml policy.model_name=Qwen/Qwen3-235B-A22B cluster.gpus_per_node=8 policy.megatron_cfg.tensor_model_parallel_size=4 policy.megatron_cfg.expert_tensor_parallel_size=1 policy.megatron_cfg.pipeline_model_parallel_size=16 policy.megatron_cfg.expert_model_parallel_size=4 policy.megatron_cfg.context_parallel_size=2 policy.megatron_cfg.sequence_parallel=True policy.generation.vllm_cfg.tensor_parallel_size=16 policy.generation.vllm_cfg.pipeline_parallel_size=1 cluster.num_nodes=32 policy.megatron_cfg.num_layers_in_first_pipeline_stage=5 policy.megatron_cfg.num_layers_in_last_pipeline_stage=5 policy.max_total_sequence_length=8192 policy.train_global_batch_size=512 grpo.num_generations_per_prompt=16 grpo.num_prompts_per_step=32 policy.generation.vllm_cfg.enforce_eager=True
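For reference, a minimal watcher along these lines could confirm the host-memory trend from the node itself (hypothetical helper, not part of the repro; assumes psutil is installed and that the workers show up as Ray processes):

```python
# watch_rss.py -- hypothetical helper, not part of the repro.
# Sums the RSS of all processes whose name/cmdline mentions "ray"
# and prints it once a minute, so the steady climb is visible
# without W&B. Assumes psutil is installed.
import time

import psutil


def total_rss_gib(name_filter: str = "ray") -> float:
    total = 0
    for proc in psutil.process_iter(["name", "cmdline", "memory_info"]):
        try:
            cmdline = " ".join(proc.info["cmdline"] or [])
            if name_filter in (proc.info["name"] or "") or name_filter in cmdline:
                total += proc.info["memory_info"].rss
        except (psutil.NoSuchProcess, psutil.AccessDenied, psutil.ZombieProcess):
            continue
    return total / 1024**3


if __name__ == "__main__":
    while True:
        print(f"{time.strftime('%H:%M:%S')} total ray RSS: {total_rss_gib():.1f} GiB", flush=True)
        time.sleep(60)
```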
@ZhiyuLi-Nvidia is this something you can review?
@cmunley1 which version or branch were you using? There's a recent fix for a memory leak relevant to YARN: https://github.com/NVIDIA-NeMo/RL/pull/1163
Shared the branch, and we will test the PR above. Thanks @ZhiyuLi-Nvidia
@ZhiyuLi-Nvidia I think this error is a CPU memory leak, not GPU memory.
The memory leak seems to happen repeatedly, roughly every ~300 steps. It is hard to debug with this limited info; hopefully the user can provide more information, e.g. whether they know what causes the memory usage increase at every ~300 steps.
Thank you @guyueh1
I tried with the updated mcore version, haven't seen any CPU memory leak in my reproduction, and shared that with @cmunley1 on Monday:
I have bumped the mcore version to pick up a memory-leak fix; could you give it a try?
- branch: https://github.com/bxyu-nvidia/NeMo-RL-private/tree/zhiyul/aviary-rl-bump-up-mcore-w-fix
- commit: https://github.com/bxyu-nvidia/NeMo-RL-private/commit/14150b18fe9af7f6addc0d8be6282d303a6ed0e5
- exp link: https://wandb.ai/nvidia/grpo-dev-zhiyul/runs/gxkd9ofz

System memory looks quite flat in that run.
@cmunley1 let me know if it is helpful or not.
In a separate chat, there was a memory OOM simply because of huge metric variables; @bxyu-nvidia solved it by cleaning up those variables.
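For anyone hitting the same thing, the pattern is roughly the following (illustrative only; names are made up, not @bxyu-nvidia's actual change): log metrics as plain scalars and drop the large per-step containers once they have been logged, instead of accumulating them across the run.

```python
# Illustrative only -- not the actual fix. The idea: log metrics as plain
# scalars and clear the large per-step containers afterwards, instead of
# keeping big tensors/lists alive for the whole run.
def log_and_release_metrics(step: int, metrics: dict, logger) -> None:
    # Convert anything tensor-like to a plain float before it is stored
    # anywhere long-lived (W&B history, in-memory aggregates, ...).
    scalars = {k: float(v) for k, v in metrics.items()}
    logger.log(scalars, step=step)
    # Drop references to the raw per-sample data so the host memory
    # backing it can be freed.
    metrics.clear()
```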
Thanks both. We are testing this.
Unable to reproduce this customer issue so far.
I have tried, but the CPU memory increase is almost negligible:
- branch: https://github.com/NVIDIA-NeMo/RL/compare/main...zhiyul/oom_repro_w_cpu_profiler
- changes on top of the guide: https://github.com/NVIDIA-NeMo/RL/compare/f67ccd9e9cf7e2c1b30c23b6cb2c305bf1dfff36...zhiyul/oom_repro_w_cpu_profiler

What's new:
- added a profiler feature to track CPU memory at each step (sketch below), which I also hope to enable in the customer's env
- some adjustments to ensure compatibility of the code and env setup
- switched the runtime from PY_EXECUTABLES.SYSTEM to PY_EXECUTABLES.ETHER0 so that we can still use the container as well as the uv env setup (instead of the no_container setup in the guide)
My exp: each step shows only a two-digit-MB increase:
- https://wandb.ai/nvidia/grpo-dev-zhiyul/runs/8uotje8i
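For context, the profiler in that branch tracks host memory per step; a minimal sketch of the idea (psutil-based; function and metric names are mine, not the branch's actual code):

```python
# Minimal sketch of per-step CPU memory tracking (illustrative, not the
# branch's actual profiler). Call once per GRPO step and log the result
# alongside the other W&B metrics; in the real setup this would run inside
# each worker process, not just the driver.
import os

import psutil

_last_rss_mib = None


def report_cpu_memory(step: int) -> dict:
    """Return current RSS and the delta since the previous step, in MiB."""
    global _last_rss_mib
    rss_mib = psutil.Process(os.getpid()).memory_info().rss / 1024**2
    delta_mib = 0.0 if _last_rss_mib is None else rss_mib - _last_rss_mib
    _last_rss_mib = rss_mib
    return {
        "profiler/cpu_rss_mib": rss_mib,
        "profiler/cpu_rss_delta_mib": delta_mib,  # ~2-digit MB per step in my run
    }
```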
We can improve this further if we remove some intermediate generation data (sketched below):
- branch: https://github.com/NVIDIA-NeMo/RL/compare/zhiyul/oom_repro_w_cpu_profiler...zhiyul/oom_repro_w_cpu_profiler_optional_rm_data
- https://wandb.ai/nvidia/grpo-dev-zhiyul/runs/e6ygl9xc
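The optional_rm_data change is essentially about dropping references to the large generation intermediates once they have been consumed; a rough sketch of the pattern (hypothetical names, not the branch's actual code):

```python
# Rough sketch of the "remove intermediate generation data" idea
# (hypothetical names; the real change lives in the branch above).
import gc


def grpo_step(batch, generate_fn, train_fn):
    rollouts = generate_fn(batch)            # large: token ids, logprobs, ...
    metrics = train_fn(batch, rollouts)
    # Drop the big intermediates before the next step so the host memory
    # backing them can actually be reclaimed instead of accumulating.
    del rollouts
    gc.collect()
    return metrics
```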
The latest issue when using zhiyul/oom_repro_w_cpu_profiler_optional_rm_data: training ran for 32 steps and then crashed with:
(MegatronPolicyWorker[rank=34] pid=1840232, ip=10.5.33.3) [2025-11-25 15:20:28,463 E 1840232 1840232] logging.cc:118: Unhandled exception: N3c105ErrorE. what(): could not unlink the shared memory file /torch_1840232_807965985_32
We suggested adding RemoveIPC=no to /etc/systemd/logind.conf based on http://gpu-comms-head/nccl/doc/html/troubleshooting.html#systemd, or trying these env vars:
export NCCL_DEBUG=INFO
export NCCL_SHM_DISABLE=1
export NCCL_PROTO=simple
export NCCL_NVLS_ENABLE=0
We also suggested using containers with pyxis/enroot.
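Since the crash is about unlinking a /torch_* shared-memory file, a quick (hypothetical) check for stale segments in /dev/shm between runs can help confirm whether segments are being removed out from under the workers:

```python
# Hypothetical helper: list PyTorch shared-memory segments in /dev/shm.
# Segments disappearing mid-run (e.g. systemd with RemoveIPC=yes) or stale
# segments from killed workers are both consistent with the unlink error.
import os

SHM_DIR = "/dev/shm"

for name in sorted(os.listdir(SHM_DIR)):
    if name.startswith("torch_"):
        path = os.path.join(SHM_DIR, name)
        print(f"{path}\t{os.path.getsize(path) / 1024**2:.1f} MiB")
```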
The error has not been seen on single-node runs.
This is blocking training runs.