Resume training from checkpoint fails with "Weights only load failed" error when optimizer_cpu_offload is enabled
Describe the bug
I'm using the nemo-rl 11.11 image with some local code changes that I believe are not relevant to this issue. When optimizer_cpu_offload is set to true, resuming training from a checkpoint fails with the following error:
(MegatronPolicyWorker pid=2185076) INFO:megatron.core.optimizer:Setting up optimizer with config OptimizerConfig(optimizer='adam', lr=1e-06, min_lr=1e-06, decoupled_lr=None, decoupled_min_lr=None, weight_decay=0.1, fp8_recipe=None, fp16=False, bf16=True, reuse_grad_buf_for_mxfp8_param_ag=False, params_dtype='float32', use_precision_aware_optimizer=True, store_param_remainders=True, main_grads_dtype=torch.float32, main_params_dtype=torch.float32, exp_avg_dtype=torch.float32, exp_avg_sq_dtype=torch.float32, loss_scale=None, initial_loss_scale=4294967296, min_loss_scale=1.0, loss_scale_window=1000, hysteresis=2, adam_beta1=0.9, adam_beta2=0.999, adam_eps=1e-08, decoupled_weight_decay=True, sgd_momentum=0.9, muon_momentum=0.95, muon_split_qkv=True, muon_use_nesterov=False, muon_scale_mode='spectral', muon_fp32_matmul_prec='medium', muon_num_ns_steps=5, muon_tp_mode='blockwise', muon_extra_scale_factor=1.0, use_distributed_optimizer=True, overlap_param_gather=False, overlap_param_gather_with_optimizer_step=False, optimizer_cpu_offload=True, optimizer_offload_fraction=1, use_torch_optimizer_for_cpu_offload=False, overlap_cpu_optimizer_d2h_h2d=False, pin_cpu_grads=True, pin_cpu_params=True, clip_grad=1.0, log_num_zeros_in_grad=False, barrier_with_L1_time=False, timers=None, config_logger_dir='')
Traceback (most recent call last):
File "/lustre/fs1/portfolios/coreai/projects/coreai_dlalgo_nemorl/users/shuangy/src/nemo-rl-rebase/NeMo-RL/./examples/run_grpo_math.py", line 257, in <module>
main()
File "/lustre/fs1/portfolios/coreai/projects/coreai_dlalgo_nemorl/users/shuangy/src/nemo-rl-rebase/NeMo-RL/./examples/run_grpo_math.py", line 189, in main
) = setup(config, tokenizer, dataset, val_dataset)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/lustre/fs1/portfolios/coreai/projects/coreai_dlalgo_nemorl/users/shuangy/src/nemo-rl-rebase/NeMo-RL/nemo_rl/algorithms/grpo.py", line 574, in setup
policy.print_node_ip_and_gpu_id()
File "/lustre/fs1/portfolios/coreai/projects/coreai_dlalgo_nemorl/users/shuangy/src/nemo-rl-rebase/NeMo-RL/nemo_rl/models/policy/lm_policy.py", line 861, in print_node_ip_and_gpu_id
results = ray.get(
^^^^^^^^
File "/opt/nemo_rl_venv/lib/python3.12/site-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/opt/nemo_rl_venv/lib/python3.12/site-packages/ray/_private/client_mode_hook.py", line 104, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/opt/nemo_rl_venv/lib/python3.12/site-packages/ray/_private/worker.py", line 2882, in get
values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/nemo_rl_venv/lib/python3.12/site-packages/ray/_private/worker.py", line 970, in get_objects
raise value
ray.exceptions.ActorDiedError: The actor died because of an error raised in its creation task, ray::lm_policy-0-0:MegatronPolicyWorker.__init__() (pid=2185076, ip=10.65.12.143, actor_id=c44e9a970207f424ebcfae8901000000, repr=MegatronPolicyWorker[rank=0])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/lustre/fs1/portfolios/coreai/projects/coreai_dlalgo_nemorl/users/shuangy/src/nemo-rl-rebase/NeMo-RL/nemo_rl/models/policy/megatron_policy_worker.py", line 749, in __init__
) = setup_megatron_model(
^^^^^^^^^^^^^^^^^^^^^
File "/lustre/fs1/portfolios/coreai/projects/coreai_dlalgo_nemorl/users/shuangy/src/nemo-rl-rebase/NeMo-RL/nemo_rl/models/policy/megatron_policy_worker.py", line 319, in setup_megatron_model
load_checkpoint(
File "/opt/nemo-rl/3rdparty/Megatron-Bridge-workspace/Megatron-Bridge/src/megatron/bridge/training/checkpointing.py", line 1234, in load_checkpoint
return _load_checkpoint_from_path(
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/nemo-rl/3rdparty/Megatron-Bridge-workspace/Megatron-Bridge/src/megatron/bridge/training/checkpointing.py", line 1291, in _load_checkpoint_from_path
state_dict, checkpoint_name, release, ckpt_type = _load_base_checkpoint(
^^^^^^^^^^^^^^^^^^^^^^
File "/opt/nemo-rl/3rdparty/Megatron-Bridge-workspace/Megatron-Bridge/src/megatron/bridge/training/checkpointing.py", line 2029, in _load_base_checkpoint
return _load_global_dist_base_checkpoint(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/nemo-rl/3rdparty/Megatron-Bridge-workspace/Megatron-Bridge/src/megatron/bridge/training/checkpointing.py", line 1926, in _load_global_dist_base_checkpoint
state_dict = dist_checkpointing.load_common_state_dict(checkpoint_name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/lustre/fs1/portfolios/coreai/projects/coreai_dlalgo_nemorl/users/shuangy/src/nemo-rl-rebase/NeMo-RL/3rdparty/Megatron-LM-workspace/Megatron-LM/megatron/core/dist_checkpointing/serialization.py", line 189, in load_common_state_dict
return common_strategy.load_common(checkpoint_dir)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/lustre/fs1/portfolios/coreai/projects/coreai_dlalgo_nemorl/users/shuangy/src/nemo-rl-rebase/NeMo-RL/3rdparty/Megatron-LM-workspace/Megatron-LM/megatron/core/dist_checkpointing/strategies/common.py", line 89, in load_common
return torch.load(load_path, map_location='cpu')
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/ray_venvs/nemo_rl.models.policy.megatron_policy_worker.MegatronPolicyWorker/lib/python3.12/site-packages/torch/serialization.py", line 1529, in load
raise pickle.UnpicklingError(_get_wo_message(str(e))) from None
_pickle.UnpicklingError: Weights only load failed. This file can still be loaded, to do so you have two options, do those steps only if you trust the source of the checkpoint.
(1) In PyTorch 2.6, we changed the default value of the `weights_only` argument in `torch.load` from `False` to `True`. Re-running `torch.load` with `weights_only` set to `False` will likely succeed, but it can result in arbitrary code execution. Do it only if you got the file from a trusted source.
(2) Alternatively, to load with `weights_only=True` please check the recommended steps in the following error message.
WeightsUnpickler error: Unsupported global: GLOBAL torch.optim.adamw.AdamW was not an allowed global by default. Please use `torch.serialization.add_safe_globals([torch.optim.adamw.AdamW])` or the `torch.serialization.safe_globals([torch.optim.adamw.AdamW])` context manager to allowlist this global if you trust this class/function.
Log attached here:
Could someone take a look and advise on the root cause or a fix? Thanks!
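For what it's worth, the PyTorch error message itself points at allowlisting the class. A minimal sketch of that workaround, assuming the checkpoint is trusted; it would need to run inside the worker process before load_checkpoint, and the pickled optimizer state may reference further globals beyond the one named in the error:

```python
# Workaround sketch (trusted checkpoints only): allowlist the AdamW class
# named in the error so torch.load(weights_only=True) will accept it.
import torch.serialization
from torch.optim.adamw import AdamW

torch.serialization.add_safe_globals([AdamW])

# Or scope the allowlist to a single load:
# with torch.serialization.safe_globals([AdamW]):
#     state = torch.load(path, map_location="cpu")
```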
Steps/Code to reproduce bug
1. Launch GRPO training with examples/run_grpo_math.py on the Megatron policy backend, with optimizer_cpu_offload set to true (and optimizer_offload_fraction: 1, as in the config dump above) and checkpointing enabled.
2. Let training save at least one checkpoint.
3. Relaunch the same job so that it resumes from the saved checkpoint.
4. MegatronPolicyWorker initialization fails inside load_checkpoint with the "Weights only load failed" error shown above.
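A narrower reproduction of just the failing call, assuming access to a checkpoint directory that was saved with optimizer_cpu_offload enabled (the path below is a placeholder):

```python
# Sketch: trigger the failing common-state-dict load outside of training.
from megatron.core import dist_checkpointing

ckpt_dir = "/path/to/saved/checkpoint"  # placeholder

# On PyTorch >= 2.6 this raises the same "Weights only load failed"
# UnpicklingError, because it ends up calling torch.load() with the new
# weights_only=True default.
state_dict = dist_checkpointing.load_common_state_dict(ckpt_dir)
```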
Expected behavior
Resuming training from a saved checkpoint should succeed when optimizer_cpu_offload is set to true, instead of failing during MegatronPolicyWorker initialization.
Additional context
The failing call is torch.load(load_path, map_location='cpu') in Megatron-LM's dist_checkpointing common load strategy (see the traceback above). PyTorch 2.6 changed the default of weights_only in torch.load from False to True, and the load then rejects the torch.optim.adamw.AdamW object that apparently ends up pickled into the checkpoint's common state dict when optimizer_cpu_offload is enabled. Note that use_torch_optimizer_for_cpu_offload is False in the config dump above.
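To confirm what is actually pickled into the common state dict, a small inspection sketch (assuming the Megatron dist-checkpointing layout with a common.pt file and a trusted checkpoint; the path is a placeholder):

```python
# Sketch: print the types stored in the checkpoint's common state dict, to
# verify that an AdamW instance is what trips the weights_only unpickler.
import torch

common_path = "/path/to/saved/checkpoint/common.pt"  # placeholder
state = torch.load(common_path, map_location="cpu", weights_only=False)

for key, value in state.items():
    print(key, type(value))
```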