Resume training from checkpoint fails with "Weights only load failed" error when optimizer_cpu_offload is enabled
Describe the bug
I'm using the nemo-rl 11.11 image with some local code changes that I believe are not relevant to this issue. When optimizer_cpu_offload is set to true, resuming training from a checkpoint fails with the following error:
(MegatronPolicyWorker pid=2185076) INFO:megatron.core.optimizer:Setting up optimizer with config OptimizerConfig(optimizer='adam', lr=1e-06, min_lr=1e-06, decoupled_lr=None, decoupled_min_lr=None, weight_decay=0.1, fp8_recipe=None, fp16=False, bf16=True, reuse_grad_buf_for_mxfp8_param_ag=False, params_dtype='float32', use_precision_aware_optimizer=True, store_param_remainders=True, main_grads_dtype=torch.float32, main_params_dtype=torch.float32, exp_avg_dtype=torch.float32, exp_avg_sq_dtype=torch.float32, loss_scale=None, initial_loss_scale=4294967296, min_loss_scale=1.0, loss_scale_window=1000, hysteresis=2, adam_beta1=0.9, adam_beta2=0.999, adam_eps=1e-08, decoupled_weight_decay=True, sgd_momentum=0.9, muon_momentum=0.95, muon_split_qkv=True, muon_use_nesterov=False, muon_scale_mode='spectral', muon_fp32_matmul_prec='medium', muon_num_ns_steps=5, muon_tp_mode='blockwise', muon_extra_scale_factor=1.0, use_distributed_optimizer=True, overlap_param_gather=False, overlap_param_gather_with_optimizer_step=False, optimizer_cpu_offload=True, optimizer_offload_fraction=1, use_torch_optimizer_for_cpu_offload=False, overlap_cpu_optimizer_d2h_h2d=False, pin_cpu_grads=True, pin_cpu_params=True, clip_grad=1.0, log_num_zeros_in_grad=False, barrier_with_L1_time=False, timers=None, config_logger_dir='')
Traceback (most recent call last):
File "/lustre/fs1/portfolios/coreai/projects/coreai_dlalgo_nemorl/users/shuangy/src/nemo-rl-rebase/NeMo-RL/./examples/run_grpo_math.py", line 257, in <module>
main()
File "/lustre/fs1/portfolios/coreai/projects/coreai_dlalgo_nemorl/users/shuangy/src/nemo-rl-rebase/NeMo-RL/./examples/run_grpo_math.py", line 189, in main
) = setup(config, tokenizer, dataset, val_dataset)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/lustre/fs1/portfolios/coreai/projects/coreai_dlalgo_nemorl/users/shuangy/src/nemo-rl-rebase/NeMo-RL/nemo_rl/algorithms/grpo.py", line 574, in setup
policy.print_node_ip_and_gpu_id()
File "/lustre/fs1/portfolios/coreai/projects/coreai_dlalgo_nemorl/users/shuangy/src/nemo-rl-rebase/NeMo-RL/nemo_rl/models/policy/lm_policy.py", line 861, in print_node_ip_and_gpu_id
results = ray.get(
^^^^^^^^
File "/opt/nemo_rl_venv/lib/python3.12/site-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/opt/nemo_rl_venv/lib/python3.12/site-packages/ray/_private/client_mode_hook.py", line 104, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/opt/nemo_rl_venv/lib/python3.12/site-packages/ray/_private/worker.py", line 2882, in get
values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/nemo_rl_venv/lib/python3.12/site-packages/ray/_private/worker.py", line 970, in get_objects
raise value
ray.exceptions.ActorDiedError: The actor died because of an error raised in its creation task, ray::lm_policy-0-0:MegatronPolicyWorker.__init__() (pid=2185076, ip=10.65.12.143, actor_id=c44e9a970207f424ebcfae8901000000, repr=MegatronPolicyWorker[rank=0])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/lustre/fs1/portfolios/coreai/projects/coreai_dlalgo_nemorl/users/shuangy/src/nemo-rl-rebase/NeMo-RL/nemo_rl/models/policy/megatron_policy_worker.py", line 749, in __init__
) = setup_megatron_model(
^^^^^^^^^^^^^^^^^^^^^
File "/lustre/fs1/portfolios/coreai/projects/coreai_dlalgo_nemorl/users/shuangy/src/nemo-rl-rebase/NeMo-RL/nemo_rl/models/policy/megatron_policy_worker.py", line 319, in setup_megatron_model
load_checkpoint(
File "/opt/nemo-rl/3rdparty/Megatron-Bridge-workspace/Megatron-Bridge/src/megatron/bridge/training/checkpointing.py", line 1234, in load_checkpoint
return _load_checkpoint_from_path(
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/nemo-rl/3rdparty/Megatron-Bridge-workspace/Megatron-Bridge/src/megatron/bridge/training/checkpointing.py", line 1291, in _load_checkpoint_from_path
state_dict, checkpoint_name, release, ckpt_type = _load_base_checkpoint(
^^^^^^^^^^^^^^^^^^^^^^
File "/opt/nemo-rl/3rdparty/Megatron-Bridge-workspace/Megatron-Bridge/src/megatron/bridge/training/checkpointing.py", line 2029, in _load_base_checkpoint
return _load_global_dist_base_checkpoint(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/nemo-rl/3rdparty/Megatron-Bridge-workspace/Megatron-Bridge/src/megatron/bridge/training/checkpointing.py", line 1926, in _load_global_dist_base_checkpoint
state_dict = dist_checkpointing.load_common_state_dict(checkpoint_name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/lustre/fs1/portfolios/coreai/projects/coreai_dlalgo_nemorl/users/shuangy/src/nemo-rl-rebase/NeMo-RL/3rdparty/Megatron-LM-workspace/Megatron-LM/megatron/core/dist_checkpointing/serialization.py", line 189, in load_common_state_dict
return common_strategy.load_common(checkpoint_dir)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/lustre/fs1/portfolios/coreai/projects/coreai_dlalgo_nemorl/users/shuangy/src/nemo-rl-rebase/NeMo-RL/3rdparty/Megatron-LM-workspace/Megatron-LM/megatron/core/dist_checkpointing/strategies/common.py", line 89, in load_common
return torch.load(load_path, map_location='cpu')
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/ray_venvs/nemo_rl.models.policy.megatron_policy_worker.MegatronPolicyWorker/lib/python3.12/site-packages/torch/serialization.py", line 1529, in load
raise pickle.UnpicklingError(_get_wo_message(str(e))) from None
_pickle.UnpicklingError: Weights only load failed. This file can still be loaded, to do so you have two options, do those steps only if you trust the source of the checkpoint.
(1) In PyTorch 2.6, we changed the default value of the `weights_only` argument in `torch.load` from `False` to `True`. Re-running `torch.load` with `weights_only` set to `False` will likely succeed, but it can result in arbitrary code execution. Do it only if you got the file from a trusted source.
(2) Alternatively, to load with `weights_only=True` please check the recommended steps in the following error message.
WeightsUnpickler error: Unsupported global: GLOBAL torch.optim.adamw.AdamW was not an allowed global by default. Please use `torch.serialization.add_safe_globals([torch.optim.adamw.AdamW])` or the `torch.serialization.safe_globals([torch.optim.adamw.AdamW])` context manager to allowlist this global if you trust this class/function.
Log attached here:
Could someone take a look and advise on the root cause or a fix? Thanks!
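For what it's worth, the PyTorch error message itself points at allowlisting the class. A minimal sketch of that workaround, assuming the checkpoint is trusted; it would need to run inside the worker process before load_checkpoint, and the pickled optimizer state may reference further globals beyond the one named in the error:

```python
# Workaround sketch (trusted checkpoints only): allowlist the AdamW class
# named in the error so torch.load(weights_only=True) will accept it.
import torch.serialization
from torch.optim.adamw import AdamW

torch.serialization.add_safe_globals([AdamW])

# Or scope the allowlist to a single load:
# with torch.serialization.safe_globals([AdamW]):
#     state = torch.load(path, map_location="cpu")
```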
Steps/Code to reproduce bug
1. Launch GRPO training with examples/run_grpo_math.py on the Megatron policy backend, with optimizer_cpu_offload set to true (and optimizer_offload_fraction: 1, as in the config dump above) and checkpointing enabled.
2. Let training save at least one checkpoint.
3. Relaunch the same job so that it resumes from the saved checkpoint.
4. MegatronPolicyWorker initialization fails inside load_checkpoint with the "Weights only load failed" error shown above.
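A narrower reproduction of just the failing call, assuming access to a checkpoint directory that was saved with optimizer_cpu_offload enabled (the path below is a placeholder):

```python
# Sketch: trigger the failing common-state-dict load outside of training.
from megatron.core import dist_checkpointing

ckpt_dir = "/path/to/saved/checkpoint"  # placeholder

# On PyTorch >= 2.6 this raises the same "Weights only load failed"
# UnpicklingError, because it ends up calling torch.load() with the new
# weights_only=True default.
state_dict = dist_checkpointing.load_common_state_dict(ckpt_dir)
```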
Expected behavior
Resuming training from a saved checkpoint should succeed when optimizer_cpu_offload is set to true, instead of failing during MegatronPolicyWorker initialization.
Additional context
The failing call is torch.load(load_path, map_location='cpu') in Megatron-LM's dist_checkpointing common load strategy (see the traceback above). PyTorch 2.6 changed the default of weights_only in torch.load from False to True, and the load then rejects the torch.optim.adamw.AdamW object that apparently ends up pickled into the checkpoint's common state dict when optimizer_cpu_offload is enabled. Note that use_torch_optimizer_for_cpu_offload is False in the config dump above.
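To confirm what is actually pickled into the common state dict, a small inspection sketch (assuming the Megatron dist-checkpointing layout with a common.pt file and a trusted checkpoint; the path is a placeholder):

```python
# Sketch: print the types stored in the checkpoint's common state dict, to
# verify that an AdamW instance is what trips the weights_only unpickler.
import torch

common_path = "/path/to/saved/checkpoint/common.pt"  # placeholder
state = torch.load(common_path, map_location="cpu", weights_only=False)

for key, value in state.items():
    print(key, type(value))
```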