
Qwen3-vl-30B-A3B: error when loading a checkpoint with the Megatron backend

Open · oswen opened this issue 2 months ago • 1 comment

System Info

----------Python Info----------
Version       : 3.12.3
Compiler      : GCC 13.3.0
Build         : ('main', 'Feb 4 2025 14:48:35')
Arch          : ('64bit', 'ELF')
------------Pip Info-----------
Version       : 25.2
Directory     : /usr/local/lib/python3.12/dist-packages/pip
vllm          : 0.11.0
sglang        : not found.
ray           : 2.49.2
torch         : 2.8.0
----------verl Info-----------
Version       : 0.7.0.dev
Directory     : /apdcephfs_303690327/share_303690327/wangrui/code/verl_1104/verl/verl
Commit Hash   : b49178f0f3ac6143680ecc6ca1184ad30aa85907
----------Platform Info----------
Platform      : Linux-5.4.241-1-tlinux4-0017.7-x86_64-with-glibc2.39
system        : Linux
node          : TENCENT64.site
release       : 5.4.241-1-tlinux4-0017.7
version       : #1 SMP Thu Jan 18 11:33:00 CST 2024
----------Environment----------
CUDA Runtime  : 12.8
CUDA compiler : Not found: [Errno 2] No such file or directory: 'nvcc'
----------System Info----------
CPU Memory    : 2265.25 GB
GPU Count     : 8
GPU 1-8 Type  : NVIDIA H20
GPU 1-8 Memory: 95.58 GB each

Information

  • [ ] The official example scripts
  • [x] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [x] My own task or dataset (give details below)

Reproduction

An error is raised while loading the model checkpoint:

(WorkerDict pid=162954, ip=29.160.161.21) INFO:2025-11-06 11:27:24,287:[Rank 9] Loaded HF model checkpoint from /apdcephfs_303690327/share_303690327/wangrui/outputs/rl/1103/output1/global_step_540/actor/huggingface with bridge
(WorkerDict pid=162954, ip=29.160.161.21) INFO:2025-11-06 11:27:31,707:[Rank 9] Loaded optimizer checkpoint from /apdcephfs_303690327/share_303690327/wangrui/outputs/rl/1103/output1/global_step_540/actor
(WorkerDict pid=162954, ip=29.160.161.21) INFO:2025-11-06 11:27:31,707:[Rank 9] Loaded RNG states from /apdcephfs_303690327/share_303690327/wangrui/outputs/rl/1103/output1/global_step_540/actor
(WorkerDict pid=162953, ip=29.160.161.21) INFO:2025-11-06 11:27:27,758:[Rank 8] Loaded HF model checkpoint from /apdcephfs_303690327/share_303690327/wangrui/outputs/rl/1103/output1/global_step_540/actor/huggingface with bridge [repeated 7x across cluster]

Error executing job with overrides:
['algorithm.adv_estimator=grpo',
 'data.train_files=/apdcephfs_303690327/share_303690327/wangrui/outputs/rl/1103/data/train.parquet',
 'data.val_files=/apdcephfs_303690327/share_303690327/wangrui/outputs/rl/1103/data/test.parquet',
 'data.train_batch_size=64',
 'data.max_prompt_length=4096',
 'data.max_response_length=2048',
 'data.shuffle=True',
 'data.filter_overlong_prompts=True',
 'data.truncation=error',
 'data.image_key=images',
 'custom_reward_function.path=./examples/reward_fns/gui_reward_mixed_1103.py',
 'custom_reward_function.name=gui_reward_fn',
 'actor_rollout_ref.model.path=/apdcephfs_303690327/share_303690327/private_jasperrwang/outputs/sft/1030_30B/ckpt/checkpoint-500',
 'actor_rollout_ref.actor.optim.lr=1e-6',
 'actor_rollout_ref.actor.ppo_mini_batch_size=16',
 'actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=2',
 'actor_rollout_ref.actor.megatron.expert_model_parallel_size=8',
 'actor_rollout_ref.actor.megatron.pipeline_model_parallel_size=2',
 'actor_rollout_ref.actor.megatron.tensor_model_parallel_size=4',
 'actor_rollout_ref.actor.use_kl_loss=True',
 'actor_rollout_ref.actor.kl_loss_coef=0.01',
 'actor_rollout_ref.actor.kl_loss_type=low_var_kl',
 'actor_rollout_ref.actor.entropy_coeff=0',
 'actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=1',
 'actor_rollout_ref.rollout.tensor_model_parallel_size=4',
 'actor_rollout_ref.actor.use_dynamic_bsz=True',
 'actor_rollout_ref.actor.ppo_max_token_len_per_gpu=6144',
 'actor_rollout_ref.ref.log_prob_use_dynamic_bsz=True',
 'actor_rollout_ref.ref.log_prob_max_token_len_per_gpu=20480',
 'actor_rollout_ref.rollout.log_prob_use_dynamic_bsz=True',
 'actor_rollout_ref.rollout.log_prob_max_token_len_per_gpu=20480',
 'actor_rollout_ref.rollout.name=vllm',
 '+actor_rollout_ref.rollout.engine_kwargs.vllm.disable_mm_preprocessor_cache=True',
 'actor_rollout_ref.rollout.gpu_memory_utilization=0.7',
 'actor_rollout_ref.rollout.n=8',
 'actor_rollout_ref.rollout.temperature=1.0',
 'actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=1',
 'actor_rollout_ref.actor.megatron.use_mbridge=True',
 'actor_rollout_ref.actor.megatron.param_offload=True',
 'actor_rollout_ref.actor.megatron.optimizer_offload=True',
 'actor_rollout_ref.actor.megatron.grad_offload=True',
 'actor_rollout_ref.ref.megatron.param_offload=True',
 '+actor_rollout_ref.actor.optim.override_optimizer_config.optimizer_offload_fraction=1',
 '+actor_rollout_ref.actor.optim.override_optimizer_config.overlap_cpu_optimizer_d2h_h2d=True',
 '+actor_rollout_ref.actor.optim.override_optimizer_config.use_precision_aware_optimizer=True',
 '+actor_rollout_ref.actor.optim.override_optimizer_config.optimizer_cpu_offload=True',
 '+actor_rollout_ref.actor.megatron.override_transformer_config.moe_router_dtype=fp32',
 '+actor_rollout_ref.actor.megatron.override_transformer_config.moe_enable_deepep=True',
 '+actor_rollout_ref.actor.megatron.override_transformer_config.moe_token_dispatcher_type=flex',
 '+actor_rollout_ref.actor.megatron.override_transformer_config.recompute_method=uniform',
 '+actor_rollout_ref.actor.megatron.override_transformer_config.recompute_granularity=full',
 '+actor_rollout_ref.actor.megatron.override_transformer_config.recompute_num_layers=1',
 '+actor_rollout_ref.actor.megatron.override_transformer_config.gradient_accumulation_fusion=True',
 '+actor_rollout_ref.actor.megatron.override_transformer_config.moe_permute_fusion=True',
 'algorithm.use_kl_in_reward=False',
 'trainer.default_local_dir=/apdcephfs_303690327/share_303690327/wangrui/outputs/rl/1103/output1',
 'trainer.critic_warmup=0',
 'trainer.logger=["console","tensorboard"]',
 'trainer.project_name=verl_grpo_gui_1021_4',
 'trainer.experiment_name=qwen2_5_vl_7b_function_rm',
 'trainer.n_gpus_per_node=8',
 'trainer.nnodes=2',
 'trainer.save_freq=30',
 'trainer.test_freq=30',
 'trainer.total_epochs=20']

(TaskRunner pid=865216) Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::WorkerDict.actor_rollout_load_checkpoint() (pid=162956, ip=29.160.161.21, actor_id=35dbd3691cbdd52988a2912d02000000, repr=<verl.single_controller.ray.base.WorkerDict object at 0x7f9fe01a6780>)
(TaskRunner pid=865216)   File "/tmp/ray/session_2025-11-06_11-14-15_386490_804198/runtime_resources/working_dir_files/_ray_pkg_8c7529be4d02c87d/verl/single_controller/ray/base.py", line 700, in func
(TaskRunner pid=865216)     return getattr(self.worker_dict[key], name)(*args, **kwargs)
(TaskRunner pid=865216)   File "/tmp/ray/session_2025-11-06_11-14-15_386490_804198/runtime_resources/working_dir_files/_ray_pkg_8c7529be4d02c87d/verl/single_controller/base/decorator.py", line 442, in inner
(TaskRunner pid=865216)     return func(*args, **kwargs)
(TaskRunner pid=865216)   File "/tmp/ray/session_2025-11-06_11-14-15_386490_804198/runtime_resources/working_dir_files/_ray_pkg_8c7529be4d02c87d/verl/utils/transferqueue_utils.py", line 199, in dummy_inner
(TaskRunner pid=865216)     return func(*args, **kwargs)
(TaskRunner pid=865216)   File "/tmp/ray/session_2025-11-06_11-14-15_386490_804198/runtime_resources/working_dir_files/_ray_pkg_8c7529be4d02c87d/verl/workers/megatron_workers.py", line 774, in load_checkpoint
(TaskRunner pid=865216)     offload_megatron_model_to_cpu(self.actor_module)
(TaskRunner pid=865216)   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(TaskRunner pid=865216)     return func(*args, **kwargs)
(TaskRunner pid=865216)   File "/tmp/ray/session_2025-11-06_11-14-15_386490_804198/runtime_resources/working_dir_files/_ray_pkg_8c7529be4d02c87d/verl/utils/megatron_utils.py", line 337, in offload_megatron_model_to_cpu
(TaskRunner pid=865216)     buffer.param_data.cpu_data = buffer.param_data.data.cpu().pin_memory()
(TaskRunner pid=865216) torch.AcceleratorError: CUDA error: invalid argument
(TaskRunner pid=865216) CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(TaskRunner pid=865216) For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(TaskRunner pid=865216) Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Traceback (most recent call last):
  File "/tmp/ray/session_2025-11-06_11-14-15_386490_804198/runtime_resources/working_dir_files/_ray_pkg_8c7529be4d02c87d/verl/trainer/main_ppo.py", line 42, in main
    run_ppo(config)
  File "/tmp/ray/session_2025-11-06_11-14-15_386490_804198/runtime_resources/working_dir_files/_ray_pkg_8c7529be4d02c87d/verl/trainer/main_ppo.py", line 96, in run_ppo
    ray.get(runner.run.remote(config))
  File "/usr/local/lib/python3.12/dist-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.12/dist-packages/ray/_private/client_mode_hook.py", line 104, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.12/dist-packages/ray/_private/worker.py", line 2882, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
  File "/usr/local/lib/python3.12/dist-packages/ray/_private/worker.py", line 968, in get_objects
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(AcceleratorError): ray::TaskRunner.run() (pid=865216, ip=29.160.160.92, actor_id=b534ae2ee51facbe754d987a02000000, repr=<main_ppo.TaskRunner object at 0x7efcad8b5610>)
  File "/tmp/ray/session_2025-11-06_11-14-15_386490_804198/runtime_resources/working_dir_files/_ray_pkg_8c7529be4d02c87d/verl/trainer/main_ppo.py", line 341, in run
    trainer.fit()
  File "/tmp/ray/session_2025-11-06_11-14-15_386490_804198/runtime_resources/working_dir_files/_ray_pkg_8c7529be4d02c87d/verl/trainer/ppo/ray_trainer.py", line 1037, in fit
    self._load_checkpoint()
  File "/tmp/ray/session_2025-11-06_11-14-15_386490_804198/runtime_resources/working_dir_files/_ray_pkg_8c7529be4d02c87d/verl/trainer/ppo/ray_trainer.py", line 880, in _load_checkpoint
    self.actor_rollout_wg.load_checkpoint(
  File "/tmp/ray/session_2025-11-06_11-14-15_386490_804198/runtime_resources/working_dir_files/_ray_pkg_8c7529be4d02c87d/verl/single_controller/ray/base.py", line 48, in call
    output = ray.get(output)
ray.exceptions.RayTaskError(AcceleratorError): ray::WorkerDict.actor_rollout_load_checkpoint() (pid=162954, ip=29.160.161.21, actor_id=8f443212f0c29a65a2de0a9a02000000, repr=<verl.single_controller.ray.base.WorkerDict object at 0x7f6231b41dc0>)
  File "/tmp/ray/session_2025-11-06_11-14-15_386490_804198/runtime_resources/working_dir_files/_ray_pkg_8c7529be4d02c87d/verl/single_controller/ray/base.py", line 700, in func
    return getattr(self.worker_dict[key], name)(*args, **kwargs)
  File "/tmp/ray/session_2025-11-06_11-14-15_386490_804198/runtime_resources/working_dir_files/_ray_pkg_8c7529be4d02c87d/verl/single_controller/base/decorator.py", line 442, in inner
    return func(*args, **kwargs)
  File "/tmp/ray/session_2025-11-06_11-14-15_386490_804198/runtime_resources/working_dir_files/_ray_pkg_8c7529be4d02c87d/verl/utils/transferqueue_utils.py", line 199, in dummy_inner
    return func(*args, **kwargs)
  File "/tmp/ray/session_2025-11-06_11-14-15_386490_804198/runtime_resources/working_dir_files/_ray_pkg_8c7529be4d02c87d/verl/workers/megatron_workers.py", line 774, in load_checkpoint
    offload_megatron_model_to_cpu(self.actor_module)
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
  File "/tmp/ray/session_2025-11-06_11-14-15_386490_804198/runtime_resources/working_dir_files/_ray_pkg_8c7529be4d02c87d/verl/utils/megatron_utils.py", line 337, in offload_megatron_model_to_cpu
    buffer.param_data.cpu_data = buffer.param_data.data.cpu().pin_memory()
torch.AcceleratorError: CUDA error: invalid argument
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

(TaskRunner pid=865216) Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::WorkerDict.actor_rollout_load_checkpoint(); the identical traceback (megatron_workers.py line 774, load_checkpoint -> megatron_utils.py line 337, offload_megatron_model_to_cpu -> buffer.param_data.cpu_data = buffer.param_data.data.cpu().pin_memory() -> torch.AcceleratorError: CUDA error: invalid argument) is reported again for the remaining workers on ip=29.160.161.21: pid=162955, 162957, 162960, 162959, 162953, and 162958.

(WorkerDict pid=162960, ip=29.160.161.21) kwargs: {'n': 1, 'logprobs': 0, 'max_tokens': 2048, 'repetition_penalty': 1.0, 'detokenize': False, 'temperature': 1.0, 'top_k': -1, 'top_p': 1, 'ignore_eos': False} [repeated 15x across cluster]
(WorkerDict pid=162953, ip=29.160.161.21) INFO:2025-11-06 11:27:32,262:[Rank 8] Loaded optimizer checkpoint from /apdcephfs_303690327/share_303690327/wangrui/outputs/rl/1103/output1/global_step_540/actor [repeated 7x across cluster]
(WorkerDict pid=162953, ip=29.160.161.21) INFO:2025-11-06 11:27:32,263:[Rank 8] Loaded RNG states from /apdcephfs_303690327/share_303690327/wangrui/outputs/rl/1103/output1/global_step_540/actor [repeated 7x across cluster]

--------------------------------------- Job 'raysubmit_LcEJ2U8fqKdX8DgP' failed ---------------------------------------

Status message: Job entrypoint command failed with exit code 1, last available logs (truncated to 20,000 chars):
(TaskRunner pid=865216)   File "/tmp/ray/session_2025-11-06_11-14-15_386490_804198/runtime_resources/working_dir_files/_ray_pkg_8c7529be4d02c87d/verl/utils/megatron_utils.py", line 337, in offload_megatron_model_to_cpu
(TaskRunner pid=865216)     buffer.param_data.cpu_data = buffer.param_data.data.cpu().pin_memory()
(TaskRunner pid=865216) torch.AcceleratorError: CUDA error: invalid argument
(TaskRunner pid=865216) CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(TaskRunner pid=865216) For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(TaskRunner pid=865216) Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
(WorkerDict pid=162960, ip=29.160.161.21) kwargs: {'n': 1, 'logprobs': 0, 'max_tokens': 2048, 'repetition_penalty': 1.0, 'detokenize': False, 'temperature': 1.0, 'top_k': -1, 'top_p': 1, 'ignore_eos': False} [repeated 15x across cluster]
(WorkerDict pid=162953, ip=29.160.161.21) INFO:2025-11-06 11:27:32,262:[Rank 8] Loaded optimizer checkpoint from /apdcephfs_303690327/share_303690327/wangrui/outputs/rl/1103/output1/global_step_540/actor [repeated 7x across cluster]
(WorkerDict pid=162953, ip=29.160.161.21) INFO:2025-11-06 11:27:32,263:[Rank 8] Loaded RNG states from /apdcephfs_303690327/share_303690327/wangrui/outputs/rl/1103/output1/global_step_540/actor [repeated 7x across cluster]
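
The failing call is the pin_memory() inside offload_megatron_model_to_cpu (verl/utils/megatron_utils.py, line 337 in my checkout). For reference, here is a minimal standalone sketch (my own, not from verl; the sizes and the helper name try_pin are assumptions) to check whether pinning a large host buffer fails on these nodes at all, outside of verl/Megatron:

import torch

def try_pin(num_bytes: int) -> None:
    # Mirror the failing pattern in offload_megatron_model_to_cpu:
    # allocate on GPU, copy to CPU, then pin the host copy.
    gpu_buf = torch.empty(num_bytes // 2, dtype=torch.bfloat16, device="cuda")
    try:
        host_buf = gpu_buf.cpu().pin_memory()
        print(f"pinned {host_buf.numel() * host_buf.element_size() / 2**30:.1f} GiB OK")
    except Exception as err:  # the verl run hits "CUDA error: invalid argument" at the equivalent call
        print(f"pin_memory failed at {num_bytes / 2**30:.1f} GiB: {err}")
    finally:
        del gpu_buf
        torch.cuda.empty_cache()

if __name__ == "__main__":
    # Sizes are guesses; adjust to roughly match one Megatron param buffer of the 30B MoE model.
    for gib in (1, 4, 8, 16):
        try_pin(gib * 2**30)

If this standalone pin succeeds even at the larger sizes, the failure is probably tied to the state or size of the mbridge/Megatron param buffers at load time rather than to the node's general ability to pin host memory.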

Expected behavior

With Megatron as the backend, it should normally be possible to resume training directly from an already-saved checkpoint.
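
As a temporary local workaround sketch (my own assumption, not a verl-provided fix), the offload path could fall back to an unpinned CPU copy when pin_memory() is rejected by the driver; offload/onload would be slower without pinned memory, but resuming could proceed. The helper name to_cpu_maybe_pinned is hypothetical:

import torch

def to_cpu_maybe_pinned(gpu_tensor: torch.Tensor) -> torch.Tensor:
    # Copy to host first, then try to pin; keep the plain copy if pinning is rejected.
    cpu_copy = gpu_tensor.detach().cpu()
    try:
        return cpu_copy.pin_memory()
    except Exception as err:  # e.g. "CUDA error: invalid argument" on very large buffers
        print(f"pin_memory failed ({err}); keeping an unpinned CPU copy")
        return cpu_copy

# Hypothetical usage at the failing site in verl/utils/megatron_utils.py (line 337 in my checkout), replacing
#     buffer.param_data.cpu_data = buffer.param_data.data.cpu().pin_memory()
# with
#     buffer.param_data.cpu_data = to_cpu_maybe_pinned(buffer.param_data.data)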

oswen avatar Nov 06 '25 04:11 oswen

same issue

luzengxiangcn avatar Nov 12 '25 08:11 luzengxiangcn