mindspeed train error: group_type currently only support -1 and 0, current value is 2
I'm using the verl main branch as of 10/30, with the environment built per Dockerfile.ascend_8.2.rc1_a2, running recipe/dapo/run_dapo_qwen3_moe_30b_megatron_npu.sh. Initialization and rollout both complete fine, but training fails with:
ray.exceptions.RayTaskError(RuntimeError): ray::WorkerDict.actor_rollout_update_actor() (pid=490071, ip=172.16.2.11, actor_id=dee0d43a6f32372ec4ff655e04000000, repr=<verl.single_controller.ray.base.WorkerDict object at 0xffcf8e378eb0>)
File "/cache/ray_temp/session_2025-10-31_17-44-09_992100_1145212/runtime_resources/working_dir_files/_ray_pkg_d85728c4d7bda8f2/verl/single_controller/ray/base.py", line 700, in func
return getattr(self.worker_dict[key], name)(*args, **kwargs)
File "/cache/ray_temp/session_2025-10-31_17-44-09_992100_1145212/runtime_resources/working_dir_files/_ray_pkg_d85728c4d7bda8f2/verl/single_controller/base/decorator.py", line 442, in inner
return func(*args, **kwargs)
File "/cache/ray_temp/session_2025-10-31_17-44-09_992100_1145212/runtime_resources/working_dir_files/_ray_pkg_d85728c4d7bda8f2/verl/utils/transferqueue_utils.py", line 199, in dummy_inner
return func(*args, **kwargs)
File "/cache/ray_temp/session_2025-10-31_17-44-09_992100_1145212/runtime_resources/working_dir_files/_ray_pkg_d85728c4d7bda8f2/verl/utils/profiler/performance.py", line 105, in f
return self.log(decorated_function, *args, **kwargs)
File "/cache/ray_temp/session_2025-10-31_17-44-09_992100_1145212/runtime_resources/working_dir_files/_ray_pkg_d85728c4d7bda8f2/verl/utils/profiler/performance.py", line 118, in log
output = func(*args, **kwargs)
File "/cache/ray_temp/session_2025-10-31_17-44-09_992100_1145212/runtime_resources/working_dir_files/_ray_pkg_d85728c4d7bda8f2/verl/utils/profiler/profile.py", line 256, in wrapper
return func(self_instance, *args, **kwargs_inner)
File "/cache/ray_temp/session_2025-10-31_17-44-09_992100_1145212/runtime_resources/working_dir_files/_ray_pkg_d85728c4d7bda8f2/verl/workers/megatron_workers.py", line 632, in update_actor
metrics = self.actor.update_policy(dataloader=dataloader)
File "/cache/ray_temp/session_2025-10-31_17-44-09_992100_1145212/runtime_resources/working_dir_files/_ray_pkg_d85728c4d7bda8f2/verl/utils/profiler/performance.py", line 105, in f
return self.log(decorated_function, *args, **kwargs)
File "/cache/ray_temp/session_2025-10-31_17-44-09_992100_1145212/runtime_resources/working_dir_files/_ray_pkg_d85728c4d7bda8f2/verl/utils/profiler/performance.py", line 118, in log
output = func(*args, **kwargs)
File "/cache/ray_temp/session_2025-10-31_17-44-09_992100_1145212/runtime_resources/working_dir_files/_ray_pkg_d85728c4d7bda8f2/verl/workers/actor/megatron_actor.py", line 648, in update_policy
metric_micro_batch = self.forward_backward_batch(
File "/cache/ray_temp/session_2025-10-31_17-44-09_992100_1145212/runtime_resources/working_dir_files/_ray_pkg_d85728c4d7bda8f2/verl/workers/actor/megatron_actor.py", line 586, in forward_backward_batch
losses_reduced = forward_backward_func(
File "/cache/algo/Megatron-LM/megatron/core/pipeline_parallel/schedules.py", line 1932, in forward_backward_pipelining_without_interleaving
input_tensor_grad = backward_step(
File "/cache/algo/Megatron-LM/megatron/core/pipeline_parallel/schedules.py", line 395, in backward_step
torch.autograd.backward(output_tensor[0], grad_tensors=output_tensor_grad[0])
File "/cache/verl_env/lib/python3.10/site-packages/torch/autograd/__init__.py", line 353, in backward
_engine_run_backward(
File "/cache/verl_env/lib/python3.10/site-packages/torch/autograd/graph.py", line 824, in _engine_run_backward
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/cache/verl_env/lib/python3.10/site-packages/torch/autograd/function.py", line 307, in apply
return user_fn(self, *args)
File "/cache/algo/Megatron-LM/megatron/core/tensor_parallel/random.py", line 455, in backward
torch.autograd.backward(outputs, args)
File "/cache/verl_env/lib/python3.10/site-packages/torch/autograd/__init__.py", line 353, in backward
_engine_run_backward(
File "/cache/verl_env/lib/python3.10/site-packages/torch/autograd/graph.py", line 824, in _engine_run_backward
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/cache/verl_env/lib/python3.10/site-packages/torch/autograd/function.py", line 307, in apply
return user_fn(self, *args)
File "/cache/algo/MindSpeed/mindspeed/te/pytorch/module/grouped_linear.py", line 60, in backward
grad_weight = torch_npu.npu_grouped_matmul([inp.T], [grad_output], bias=None, group_list=group_list,
File "/cache/verl_env/lib/python3.10/site-packages/torch/_ops.py", line 1158, in __call__
return self._op(*args, **(kwargs or {}))
RuntimeError: group_type currently only support -1 and 0, current value is 2
I also tried upgrading vLLM (vllm 0.11.0 + vllm 0.11.0rc0 + torch 2.7.1) and hit the same error, while training with FSDP works fine.
@Shangwei-Li @wlf-darkmatter could you take a look?
torch_npu was 2.7.1.dev20250724 before; after upgrading to 2.7.1.dev20250919 it works. Many thanks!
With MoE + MindSpeed, the error is raised once execution reaches grouped_linear; that MindSpeed code requires the matching PTA 930 release. CI has the same problem and has to wait for the CI image to pick up the PTA 930 package. If you don't upgrade PTA, you can modify the code following @wlf-darkmatter's approach.
For the concrete change, see https://gitee.com/ascend/MindSpeed/pulls/2791/files (mindspeed/te/pytorch/module/grouped_linear.py).
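For reference, the general shape of such a workaround; this is only a hypothetical sketch (the helper name is made up, and the shapes are assumed from the traceback), not the content of the PR:

```python
# Hypothetical fallback sketch only -- the authoritative fix is the Gitee PR above.
# Assumption (from the traceback): `inp` is [tokens, in_features], `grad_output` is
# [tokens, out_features], and `m_splits` holds the per-expert token counts that
# MindSpeed turns into `group_list`. On a torch_npu build whose npu_grouped_matmul
# rejects group_type=2, the per-expert weight gradient can be computed with plain
# matmuls over the token splits instead of the fused grouped kernel.
import torch

def grouped_weight_grad_fallback(inp: torch.Tensor,
                                 grad_output: torch.Tensor,
                                 m_splits: list[int]) -> list[torch.Tensor]:
    """Return one dW per expert: dW_i = x_i^T @ dy_i over that expert's token slice."""
    grad_weights = []
    for x_i, dy_i in zip(torch.split(inp, m_splits, dim=0),
                         torch.split(grad_output, m_splits, dim=0)):
        grad_weights.append(x_i.transpose(0, 1) @ dy_i)
    return grad_weights
```

This trades the fused grouped kernel for one small matmul per expert, so it is only a stopgap until the matching PTA (torch_npu) 930 package is available.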
With MoE + MindSpeed, the error is raised once execution reaches grouped_linear; that MindSpeed code requires the matching PTA 930 release. CI has the same problem and has to wait for the CI image to pick up the PTA 930 package. If you don't upgrade PTA, you can modify the code following @wlf-darkmatter's approach.
@tardis-key Is PTA pytorch-ascend (torch_npu), or some other library? I did not modify the MindSpeed code. With the torch_npu 2.7.1 release it ran for 4 steps and then errored out, running with export HCCL_OP_EXPANSION_MODE=AIV.
verl log:
File "/cache/algo/Megatron-LM/megatron/core/transformer/moe/moe_layer.py", line 159, in forward
output, mlp_bias = custom_forward(hidden_states)
File "/cache/algo/Megatron-LM/megatron/core/transformer/moe/moe_layer.py", line 146, in custom_forward
expert_output, mlp_bias = self.experts(
File "/cache/verl_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/cache/verl_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
return forward_call(*args, **kwargs)
File "/cache/algo/Megatron-LM/megatron/core/transformer/moe/experts.py", line 760, in forward
intermediate_parallel, bias_parallel = self.linear_fc1(
File "/cache/verl_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/cache/verl_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
return forward_call(*args, **kwargs)
File "/cache/algo/MindSpeed/mindspeed/te/pytorch/module/grouped_linear.py", line 136, in forward
output = MindSpeedTEGroupedLinearGMM.apply(x, m_splits, group_list_type, self.total_weight, *self.total_weight_T)
File "/cache/verl_env/lib/python3.10/site-packages/torch/autograd/function.py", line 575, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
File "/cache/algo/MindSpeed/mindspeed/te/pytorch/module/grouped_linear.py", line 40, in forward
ctx.group_list = torch.tensor(m_split, device='npu', dtype=torch.int64)
File "/cache/verl_env/lib/python3.10/site-packages/torch_npu/contrib/transfer_to_npu.py", line 182, in decorated
return fn(*args, **kwargs)
RuntimeError: The Inner error is reported as above. The process exits for this inner error, and the current working operator name is HcclAlltoAllV.
plog:
[ERROR] RUNTIME(1844773,r_rollout_compute_log_prob):2025-12-02-19:44:45.868.069 [npu_driver.cc:1457]1846391 DevMemAllocHugePageManaged:[drv api] halMemAlloc failed:size=231534592(bytes), type=16, moduleId=3, drvFlag=216172782156923907, drvRetCode=6, device_id=3, ErrCode=207001, desc=[driver error:out of memory], InnerCode=0x7020016
[ERROR] RUNTIME(1844773,r_rollout_compute_log_prob):2025-12-02-19:44:45.946.486 [npu_driver.cc:1579]1846391 DevMemAllocManaged:[drv api] halMemAlloc failed:size=231534592(bytes), type=16, moduleId=3, drvFlag=216172782156792835, drvRetCode=6, device_id=3, ErrCode=207001, desc=[driver error:out of memory], InnerCode=0x7020016
[ERROR] HCCL(1844773,r_rollout_compute_log_prob):2025-12-02-19:44:45.946.584 [adapter_rts.cc:652] [1846391][Malloc][Mem]errNo[0x000000000500000f] rtMalloc failed, return[207001], para: devPtrAddr[(nil)], size[231534592 Byte].
[ERROR] HCCL(1844773,r_rollout_compute_log_prob):2025-12-02-19:44:45.946.607 [mem_device.cc:48] [1846391][DeviceMem][Alloc]rt_malloc error, ret[15], size[231534592 Byte]
[ERROR] HCCL(1844773,r_rollout_compute_log_prob):2025-12-02-19:44:45.946.614 [hccl_communicator.cc:1483] [1846391][AllocOpBaseModeScratchMem]errNo[0x0000000005000002]ptr [algResResponse.scratchMem.ptr()] is nullptr, return HCCL_E_PTR
[ERROR] HCCL(1844773,r_rollout_compute_log_prob):2025-12-02-19:44:45.946.620 [hccl_communicator_host.cc:5668] [1846391][AllocAlgResource]call trace: hcclRet -> 2
[ERROR] HCCL(1844773,r_rollout_compute_log_prob):2025-12-02-19:44:45.946.625 [hccl_communicator_host.cc:4330] [1846391][ExecOpAlltoAll]call trace: hcclRet -> 2
[ERROR] HCCL(1844773,r_rollout_compute_log_prob):2025-12-02-19:44:45.946.672 [hccl_communicator_host.cc:2863] [1846391][AlltoAllVOutPlace]call trace: hcclRet -> 2
[ERROR] HCCL(1844773,r_rollout_compute_log_prob):2025-12-02-19:44:45.946.679 [hccl_comm.cc:278] [1846391][HcclComm][ALLTOALLV_group_name_136]errNo[0x0000000000000002] index[30325]
[ERROR] HCCL(1844773,r_rollout_compute_log_prob):2025-12-02-19:44:45.946.689 [op_base.cc:2702] [1846391][HcclAlltoAllV]call trace: hcclRet -> 2
[ERROR] APP(1844773,r_rollout_compute_log_prob):2025-12-02-19:44:45.946.701 [log_inner.cpp:77]1846391 build/CMakeFiles/torch_npu.dir/compiler_depend.ts:ExecFuncOpApi:452: "[PTA]:"Custom hand fail! name=HcclAlltoAllV, ret=2""
[ERROR] APP(1844773,r_rollout_compute_log_prob):2025-12-02-19:44:45.946.715 [log_inner.cpp:77]1846391 build/CMakeFiles/torch_npu.dir/compiler_depend.ts:ReadQueue:438: "[PTA]:"---Thread---281267782280864: device = 0, write_idx = 321, read_idx = 305, status = 1, ret = 2""
[ERROR] APP(1844773,r_rollout_compute_log_prob):2025-12-02-19:44:45.948.641 [log_inner.cpp:77]1845996 build/CMakeFiles/torch_npu.dir/compiler_depend.ts:MakeSureQueueEmpty:343: "[PTA]:"Inner error happened, detail: the current working operator name is HcclAlltoAllV""
[ERROR] APP(1844773,):2025-12-02-19:44:45.967.649 [log_inner.cpp:77]1846395 build/CMakeFiles/torch_npu.dir/compiler_depend.ts:Enqueue:527: "[PTA]:"Inner error happened with CAN_EXIT status, detail: the current working operator name is HcclAlltoAllV""
[ERROR] APP(1844773,):2025-12-02-19:44:45.969.144 [log_inner.cpp:77]1846395 build/CMakeFiles/torch_npu.dir/compiler_depend.ts:Enqueue:527: "[PTA]:"Inner error happened with CAN_EXIT status, detail: the current working operator name is HcclAlltoAllV""
[ERROR] APP(1844773,):2025-12-02-19:44:45.969.164 [log_inner.cpp:77]1846395 build/CMakeFiles/torch_npu.dir/compiler_depend.ts:Enqueue:527: "[PTA]:"Inner error happened with CAN_EXIT status, detail: the current working operator name is HcclAlltoAllV""
[ERROR] APP(1844773,):2025-12-02-19:44:45.969.174 [log_inner.cpp:77]1846395 build/CMakeFiles/torch_npu.dir/compiler_depend.ts:Enqueue:527: "[PTA]:"Inner error happened with CAN_EXIT status, detail: the current working operator name is HcclAlltoAllV""
[ERROR] APP(1844773,):2025-12-02-19:44:45.969.183 [log_inner.cpp:77]1846395 build/CMakeFiles/torch_npu.dir/compiler_depend.ts:Enqueue:527: "[PTA]:"Inner error happened with CAN_EXIT status, detail: the current working operator name is HcclAlltoAllV""
[ERROR] APP(1844773,):2025-12-02-19:44:45.969.201 [log_inner.cpp:77]1846395 build/CMakeFiles/torch_npu.dir/compiler_depend.ts:Enqueue:527: "[PTA]:"Inner error happened with CAN_EXIT status, detail: the current working operator name is HcclAlltoAllV""
[ERROR] APP(1844773,):2025-12-02-19:44:45.969.209 [log_inner.cpp:77]1846395 build/CMakeFiles/torch_npu.dir/compiler_depend.ts:Enqueue:527: "[PTA]:"Inner error happened with CAN_EXIT status, detail: the current working operator name is HcclAlltoAllV""
[ERROR] APP(1844773,):2025-12-02-19:44:45.969.216 [log_inner.cpp:77]1846395 build/CMakeFiles/torch_npu.dir/compiler_depend.ts:Enqueue:527: "[PTA]:"Inner error happened with CAN_EXIT status, detail: the current working operator name is HcclAlltoAllV""
[ERROR] APP(1844773,):2025-12-02-19:44:45.969.223 [log_inner.cpp:77]1846395 build/CMakeFiles/torch_npu.dir/compiler_depend.ts:Enqueue:527: "[PTA]:"Inner error happened with CAN_EXIT status, detail: the current working operator name is HcclAlltoAllV""
PTA is, simply put, torch_npu. The error log quite clearly shows driver error: out of memory, so you can try lowering the rollout gpu_utilization. Also, the failure is on the MoE AllToAllV, which may be caused by load imbalance; try shortening the sequence length or changing the EP partitioning (a rough illustration of the memory effect is sketched below).
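To make the load-imbalance point concrete, a hypothetical back-of-envelope sketch (all numbers made up, and this is not the actual HCCL scratch-buffer formula): the receive buffer one EP rank needs for an AllToAllV dispatch scales with the tokens routed to it, so a single hot expert rank can need far more memory than in the balanced case.

```python
# Hypothetical illustration only -- not verl or HCCL internals.
def recv_buffer_mib(tokens_routed_to_rank: int, hidden: int, bytes_per_elem: int = 2) -> float:
    """Approximate AllToAllV receive-buffer size on one EP rank, in MiB (bf16 activations)."""
    return tokens_routed_to_rank * hidden * bytes_per_elem / 2**20

total_tokens = 65536  # tokens dispatched by one MoE layer (made-up number)
ep_size = 16          # expert-parallel world size (made-up number)
hidden = 2048         # hidden size per token (made-up number)

print(recv_buffer_mib(total_tokens // ep_size, hidden))  # balanced routing: 16.0 MiB
print(recv_buffer_mib(int(total_tokens * 0.9), hidden))  # one hot rank gets 90%: ~230 MiB
```

Shorter sequences shrink total_tokens, and a different EP split changes how badly a single rank can be overloaded, which is why both suggestions help.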
@wlf-darkmatter Thanks a lot for the reply. The OOM was reported in a different place this time so I didn't notice it; in earlier runs it would surface in the verl log. I have another question about NPU memory: I modified _get_current_mem_info in verl/utils/profiler/performance.py:
def _get_current_mem_info(unit: str = "GB", precision: int = 2) -> tuple[str]:
    """Get current memory usage.

    Note that CPU device memory info is always 0.

    Args:
        unit (str, optional): The unit of memory measurement. Defaults to "GB".
        precision (int, optional): The number of decimal places to round memory values. Defaults to 2.

    Returns:
        tuple[str]: A tuple containing memory allocated, max memory allocated, memory reserved,
            max memory reserved, memory used, and memory total in the specified unit.
    """
    assert unit in ["GB", "MB", "KB"]
    device = get_torch_device()
    # torch.cpu.memory_allocated() does not exist
    if device == torch.cpu:
        return "0.00", "0.00", "0.00", "0.00", "0.00", "0.00"
    divisor = 1024**3 if unit == "GB" else 1024**2 if unit == "MB" else 1024
    mem_allocated = get_torch_device().memory_allocated()
    max_memory_allocated = get_torch_device().max_memory_allocated()
    mem_reserved = get_torch_device().memory_reserved()
    max_memory_reserved = get_torch_device().max_memory_reserved()
    # use get_torch_device().mem_get_info to profile device memory
    # since vllm's sleep mode works below pytorch
    # see https://github.com/vllm-project/vllm/pull/11743#issuecomment-2754338119
    mem_free, mem_total = get_torch_device().mem_get_info()
    mem_used = mem_total - mem_free
    mem_allocated = f"{mem_allocated / divisor:.{precision}f}"
    max_memory_allocated = f"{max_memory_allocated / divisor:.{precision}f}"
    mem_reserved = f"{mem_reserved / divisor:.{precision}f}"
    max_memory_reserved = f"{max_memory_reserved / divisor:.{precision}f}"
    mem_used = f"{mem_used / divisor:.{precision}f}"
    mem_total = f"{mem_total / divisor:.{precision}f}"
    get_torch_device().reset_peak_memory_stats()
    return mem_allocated, max_memory_allocated, mem_reserved, max_memory_reserved, mem_used, mem_total
I added max_memory_reserved and max_memory_allocated, but after upgrading verl and torch_npu the numbers have become inaccurate. For example, this printout:
[/cache/algo/verl/verl/workers/megatron_workers.py:89] => update_actor After update_actor, memory allocated (GB): 40.29, max memory allocated (GB): 57.37, memory reserved (GB): 44.35, max memory reserved (GB): 58.85, device memory used/total (GB): 9.79/60.96
Previously, memory allocated was the device memory torch was currently using, max memory allocated was the peak usage since the previous printout, and memory reserved was roughly equal to device memory used. I used memory allocated / max memory allocated to tune the model partitioning strategy, but now they are off: memory allocated (40 GB) is clearly larger than device memory used. At the "After update_actor" point the actual usage should be the 9.79 GB reported as device memory used, but with the other numbers inaccurate I've effectively lost the max memory allocated metric. Do you know what causes this?
Because vLLM manages the relevant memory allocations itself, torch is unaware of memory that vLLM frees, so the figures printed through torch's APIs are distorted. This has been the case in both the old and the new versions; it is not caused by the upgrade. In extreme cases you can even observe memory allocated exceeding the device's total capacity.
I suggest judging the partitioning from the actual npu-smi info output during training.
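A minimal sketch of that cross-check, assuming verl's get_torch_device helper (the import path below is an assumption): print the allocator's view and the driver's view side by side.

```python
# Sketch: compare the torch allocator's view with the driver-level view that
# should be trusted here. Assumes verl's get_torch_device() helper; the import
# path is an assumption.
from verl.utils.device import get_torch_device

def report_device_memory(tag: str) -> None:
    dev = get_torch_device()
    allocated = dev.memory_allocated() / 2**30  # what the torch allocator thinks it holds
    reserved = dev.memory_reserved() / 2**30    # allocator cache, also torch's view
    free, total = dev.mem_get_info()            # driver view, reflects vLLM's sleep mode
    print(f"{tag}: torch allocated={allocated:.2f} GB, reserved={reserved:.2f} GB, "
          f"device used={(total - free) / 2**30:.2f}/{total / 2**30:.2f} GB")
```

When vLLM has released memory underneath torch, allocated/reserved stay high while the mem_get_info figure drops, which matches the ~40 GB vs. 9.79 GB gap in the printout above.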
Got it, I had missed it; the comments already explain this. So the device memory used/total (GB): 9.79/60.96 figure is the accurate one.
mem_allocated = get_torch_device().memory_allocated()
max_memory_allocated = get_torch_device().max_memory_allocated()
mem_reserved = get_torch_device().memory_reserved()
max_memory_reserved = get_torch_device().max_memory_reserved()
# use get_torch_device().mem_get_info to profile device memory
# since vllm's sleep mode works below pytorch
# see https://github.com/vllm-project/vllm/pull/11743#issuecomment-2754338119
mem_free, mem_total = get_torch_device().mem_get_info()
@wlf-darkmatter I'd like to ask about another issue: some device memory is not released after a MindSpeed training step. In step 1, at "compute_log_prob Before compute_log_prob" the usage is device memory used/total (GB): 3.37/60.96, but after one training step it becomes device memory used/total (GB): 10.33/60.96, i.e. 7 GB more. The 7 GB residual appears somewhere between "megatron actor Before compute_log_prob" and "update_actor After update_actor", while rollout behaves normally: both "Before rollout offload" and "After rollout offload" release about 46 GB. This run hit an out-of-memory error at the third rollout when loading the model and KV cache. I could of course lower gpu_utilization, but I'd like to reserve more memory for the KV cache, so I'd like to understand why this memory is not released.
Below is the log of this worker:
[36m(WorkerDict pid=393228, ip=172.16.0.197)[0m [Rank 0 | Local Rank 0] 2025-12-04 09:09:31,091 INFO [/cache/algo/verl/verl/workers/megatron_workers.py:103] => Before init actor model and optimizer, memory allocated (GB): 0.00, max memory allocated (GB): 0.00, memory reserved (GB): 0.00, max memory reserved (GB): 0.00, device memory used/total (GB): 0.34/60.96, cpu_memory: {'total': '1511.10 GB', 'used': '21.24 GB', 'free': '1486.90 GB', 'shared': '0.12 GB', 'buff/cache': '2.96 GB', 'available': '1484.52 GB'}
[36m(WorkerDict pid=393228, ip=172.16.0.197)[0m [Rank 0 | Local Rank 0] 2025-12-04 09:21:15,909 INFO [/cache/algo/verl/verl/workers/megatron_workers.py:103] => After MegatronPPOActor init, memory allocated (GB): 20.56, max memory allocated (GB): 26.75, memory reserved (GB): 27.86, max memory reserved (GB): 27.86, device memory used/total (GB): 29.17/60.96, cpu_memory: {'total': '1511.10 GB', 'used': '23.70 GB', 'free': '1433.17 GB', 'shared': '0.11 GB', 'buff/cache': '54.23 GB', 'available': '1481.57 GB'}
[36m(WorkerDict pid=393228, ip=172.16.0.197)[0m [Rank 0 | Local Rank 0] 2025-12-04 09:21:47,565 INFO [/cache/algo/verl/verl/workers/megatron_workers.py:103] => After actor optimizer init, memory allocated (GB): 20.56, max memory allocated (GB): 20.85, memory reserved (GB): 27.86, max memory reserved (GB): 27.86, device memory used/total (GB): 29.17/60.96, cpu_memory: {'total': '1511.10 GB', 'used': '348.96 GB', 'free': '1107.64 GB', 'shared': '0.11 GB', 'buff/cache': '54.50 GB', 'available': '1156.23 GB'}
[36m(WorkerDict pid=393228, ip=172.16.0.197)[0m [Rank 0 | Local Rank 0] 2025-12-04 09:21:55,492 INFO [/cache/algo/verl/verl/workers/megatron_workers.py:103] => After offload actor params and grad during init, memory allocated (GB): 0.00, max memory allocated (GB): 20.56, memory reserved (GB): 0.00, max memory reserved (GB): 27.86, device memory used/total (GB): 1.43/60.96, cpu_memory: {'total': '1511.10 GB', 'used': '316.87 GB', 'free': '1139.70 GB', 'shared': '0.11 GB', 'buff/cache': '54.53 GB', 'available': '1188.32 GB'}
[36m(WorkerDict pid=393228, ip=172.16.0.197)[0m [Rank 0 | Local Rank 0] 2025-12-04 09:21:55,626 INFO [/cache/algo/verl/verl/workers/megatron_workers.py:103] => After MegatronPPOActor init, memory allocated (GB): 0.00, max memory allocated (GB): 0.00, memory reserved (GB): 0.00, max memory reserved (GB): 0.00, device memory used/total (GB): 1.43/60.96, cpu_memory: {'total': '1511.10 GB', 'used': '316.88 GB', 'free': '1139.69 GB', 'shared': '0.11 GB', 'buff/cache': '54.53 GB', 'available': '1188.31 GB'}
[36m(WorkerDict pid=393228, ip=172.16.0.197)[0m [Rank 0 | Local Rank 0] 2025-12-04 09:21:57,033 INFO [/cache/algo/verl/verl/workers/megatron_workers.py:103] => Before building vllm rollout, memory allocated (GB): 0.00, max memory allocated (GB): 0.00, memory reserved (GB): 0.00, max memory reserved (GB): 0.00, device memory used/total (GB): 1.43/60.96, cpu_memory: {'total': '1511.10 GB', 'used': '316.85 GB', 'free': '1139.71 GB', 'shared': '0.11 GB', 'buff/cache': '54.54 GB', 'available': '1188.34 GB'}
[36m(WorkerDict pid=393228, ip=172.16.0.197)[0m [Rank 0 | Local Rank 0] 2025-12-04 09:21:59,817 INFO [/cache/algo/verl/verl/workers/megatron_workers.py:103] => After building vllm rollout, memory allocated (GB): 0.00, max memory allocated (GB): 0.00, memory reserved (GB): 0.00, max memory reserved (GB): 0.00, device memory used/total (GB): 1.43/60.96, cpu_memory: {'total': '1511.10 GB', 'used': '316.85 GB', 'free': '1139.71 GB', 'shared': '0.11 GB', 'buff/cache': '54.54 GB', 'available': '1188.34 GB'}
[36m(WorkerDict pid=393228, ip=172.16.0.197)[0m [Rank 0 | Local Rank 0] 2025-12-04 09:21:59,819 INFO [/cache/algo/verl/verl/workers/megatron_workers.py:103] => After rollout init, memory allocated (GB): 0.00, max memory allocated (GB): 0.00, memory reserved (GB): 0.00, max memory reserved (GB): 0.00, device memory used/total (GB): 1.43/60.96, cpu_memory: {'total': '1511.10 GB', 'used': '316.85 GB', 'free': '1139.71 GB', 'shared': '0.11 GB', 'buff/cache': '54.54 GB', 'available': '1188.34 GB'}
[36m(WorkerDict pid=393228, ip=172.16.0.197)[0m [Rank 0 | Local Rank 0] 2025-12-04 09:21:59,844 INFO [/cache/algo/verl/verl/workers/megatron_workers.py:103] => After init_model finish, memory allocated (GB): 0.00, max memory allocated (GB): 0.00, memory reserved (GB): 0.00, max memory reserved (GB): 0.00, device memory used/total (GB): 1.43/60.96, cpu_memory: {'total': '1511.10 GB', 'used': '316.85 GB', 'free': '1139.71 GB', 'shared': '0.11 GB', 'buff/cache': '54.54 GB', 'available': '1188.34 GB'}
[36m(WorkerDict pid=393228, ip=172.16.0.197)[0m [Rank 0 | Local Rank 0] 2025-12-04 09:23:48,943 INFO [/cache/algo/verl/verl/workers/megatron_workers.py:103] => Before rollout offload, memory allocated (GB): 45.11, max memory allocated (GB): 45.20, memory reserved (GB): 46.05, max memory reserved (GB): 46.05, device memory used/total (GB): 47.96/60.96, cpu_memory: {'total': '1511.10 GB', 'used': '320.34 GB', 'free': '1135.61 GB', 'shared': '0.11 GB', 'buff/cache': '55.15 GB', 'available': '1184.83 GB'}
[36m(WorkerDict pid=393228, ip=172.16.0.197)[0m [Rank 0 | Local Rank 0] 2025-12-04 09:23:53,901 INFO [/cache/algo/verl/verl/workers/megatron_workers.py:103] => After rollout offload, memory allocated (GB): 45.11, max memory allocated (GB): 45.11, memory reserved (GB): 46.05, max memory reserved (GB): 46.05, device memory used/total (GB): 2.07/60.96, cpu_memory: {'total': '1511.10 GB', 'used': '620.85 GB', 'free': '835.10 GB', 'shared': '0.11 GB', 'buff/cache': '55.15 GB', 'available': '884.33 GB'}
[36m(WorkerDict pid=393228, ip=172.16.0.197)[0m [Rank 0 | Local Rank 0] 2025-12-04 09:23:56,554 INFO [/cache/algo/verl/verl/workers/megatron_workers.py:103] => After offload actor params and optimizer during load_checkpoint, memory allocated (GB): 45.11, max memory allocated (GB): 45.11, memory reserved (GB): 45.92, max memory reserved (GB): 46.05, device memory used/total (GB): 2.07/60.96, cpu_memory: {'total': '1511.10 GB', 'used': '620.95 GB', 'free': '835.00 GB', 'shared': '0.11 GB', 'buff/cache': '55.15 GB', 'available': '884.23 GB'}
[36m(WorkerDict pid=393228, ip=172.16.0.197)[0m [Rank 0 | Local Rank 0] 2025-12-04 09:24:00,499 INFO [/cache/algo/verl/verl/workers/megatron_workers.py:103] => After load actor params during rollout_mode, memory allocated (GB): 51.96, max memory allocated (GB): 51.96, memory reserved (GB): 52.77, max memory reserved (GB): 52.77, device memory used/total (GB): 8.92/60.96, cpu_memory: {'total': '1511.10 GB', 'used': '620.97 GB', 'free': '834.98 GB', 'shared': '0.11 GB', 'buff/cache': '55.15 GB', 'available': '884.20 GB'}
[36m(WorkerDict pid=393228, ip=172.16.0.197)[0m [Rank 0 | Local Rank 0] 2025-12-04 09:45:21,057 INFO [/cache/algo/verl/verl/workers/megatron_workers.py:103] => Before rollout offload, memory allocated (GB): 45.11, max memory allocated (GB): 54.59, memory reserved (GB): 46.31, max memory reserved (GB): 56.26, device memory used/total (GB): 49.68/60.96, cpu_memory: {'total': '1511.10 GB', 'used': '695.24 GB', 'free': '813.82 GB', 'shared': '0.12 GB', 'buff/cache': '2.04 GB', 'available': '810.98 GB'}
[36m(WorkerDict pid=393228, ip=172.16.0.197)[0m [Rank 0 | Local Rank 0] 2025-12-04 09:45:25,558 INFO [/cache/algo/verl/verl/workers/megatron_workers.py:103] => After rollout offload, memory allocated (GB): 45.11, max memory allocated (GB): 45.11, memory reserved (GB): 46.31, max memory reserved (GB): 46.31, device memory used/total (GB): 3.79/60.96, cpu_memory: {'total': '1511.10 GB', 'used': '695.23 GB', 'free': '813.82 GB', 'shared': '0.12 GB', 'buff/cache': '2.04 GB', 'available': '810.97 GB'}
[36m(WorkerDict pid=393228, ip=172.16.0.197)[0m [Rank 0 | Local Rank 0] 2025-12-04 09:45:27,467 INFO [/cache/algo/verl/verl/workers/megatron_workers.py:89] => compute_log_prob Before compute_log_prob, memory allocated (GB): 45.11, max memory allocated (GB): 45.11, memory reserved (GB): 45.92, max memory reserved (GB): 46.31, device memory used/total (GB): 3.37/60.96, cpu_memory: {'total': '1511.10 GB', 'used': '695.27 GB', 'free': '813.76 GB', 'shared': '0.13 GB', 'buff/cache': '2.06 GB', 'available': '810.92 GB'}
[36m(WorkerDict pid=393228, ip=172.16.0.197)[0m [Rank 0 | Local Rank 0] 2025-12-04 09:45:28,561 INFO [/cache/algo/verl/verl/workers/megatron_workers.py:103] => After load actor params and grad during compute_log_prob, memory allocated (GB): 51.96, max memory allocated (GB): 51.96, memory reserved (GB): 52.77, max memory reserved (GB): 52.77, device memory used/total (GB): 10.23/60.96, cpu_memory: {'total': '1511.10 GB', 'used': '695.36 GB', 'free': '813.67 GB', 'shared': '0.14 GB', 'buff/cache': '2.06 GB', 'available': '810.83 GB'}
[36m(WorkerDict pid=393228, ip=172.16.0.197)[0m [Rank 0 | Local Rank 0] 2025-12-04 09:45:28,563 INFO [/cache/algo/verl/verl/workers/actor/megatron_actor.py:89] => megatron actor Before compute_log_prob, memory allocated (GB): 51.96, max memory allocated (GB): 51.96, memory reserved (GB): 52.77, max memory reserved (GB): 52.77, device memory used/total (GB): 10.23/60.96, cpu_memory: {'total': '1511.10 GB', 'used': '695.36 GB', 'free': '813.67 GB', 'shared': '0.14 GB', 'buff/cache': '2.06 GB', 'available': '810.83 GB'}
[36m(WorkerDict pid=393228, ip=172.16.0.197)[0m [Rank 0 | Local Rank 0] 2025-12-04 09:46:10,222 INFO [/cache/algo/verl/verl/workers/actor/megatron_actor.py:89] => megatron actor After compute_log_prob, memory allocated (GB): 52.00, max memory allocated (GB): 52.23, memory reserved (GB): 52.82, max memory reserved (GB): 53.57, device memory used/total (GB): 10.92/60.96, cpu_memory: {'total': '1511.10 GB', 'used': '697.01 GB', 'free': '811.87 GB', 'shared': '0.14 GB', 'buff/cache': '2.21 GB', 'available': '809.10 GB'}
[36m(WorkerDict pid=393228, ip=172.16.0.197)[0m [Rank 0 | Local Rank 0] 2025-12-04 09:46:15,506 INFO [/cache/algo/verl/verl/workers/megatron_workers.py:103] => After offload actor params and grad during compute_log_prob, memory allocated (GB): 45.14, max memory allocated (GB): 52.00, memory reserved (GB): 52.16, max memory reserved (GB): 52.82, device memory used/total (GB): 10.25/60.96, cpu_memory: {'total': '1511.10 GB', 'used': '775.16 GB', 'free': '733.72 GB', 'shared': '0.14 GB', 'buff/cache': '2.21 GB', 'available': '730.95 GB'}
[36m(WorkerDict pid=393228, ip=172.16.0.197)[0m [Rank 0 | Local Rank 0] 2025-12-04 09:46:16,461 INFO [/cache/algo/verl/verl/workers/megatron_workers.py:89] => compute_log_prob After compute_log_prob, memory allocated (GB): 45.14, max memory allocated (GB): 45.14, memory reserved (GB): 52.16, max memory reserved (GB): 52.16, device memory used/total (GB): 10.25/60.96, cpu_memory: {'total': '1511.10 GB', 'used': '768.99 GB', 'free': '739.88 GB', 'shared': '0.14 GB', 'buff/cache': '2.22 GB', 'available': '737.11 GB'}
[36m(WorkerDict pid=393228, ip=172.16.0.197)[0m [Rank 0 | Local Rank 0] 2025-12-04 09:46:18,670 INFO [/cache/algo/verl/verl/workers/megatron_workers.py:89] => update_actor Before update_actor, memory allocated (GB): 45.14, max memory allocated (GB): 45.14, memory reserved (GB): 52.16, max memory reserved (GB): 52.16, device memory used/total (GB): 10.25/60.96, cpu_memory: {'total': '1511.10 GB', 'used': '769.07 GB', 'free': '739.78 GB', 'shared': '0.17 GB', 'buff/cache': '2.25 GB', 'available': '737.01 GB'}
[36m(WorkerDict pid=393228, ip=172.16.0.197)[0m [Rank 0 | Local Rank 0] 2025-12-04 09:46:19,701 INFO [/cache/algo/verl/verl/workers/megatron_workers.py:103] => After load actor params and grad during update_actor, memory allocated (GB): 65.70, max memory allocated (GB): 65.70, memory reserved (GB): 70.72, max memory reserved (GB): 70.72, device memory used/total (GB): 28.81/60.96, cpu_memory: {'total': '1511.10 GB', 'used': '769.08 GB', 'free': '739.76 GB', 'shared': '0.17 GB', 'buff/cache': '2.25 GB', 'available': '736.99 GB'}
[36m(WorkerDict pid=393228, ip=172.16.0.197)[0m [Rank 0 | Local Rank 0] 2025-12-04 09:46:19,704 INFO [/cache/algo/verl/verl/workers/actor/megatron_actor.py:89] => megatron actor Before update_policy, memory allocated (GB): 65.70, max memory allocated (GB): 65.70, memory reserved (GB): 70.72, max memory reserved (GB): 70.72, device memory used/total (GB): 28.81/60.96, cpu_memory: {'total': '1511.10 GB', 'used': '769.08 GB', 'free': '739.76 GB', 'shared': '0.17 GB', 'buff/cache': '2.25 GB', 'available': '736.99 GB'}
[36m(WorkerDict pid=393228, ip=172.16.0.197)[0m [Rank 0 | Local Rank 0] 2025-12-04 09:48:30,284 INFO [/cache/algo/verl/verl/workers/actor/megatron_actor.py:89] => megatron actor After update_policy, memory allocated (GB): 65.82, max memory allocated (GB): 67.15, memory reserved (GB): 70.78, max memory reserved (GB): 70.79, device memory used/total (GB): 28.90/60.96, cpu_memory: {'total': '1511.10 GB', 'used': '1091.12 GB', 'free': '417.32 GB', 'shared': '0.17 GB', 'buff/cache': '2.66 GB', 'available': '414.75 GB'}
[36m(WorkerDict pid=393228, ip=172.16.0.197)[0m [Rank 0 | Local Rank 0] 2025-12-04 09:48:36,756 INFO [/cache/algo/verl/verl/workers/megatron_workers.py:103] => After offload actor params and grad during update_actor, memory allocated (GB): 45.26, max memory allocated (GB): 65.82, memory reserved (GB): 52.21, max memory reserved (GB): 70.78, device memory used/total (GB): 10.33/60.96, cpu_memory: {'total': '1511.10 GB', 'used': '1145.75 GB', 'free': '362.68 GB', 'shared': '0.17 GB', 'buff/cache': '2.67 GB', 'available': '360.12 GB'}
[36m(WorkerDict pid=393228, ip=172.16.0.197)[0m [Rank 0 | Local Rank 0] 2025-12-04 09:48:37,710 INFO [/cache/algo/verl/verl/workers/megatron_workers.py:89] => update_actor After update_actor, memory allocated (GB): 45.26, max memory allocated (GB): 45.26, memory reserved (GB): 52.21, max memory reserved (GB): 52.21, device memory used/total (GB): 10.33/60.96, cpu_memory: {'total': '1511.10 GB', 'used': '1163.88 GB', 'free': '344.54 GB', 'shared': '0.17 GB', 'buff/cache': '2.67 GB', 'available': '341.98 GB'}
[36m(WorkerDict pid=393228, ip=172.16.0.197)[0m [Rank 0 | Local Rank 0] 2025-12-04 09:49:00,196 INFO [/cache/algo/verl/verl/workers/megatron_workers.py:103] => After load actor params during rollout_mode, memory allocated (GB): 52.11, max memory allocated (GB): 52.11, memory reserved (GB): 58.40, max memory reserved (GB): 58.40, device memory used/total (GB): 16.52/60.96, cpu_memory: {'total': '1511.10 GB', 'used': '1151.45 GB', 'free': '356.98 GB', 'shared': '0.17 GB', 'buff/cache': '2.67 GB', 'available': '354.42 GB'}
[36m(WorkerDict pid=393228, ip=172.16.0.197)[0m [Rank 0 | Local Rank 0] 2025-12-04 10:11:02,421 INFO [/cache/algo/verl/verl/workers/megatron_workers.py:103] => Before rollout offload, memory allocated (GB): 45.26, max memory allocated (GB): 54.75, memory reserved (GB): 52.21, max memory reserved (GB): 58.40, device memory used/total (GB): 56.30/60.96, cpu_memory: {'total': '1511.10 GB', 'used': '1151.36 GB', 'free': '357.56 GB', 'shared': '0.18 GB', 'buff/cache': '2.17 GB', 'available': '354.75 GB'}
[36m(WorkerDict pid=393228, ip=172.16.0.197)[0m [Rank 0 | Local Rank 0] 2025-12-04 10:11:07,549 INFO [/cache/algo/verl/verl/workers/megatron_workers.py:103] => After rollout offload, memory allocated (GB): 45.26, max memory allocated (GB): 45.26, memory reserved (GB): 52.21, max memory reserved (GB): 52.21, device memory used/total (GB): 10.41/60.96, cpu_memory: {'total': '1511.10 GB', 'used': '1151.37 GB', 'free': '357.56 GB', 'shared': '0.18 GB', 'buff/cache': '2.17 GB', 'available': '354.74 GB'}
[36m(WorkerDict pid=393228, ip=172.16.0.197)[0m [Rank 0 | Local Rank 0] 2025-12-04 10:11:09,067 INFO [/cache/algo/verl/verl/workers/megatron_workers.py:89] => compute_log_prob Before compute_log_prob, memory allocated (GB): 45.26, max memory allocated (GB): 45.26, memory reserved (GB): 52.21, max memory reserved (GB): 52.21, device memory used/total (GB): 10.33/60.96, cpu_memory: {'total': '1511.10 GB', 'used': '1151.38 GB', 'free': '357.55 GB', 'shared': '0.18 GB', 'buff/cache': '2.17 GB', 'available': '354.74 GB'}
[36m(WorkerDict pid=393228, ip=172.16.0.197)[0m [Rank 0 | Local Rank 0] 2025-12-04 10:11:10,090 INFO [/cache/algo/verl/verl/workers/megatron_workers.py:103] => After load actor params and grad during compute_log_prob, memory allocated (GB): 52.11, max memory allocated (GB): 52.11, memory reserved (GB): 58.40, max memory reserved (GB): 58.40, device memory used/total (GB): 16.52/60.96, cpu_memory: {'total': '1511.10 GB', 'used': '1151.41 GB', 'free': '357.52 GB', 'shared': '0.18 GB', 'buff/cache': '2.17 GB', 'available': '354.71 GB'}
[36m(WorkerDict pid=393228, ip=172.16.0.197)[0m [Rank 0 | Local Rank 0] 2025-12-04 10:11:10,091 INFO [/cache/algo/verl/verl/workers/actor/megatron_actor.py:89] => megatron actor Before compute_log_prob, memory allocated (GB): 52.11, max memory allocated (GB): 52.11, memory reserved (GB): 58.40, max memory reserved (GB): 58.40, device memory used/total (GB): 16.52/60.96, cpu_memory: {'total': '1511.10 GB', 'used': '1151.41 GB', 'free': '357.52 GB', 'shared': '0.18 GB', 'buff/cache': '2.17 GB', 'available': '354.71 GB'}
[36m(WorkerDict pid=393228, ip=172.16.0.197)[0m [Rank 0 | Local Rank 0] 2025-12-04 10:11:37,072 INFO [/cache/algo/verl/verl/workers/actor/megatron_actor.py:89] => megatron actor After compute_log_prob, memory allocated (GB): 52.00, max memory allocated (GB): 52.26, memory reserved (GB): 58.33, max memory reserved (GB): 58.40, device memory used/total (GB): 16.52/60.96, cpu_memory: {'total': '1511.10 GB', 'used': '1151.42 GB', 'free': '357.27 GB', 'shared': '0.18 GB', 'buff/cache': '2.41 GB', 'available': '354.58 GB'}
[36m(WorkerDict pid=393228, ip=172.16.0.197)[0m [Rank 0 | Local Rank 0] 2025-12-04 10:11:42,514 INFO [/cache/algo/verl/verl/workers/megatron_workers.py:103] => After offload actor params and grad during compute_log_prob, memory allocated (GB): 45.14, max memory allocated (GB): 52.00, memory reserved (GB): 52.14, max memory reserved (GB): 58.33, device memory used/total (GB): 10.33/60.96, cpu_memory: {'total': '1511.10 GB', 'used': '1215.55 GB', 'free': '293.14 GB', 'shared': '0.18 GB', 'buff/cache': '2.41 GB', 'available': '290.45 GB'}
[36m(WorkerDict pid=393228, ip=172.16.0.197)[0m [Rank 0 | Local Rank 0] 2025-12-04 10:11:43,469 INFO [/cache/algo/verl/verl/workers/megatron_workers.py:89] => compute_log_prob After compute_log_prob, memory allocated (GB): 45.14, max memory allocated (GB): 45.14, memory reserved (GB): 52.14, max memory reserved (GB): 52.14, device memory used/total (GB): 10.33/60.96, cpu_memory: {'total': '1511.10 GB', 'used': '1215.55 GB', 'free': '293.14 GB', 'shared': '0.18 GB', 'buff/cache': '2.41 GB', 'available': '290.45 GB'}
[36m(WorkerDict pid=393228, ip=172.16.0.197)[0m [Rank 0 | Local Rank 0] 2025-12-04 10:11:48,021 INFO [/cache/algo/verl/verl/workers/megatron_workers.py:89] => update_actor Before update_actor, memory allocated (GB): 45.14, max memory allocated (GB): 45.14, memory reserved (GB): 52.14, max memory reserved (GB): 52.14, device memory used/total (GB): 10.33/60.96, cpu_memory: {'total': '1511.10 GB', 'used': '1215.76 GB', 'free': '292.93 GB', 'shared': '0.18 GB', 'buff/cache': '2.41 GB', 'available': '290.24 GB'}
[36m(WorkerDict pid=393228, ip=172.16.0.197)[0m [Rank 0 | Local Rank 0] 2025-12-04 10:11:49,494 INFO [/cache/algo/verl/verl/workers/megatron_workers.py:103] => After load actor params and grad during update_actor, memory allocated (GB): 65.70, max memory allocated (GB): 65.70, memory reserved (GB): 70.70, max memory reserved (GB): 70.70, device memory used/total (GB): 28.90/60.96, cpu_memory: {'total': '1511.10 GB', 'used': '1215.84 GB', 'free': '292.85 GB', 'shared': '0.18 GB', 'buff/cache': '2.41 GB', 'available': '290.16 GB'}
[36m(WorkerDict pid=393228, ip=172.16.0.197)[0m [Rank 0 | Local Rank 0] 2025-12-04 10:11:49,496 INFO [/cache/algo/verl/verl/workers/actor/megatron_actor.py:89] => megatron actor Before update_policy, memory allocated (GB): 65.70, max memory allocated (GB): 65.70, memory reserved (GB): 70.70, max memory reserved (GB): 70.70, device memory used/total (GB): 28.90/60.96, cpu_memory: {'total': '1511.10 GB', 'used': '1215.84 GB', 'free': '292.85 GB', 'shared': '0.18 GB', 'buff/cache': '2.41 GB', 'available': '290.16 GB'}
[36m(WorkerDict pid=393228, ip=172.16.0.197)[0m [Rank 0 | Local Rank 0] 2025-12-04 10:14:10,047 INFO [/cache/algo/verl/verl/workers/actor/megatron_actor.py:89] => megatron actor After update_policy, memory allocated (GB): 65.82, max memory allocated (GB): 67.15, memory reserved (GB): 70.75, max memory reserved (GB): 70.76, device memory used/total (GB): 28.90/60.96, cpu_memory: {'total': '1511.10 GB', 'used': '1240.92 GB', 'free': '267.71 GB', 'shared': '0.18 GB', 'buff/cache': '2.47 GB', 'available': '265.05 GB'}
[36m(WorkerDict pid=393228, ip=172.16.0.197)[0m [Rank 0 | Local Rank 0] 2025-12-04 10:14:16,195 INFO [/cache/algo/verl/verl/workers/megatron_workers.py:103] => After offload actor params and grad during update_actor, memory allocated (GB): 45.26, max memory allocated (GB): 65.82, memory reserved (GB): 52.19, max memory reserved (GB): 70.75, device memory used/total (GB): 10.33/60.96, cpu_memory: {'total': '1511.10 GB', 'used': '1216.02 GB', 'free': '292.61 GB', 'shared': '0.18 GB', 'buff/cache': '2.47 GB', 'available': '289.95 GB'}
[36m(WorkerDict pid=393228, ip=172.16.0.197)[0m [Rank 0 | Local Rank 0] 2025-12-04 10:14:17,163 INFO [/cache/algo/verl/verl/workers/megatron_workers.py:89] => update_actor After update_actor, memory allocated (GB): 45.26, max memory allocated (GB): 45.26, memory reserved (GB): 52.19, max memory reserved (GB): 52.19, device memory used/total (GB): 10.33/60.96, cpu_memory: {'total': '1511.10 GB', 'used': '1217.36 GB', 'free': '291.27 GB', 'shared': '0.18 GB', 'buff/cache': '2.47 GB', 'available': '288.61 GB'}
[36m(WorkerDict pid=393228, ip=172.16.0.197)[0m [Rank 0 | Local Rank 0] 2025-12-04 10:14:38,223 INFO [/cache/algo/verl/verl/workers/megatron_workers.py:103] => After load actor params during rollout_mode, memory allocated (GB): 52.11, max memory allocated (GB): 52.11, memory reserved (GB): 58.38, max memory reserved (GB): 58.38, device memory used/total (GB): 16.52/60.96, cpu_memory: {'total': '1511.10 GB', 'used': '1216.05 GB', 'free': '292.58 GB', 'shared': '0.18 GB', 'buff/cache': '2.47 GB', 'available': '289.92 GB'}