mindspeed train error: group_type currently only support -1 and 0, current value is 2
I'm using the verl main branch as of 10/30, with the environment built per Dockerfile.ascend_8.2.rc1_a2, running recipe/dapo/run_dapo_qwen3_moe_30b_megatron_npu.sh. Initialization and rollout both complete fine, but training fails with:
ray.exceptions.RayTaskError(RuntimeError): ray::WorkerDict.actor_rollout_update_actor() (pid=490071, ip=172.16.2.11, actor_id=dee0d43a6f32372ec4ff655e04000000, repr=<verl.single_controller.ray.base.WorkerDict object at 0xffcf8e378eb0>)
File "/cache/ray_temp/session_2025-10-31_17-44-09_992100_1145212/runtime_resources/working_dir_files/_ray_pkg_d85728c4d7bda8f2/verl/single_controller/ray/base.py", line 700, in func
return getattr(self.worker_dict[key], name)(*args, **kwargs)
File "/cache/ray_temp/session_2025-10-31_17-44-09_992100_1145212/runtime_resources/working_dir_files/_ray_pkg_d85728c4d7bda8f2/verl/single_controller/base/decorator.py", line 442, in inner
return func(*args, **kwargs)
File "/cache/ray_temp/session_2025-10-31_17-44-09_992100_1145212/runtime_resources/working_dir_files/_ray_pkg_d85728c4d7bda8f2/verl/utils/transferqueue_utils.py", line 199, in dummy_inner
return func(*args, **kwargs)
File "/cache/ray_temp/session_2025-10-31_17-44-09_992100_1145212/runtime_resources/working_dir_files/_ray_pkg_d85728c4d7bda8f2/verl/utils/profiler/performance.py", line 105, in f
return self.log(decorated_function, *args, **kwargs)
File "/cache/ray_temp/session_2025-10-31_17-44-09_992100_1145212/runtime_resources/working_dir_files/_ray_pkg_d85728c4d7bda8f2/verl/utils/profiler/performance.py", line 118, in log
output = func(*args, **kwargs)
File "/cache/ray_temp/session_2025-10-31_17-44-09_992100_1145212/runtime_resources/working_dir_files/_ray_pkg_d85728c4d7bda8f2/verl/utils/profiler/profile.py", line 256, in wrapper
return func(self_instance, *args, **kwargs_inner)
File "/cache/ray_temp/session_2025-10-31_17-44-09_992100_1145212/runtime_resources/working_dir_files/_ray_pkg_d85728c4d7bda8f2/verl/workers/megatron_workers.py", line 632, in update_actor
metrics = self.actor.update_policy(dataloader=dataloader)
File "/cache/ray_temp/session_2025-10-31_17-44-09_992100_1145212/runtime_resources/working_dir_files/_ray_pkg_d85728c4d7bda8f2/verl/utils/profiler/performance.py", line 105, in f
return self.log(decorated_function, *args, **kwargs)
File "/cache/ray_temp/session_2025-10-31_17-44-09_992100_1145212/runtime_resources/working_dir_files/_ray_pkg_d85728c4d7bda8f2/verl/utils/profiler/performance.py", line 118, in log
output = func(*args, **kwargs)
File "/cache/ray_temp/session_2025-10-31_17-44-09_992100_1145212/runtime_resources/working_dir_files/_ray_pkg_d85728c4d7bda8f2/verl/workers/actor/megatron_actor.py", line 648, in update_policy
metric_micro_batch = self.forward_backward_batch(
File "/cache/ray_temp/session_2025-10-31_17-44-09_992100_1145212/runtime_resources/working_dir_files/_ray_pkg_d85728c4d7bda8f2/verl/workers/actor/megatron_actor.py", line 586, in forward_backward_batch
losses_reduced = forward_backward_func(
File "/cache/algo/Megatron-LM/megatron/core/pipeline_parallel/schedules.py", line 1932, in forward_backward_pipelining_without_interleaving
input_tensor_grad = backward_step(
File "/cache/algo/Megatron-LM/megatron/core/pipeline_parallel/schedules.py", line 395, in backward_step
torch.autograd.backward(output_tensor[0], grad_tensors=output_tensor_grad[0])
File "/cache/verl_env/lib/python3.10/site-packages/torch/autograd/__init__.py", line 353, in backward
_engine_run_backward(
File "/cache/verl_env/lib/python3.10/site-packages/torch/autograd/graph.py", line 824, in _engine_run_backward
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/cache/verl_env/lib/python3.10/site-packages/torch/autograd/function.py", line 307, in apply
return user_fn(self, *args)
File "/cache/algo/Megatron-LM/megatron/core/tensor_parallel/random.py", line 455, in backward
torch.autograd.backward(outputs, args)
File "/cache/verl_env/lib/python3.10/site-packages/torch/autograd/__init__.py", line 353, in backward
_engine_run_backward(
File "/cache/verl_env/lib/python3.10/site-packages/torch/autograd/graph.py", line 824, in _engine_run_backward
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/cache/verl_env/lib/python3.10/site-packages/torch/autograd/function.py", line 307, in apply
return user_fn(self, *args)
File "/cache/algo/MindSpeed/mindspeed/te/pytorch/module/grouped_linear.py", line 60, in backward
grad_weight = torch_npu.npu_grouped_matmul([inp.T], [grad_output], bias=None, group_list=group_list,
File "/cache/verl_env/lib/python3.10/site-packages/torch/_ops.py", line 1158, in __call__
return self._op(*args, **(kwargs or {}))
RuntimeError: group_type currently only support -1 and 0, current value is 2
I also tried upgrading vLLM (vllm 0.11.0 + vllm 0.11.0rc0 + torch 2.7.1) and hit the same error, while training with FSDP works fine.
@Shangwei-Li @wlf-darkmatter could you take a look?
torch_npu was 2.7.1.dev20250724 before; after upgrading to 2.7.1.dev20250919 it works. Many thanks!
With MoE + MindSpeed, the error is raised once execution reaches grouped_linear; that MindSpeed code requires the matching PTA 930 release. CI has the same problem and has to wait for the CI image to pick up the PTA 930 package. If you don't upgrade PTA, you can modify the code following @wlf-darkmatter's approach.
For the concrete change, see https://gitee.com/ascend/MindSpeed/pulls/2791/files (mindspeed/te/pytorch/module/grouped_linear.py).
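For reference, the general shape of such a workaround; this is only a hypothetical sketch (the helper name is made up, and the shapes are assumed from the traceback), not the content of the PR:

```python
# Hypothetical fallback sketch only -- the authoritative fix is the Gitee PR above.
# Assumption (from the traceback): `inp` is [tokens, in_features], `grad_output` is
# [tokens, out_features], and `m_splits` holds the per-expert token counts that
# MindSpeed turns into `group_list`. On a torch_npu build whose npu_grouped_matmul
# rejects group_type=2, the per-expert weight gradient can be computed with plain
# matmuls over the token splits instead of the fused grouped kernel.
import torch

def grouped_weight_grad_fallback(inp: torch.Tensor,
                                 grad_output: torch.Tensor,
                                 m_splits: list[int]) -> list[torch.Tensor]:
    """Return one dW per expert: dW_i = x_i^T @ dy_i over that expert's token slice."""
    grad_weights = []
    for x_i, dy_i in zip(torch.split(inp, m_splits, dim=0),
                         torch.split(grad_output, m_splits, dim=0)):
        grad_weights.append(x_i.transpose(0, 1) @ dy_i)
    return grad_weights
```

This trades the fused grouped kernel for one small matmul per expert, so it is only a stopgap until the matching PTA (torch_npu) 930 package is available.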
With MoE + MindSpeed, the error is raised once execution reaches grouped_linear; that MindSpeed code requires the matching PTA 930 release. CI has the same problem and has to wait for the CI image to pick up the PTA 930 package. If you don't upgrade PTA, you can modify the code following @wlf-darkmatter's approach.
@tardis-key Is PTA pytorch-ascend (torch_npu), or some other library? I did not modify the MindSpeed code. With the torch_npu 2.7.1 release it ran for 4 steps and then errored out, running with export HCCL_OP_EXPANSION_MODE=AIV.
verl log:
File "/cache/algo/Megatron-LM/megatron/core/transformer/moe/moe_layer.py", line 159, in forward
output, mlp_bias = custom_forward(hidden_states)
File "/cache/algo/Megatron-LM/megatron/core/transformer/moe/moe_layer.py", line 146, in custom_forward
expert_output, mlp_bias = self.experts(
File "/cache/verl_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/cache/verl_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
return forward_call(*args, **kwargs)
File "/cache/algo/Megatron-LM/megatron/core/transformer/moe/experts.py", line 760, in forward
intermediate_parallel, bias_parallel = self.linear_fc1(
File "/cache/verl_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/cache/verl_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
return forward_call(*args, **kwargs)
File "/cache/algo/MindSpeed/mindspeed/te/pytorch/module/grouped_linear.py", line 136, in forward
output = MindSpeedTEGroupedLinearGMM.apply(x, m_splits, group_list_type, self.total_weight, *self.total_weight_T)
File "/cache/verl_env/lib/python3.10/site-packages/torch/autograd/function.py", line 575, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
File "/cache/algo/MindSpeed/mindspeed/te/pytorch/module/grouped_linear.py", line 40, in forward
ctx.group_list = torch.tensor(m_split, device='npu', dtype=torch.int64)
File "/cache/verl_env/lib/python3.10/site-packages/torch_npu/contrib/transfer_to_npu.py", line 182, in decorated
return fn(*args, **kwargs)
RuntimeError: The Inner error is reported as above. The process exits for this inner error, and the current working operator name is HcclAlltoAllV.
plog:
[ERROR] RUNTIME(1844773,r_rollout_compute_log_prob):2025-12-02-19:44:45.868.069 [npu_driver.cc:1457]1846391 DevMemAllocHugePageManaged:[drv api] halMemAlloc failed:size=231534592(bytes), type=16, moduleId=3, drvFlag=216172782156923907, drvRetCode=6, device_id=3, ErrCode=207001, desc=[driver error:out of memory], InnerCode=0x7020016
[ERROR] RUNTIME(1844773,r_rollout_compute_log_prob):2025-12-02-19:44:45.946.486 [npu_driver.cc:1579]1846391 DevMemAllocManaged:[drv api] halMemAlloc failed:size=231534592(bytes), type=16, moduleId=3, drvFlag=216172782156792835, drvRetCode=6, device_id=3, ErrCode=207001, desc=[driver error:out of memory], InnerCode=0x7020016
[ERROR] HCCL(1844773,r_rollout_compute_log_prob):2025-12-02-19:44:45.946.584 [adapter_rts.cc:652] [1846391][Malloc][Mem]errNo[0x000000000500000f] rtMalloc failed, return[207001], para: devPtrAddr[(nil)], size[231534592 Byte].
[ERROR] HCCL(1844773,r_rollout_compute_log_prob):2025-12-02-19:44:45.946.607 [mem_device.cc:48] [1846391][DeviceMem][Alloc]rt_malloc error, ret[15], size[231534592 Byte]
[ERROR] HCCL(1844773,r_rollout_compute_log_prob):2025-12-02-19:44:45.946.614 [hccl_communicator.cc:1483] [1846391][AllocOpBaseModeScratchMem]errNo[0x0000000005000002]ptr [algResResponse.scratchMem.ptr()] is nullptr, return HCCL_E_PTR
[ERROR] HCCL(1844773,r_rollout_compute_log_prob):2025-12-02-19:44:45.946.620 [hccl_communicator_host.cc:5668] [1846391][AllocAlgResource]call trace: hcclRet -> 2
[ERROR] HCCL(1844773,r_rollout_compute_log_prob):2025-12-02-19:44:45.946.625 [hccl_communicator_host.cc:4330] [1846391][ExecOpAlltoAll]call trace: hcclRet -> 2
[ERROR] HCCL(1844773,r_rollout_compute_log_prob):2025-12-02-19:44:45.946.672 [hccl_communicator_host.cc:2863] [1846391][AlltoAllVOutPlace]call trace: hcclRet -> 2
[ERROR] HCCL(1844773,r_rollout_compute_log_prob):2025-12-02-19:44:45.946.679 [hccl_comm.cc:278] [1846391][HcclComm][ALLTOALLV_group_name_136]errNo[0x0000000000000002] index[30325]
[ERROR] HCCL(1844773,r_rollout_compute_log_prob):2025-12-02-19:44:45.946.689 [op_base.cc:2702] [1846391][HcclAlltoAllV]call trace: hcclRet -> 2
[ERROR] APP(1844773,r_rollout_compute_log_prob):2025-12-02-19:44:45.946.701 [log_inner.cpp:77]1846391 build/CMakeFiles/torch_npu.dir/compiler_depend.ts:ExecFuncOpApi:452: "[PTA]:"Custom hand fail! name=HcclAlltoAllV, ret=2""
[ERROR] APP(1844773,r_rollout_compute_log_prob):2025-12-02-19:44:45.946.715 [log_inner.cpp:77]1846391 build/CMakeFiles/torch_npu.dir/compiler_depend.ts:ReadQueue:438: "[PTA]:"---Thread---281267782280864: device = 0, write_idx = 321, read_idx = 305, status = 1, ret = 2""
[ERROR] APP(1844773,r_rollout_compute_log_prob):2025-12-02-19:44:45.948.641 [log_inner.cpp:77]1845996 build/CMakeFiles/torch_npu.dir/compiler_depend.ts:MakeSureQueueEmpty:343: "[PTA]:"Inner error happened, detail: the current working operator name is HcclAlltoAllV""
[ERROR] APP(1844773,):2025-12-02-19:44:45.967.649 [log_inner.cpp:77]1846395 build/CMakeFiles/torch_npu.dir/compiler_depend.ts:Enqueue:527: "[PTA]:"Inner error happened with CAN_EXIT status, detail: the current working operator name is HcclAlltoAllV""
[ERROR] APP(1844773,):2025-12-02-19:44:45.969.144 [log_inner.cpp:77]1846395 build/CMakeFiles/torch_npu.dir/compiler_depend.ts:Enqueue:527: "[PTA]:"Inner error happened with CAN_EXIT status, detail: the current working operator name is HcclAlltoAllV""
[ERROR] APP(1844773,):2025-12-02-19:44:45.969.164 [log_inner.cpp:77]1846395 build/CMakeFiles/torch_npu.dir/compiler_depend.ts:Enqueue:527: "[PTA]:"Inner error happened with CAN_EXIT status, detail: the current working operator name is HcclAlltoAllV""
[ERROR] APP(1844773,):2025-12-02-19:44:45.969.174 [log_inner.cpp:77]1846395 build/CMakeFiles/torch_npu.dir/compiler_depend.ts:Enqueue:527: "[PTA]:"Inner error happened with CAN_EXIT status, detail: the current working operator name is HcclAlltoAllV""
[ERROR] APP(1844773,):2025-12-02-19:44:45.969.183 [log_inner.cpp:77]1846395 build/CMakeFiles/torch_npu.dir/compiler_depend.ts:Enqueue:527: "[PTA]:"Inner error happened with CAN_EXIT status, detail: the current working operator name is HcclAlltoAllV""
[ERROR] APP(1844773,):2025-12-02-19:44:45.969.201 [log_inner.cpp:77]1846395 build/CMakeFiles/torch_npu.dir/compiler_depend.ts:Enqueue:527: "[PTA]:"Inner error happened with CAN_EXIT status, detail: the current working operator name is HcclAlltoAllV""
[ERROR] APP(1844773,):2025-12-02-19:44:45.969.209 [log_inner.cpp:77]1846395 build/CMakeFiles/torch_npu.dir/compiler_depend.ts:Enqueue:527: "[PTA]:"Inner error happened with CAN_EXIT status, detail: the current working operator name is HcclAlltoAllV""
[ERROR] APP(1844773,):2025-12-02-19:44:45.969.216 [log_inner.cpp:77]1846395 build/CMakeFiles/torch_npu.dir/compiler_depend.ts:Enqueue:527: "[PTA]:"Inner error happened with CAN_EXIT status, detail: the current working operator name is HcclAlltoAllV""
[ERROR] APP(1844773,):2025-12-02-19:44:45.969.223 [log_inner.cpp:77]1846395 build/CMakeFiles/torch_npu.dir/compiler_depend.ts:Enqueue:527: "[PTA]:"Inner error happened with CAN_EXIT status, detail: the current working operator name is HcclAlltoAllV""
PTA is, simply put, torch_npu. The error log quite clearly shows driver error: out of memory, so you can try lowering the rollout gpu_utilization. Also, the failure is on the MoE AllToAllV, which may be caused by load imbalance; try shortening the sequence length or changing the EP partitioning (a rough illustration of the memory effect is sketched below).
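To make the load-imbalance point concrete, a hypothetical back-of-envelope sketch (all numbers made up, and this is not the actual HCCL scratch-buffer formula): the receive buffer one EP rank needs for an AllToAllV dispatch scales with the tokens routed to it, so a single hot expert rank can need far more memory than in the balanced case.

```python
# Hypothetical illustration only -- not verl or HCCL internals.
def recv_buffer_mib(tokens_routed_to_rank: int, hidden: int, bytes_per_elem: int = 2) -> float:
    """Approximate AllToAllV receive-buffer size on one EP rank, in MiB (bf16 activations)."""
    return tokens_routed_to_rank * hidden * bytes_per_elem / 2**20

total_tokens = 65536  # tokens dispatched by one MoE layer (made-up number)
ep_size = 16          # expert-parallel world size (made-up number)
hidden = 2048         # hidden size per token (made-up number)

print(recv_buffer_mib(total_tokens // ep_size, hidden))  # balanced routing: 16.0 MiB
print(recv_buffer_mib(int(total_tokens * 0.9), hidden))  # one hot rank gets 90%: ~230 MiB
```

Shorter sequences shrink total_tokens, and a different EP split changes how badly a single rank can be overloaded, which is why both suggestions help.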
@wlf-darkmatter Thanks a lot for the reply. The OOM was reported in a different place this time so I didn't notice it; in earlier runs it would surface in the verl log. I have another question about NPU memory: I modified _get_current_mem_info in verl/utils/profiler/performance.py:
def _get_current_mem_info(unit: str = "GB", precision: int = 2) -> tuple[str]:
    """Get current memory usage.

    Note that CPU device memory info is always 0.

    Args:
        unit (str, optional): The unit of memory measurement. Defaults to "GB".
        precision (int, optional): The number of decimal places to round memory values. Defaults to 2.

    Returns:
        tuple[str]: A tuple containing memory allocated, max memory allocated, memory reserved,
            max memory reserved, memory used, and memory total in the specified unit.
    """
    assert unit in ["GB", "MB", "KB"]
    device = get_torch_device()
    # torch.cpu.memory_allocated() does not exist
    if device == torch.cpu:
        return "0.00", "0.00", "0.00", "0.00", "0.00", "0.00"
    divisor = 1024**3 if unit == "GB" else 1024**2 if unit == "MB" else 1024
    mem_allocated = get_torch_device().memory_allocated()
    max_memory_allocated = get_torch_device().max_memory_allocated()
    mem_reserved = get_torch_device().memory_reserved()
    max_memory_reserved = get_torch_device().max_memory_reserved()
    # use get_torch_device().mem_get_info to profile device memory
    # since vllm's sleep mode works below pytorch
    # see https://github.com/vllm-project/vllm/pull/11743#issuecomment-2754338119
    mem_free, mem_total = get_torch_device().mem_get_info()
    mem_used = mem_total - mem_free
    mem_allocated = f"{mem_allocated / divisor:.{precision}f}"
    max_memory_allocated = f"{max_memory_allocated / divisor:.{precision}f}"
    mem_reserved = f"{mem_reserved / divisor:.{precision}f}"
    max_memory_reserved = f"{max_memory_reserved / divisor:.{precision}f}"
    mem_used = f"{mem_used / divisor:.{precision}f}"
    mem_total = f"{mem_total / divisor:.{precision}f}"
    get_torch_device().reset_peak_memory_stats()
    return mem_allocated, max_memory_allocated, mem_reserved, max_memory_reserved, mem_used, mem_total
I added max_memory_reserved and max_memory_allocated, but after upgrading verl and torch_npu the numbers have become inaccurate. For example, this printout:
[/cache/algo/verl/verl/workers/megatron_workers.py:89] => update_actor After update_actor, memory allocated (GB): 40.29, max memory allocated (GB): 57.37, memory reserved (GB): 44.35, max memory reserved (GB): 58.85, device memory used/total (GB): 9.79/60.96
Previously, memory allocated was the device memory torch was currently using, max memory allocated was the peak usage since the previous printout, and memory reserved was roughly equal to device memory used. I used memory allocated / max memory allocated to tune the model partitioning strategy, but now they are off: memory allocated (40 GB) is clearly larger than device memory used. At the "After update_actor" point the actual usage should be the 9.79 GB reported as device memory used, but with the other numbers inaccurate I've effectively lost the max memory allocated metric. Do you know what causes this?
Because vLLM manages the relevant memory allocations itself, torch is unaware of memory that vLLM frees, so the figures printed through torch's APIs are distorted. This has been the case in both the old and the new versions; it is not caused by the upgrade. In extreme cases you can even observe memory allocated exceeding the device's total capacity.
I suggest judging the partitioning from the actual npu-smi info output during training.
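A minimal sketch of that cross-check, assuming verl's get_torch_device helper (the import path below is an assumption): print the allocator's view and the driver's view side by side.

```python
# Sketch: compare the torch allocator's view with the driver-level view that
# should be trusted here. Assumes verl's get_torch_device() helper; the import
# path is an assumption.
from verl.utils.device import get_torch_device

def report_device_memory(tag: str) -> None:
    dev = get_torch_device()
    allocated = dev.memory_allocated() / 2**30  # what the torch allocator thinks it holds
    reserved = dev.memory_reserved() / 2**30    # allocator cache, also torch's view
    free, total = dev.mem_get_info()            # driver view, reflects vLLM's sleep mode
    print(f"{tag}: torch allocated={allocated:.2f} GB, reserved={reserved:.2f} GB, "
          f"device used={(total - free) / 2**30:.2f}/{total / 2**30:.2f} GB")
```

When vLLM has released memory underneath torch, allocated/reserved stay high while the mem_get_info figure drops, which matches the ~40 GB vs. 9.79 GB gap in the printout above.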
Got it, I had missed it; the comments already explain this. So the device memory used/total (GB): 9.79/60.96 figure is the accurate one.
mem_allocated = get_torch_device().memory_allocated()
max_memory_allocated = get_torch_device().max_memory_allocated()
mem_reserved = get_torch_device().memory_reserved()
max_memory_reserved = get_torch_device().max_memory_reserved()
# use get_torch_device().mem_get_info to profile device memory
# since vllm's sleep mode works below pytorch
# see https://github.com/vllm-project/vllm/pull/11743#issuecomment-2754338119
mem_free, mem_total = get_torch_device().mem_get_info()
@wlf-darkmatter I'd like to ask about another issue: some device memory is not released after a MindSpeed training step. In step 1, at "compute_log_prob Before compute_log_prob" the usage is device memory used/total (GB): 3.37/60.96, but after one training step it becomes device memory used/total (GB): 10.33/60.96, i.e. 7 GB more. The 7 GB residual appears somewhere between "megatron actor Before compute_log_prob" and "update_actor After update_actor", while rollout behaves normally: both "Before rollout offload" and "After rollout offload" release about 46 GB. This run hit an out-of-memory error at the third rollout when loading the model and KV cache. I could of course lower gpu_utilization, but I'd like to reserve more memory for the KV cache, so I'd like to understand why this memory is not released.
Below is the log of this worker:
[36m(WorkerDict pid=393228, ip=172.16.0.197)[0m [Rank 0 | Local Rank 0] 2025-12-04 09:09:31,091 INFO [/cache/algo/verl/verl/workers/megatron_workers.py:103] => Before init actor model and optimizer, memory allocated (GB): 0.00, max memory allocated (GB): 0.00, memory reserved (GB): 0.00, max memory reserved (GB): 0.00, device memory used/total (GB): 0.34/60.96, cpu_memory: {'total': '1511.10 GB', 'used': '21.24 GB', 'free': '1486.90 GB', 'shared': '0.12 GB', 'buff/cache': '2.96 GB', 'available': '1484.52 GB'}
[36m(WorkerDict pid=393228, ip=172.16.0.197)[0m [Rank 0 | Local Rank 0] 2025-12-04 09:21:15,909 INFO [/cache/algo/verl/verl/workers/megatron_workers.py:103] => After MegatronPPOActor init, memory allocated (GB): 20.56, max memory allocated (GB): 26.75, memory reserved (GB): 27.86, max memory reserved (GB): 27.86, device memory used/total (GB): 29.17/60.96, cpu_memory: {'total': '1511.10 GB', 'used': '23.70 GB', 'free': '1433.17 GB', 'shared': '0.11 GB', 'buff/cache': '54.23 GB', 'available': '1481.57 GB'}
[36m(WorkerDict pid=393228, ip=172.16.0.197)[0m [Rank 0 | Local Rank 0] 2025-12-04 09:21:47,565 INFO [/cache/algo/verl/verl/workers/megatron_workers.py:103] => After actor optimizer init, memory allocated (GB): 20.56, max memory allocated (GB): 20.85, memory reserved (GB): 27.86, max memory reserved (GB): 27.86, device memory used/total (GB): 29.17/60.96, cpu_memory: {'total': '1511.10 GB', 'used': '348.96 GB', 'free': '1107.64 GB', 'shared': '0.11 GB', 'buff/cache': '54.50 GB', 'available': '1156.23 GB'}
[36m(WorkerDict pid=393228, ip=172.16.0.197)[0m [Rank 0 | Local Rank 0] 2025-12-04 09:21:55,492 INFO [/cache/algo/verl/verl/workers/megatron_workers.py:103] => After offload actor params and grad during init, memory allocated (GB): 0.00, max memory allocated (GB): 20.56, memory reserved (GB): 0.00, max memory reserved (GB): 27.86, device memory used/total (GB): 1.43/60.96, cpu_memory: {'total': '1511.10 GB', 'used': '316.87 GB', 'free': '1139.70 GB', 'shared': '0.11 GB', 'buff/cache': '54.53 GB', 'available': '1188.32 GB'}
[36m(WorkerDict pid=393228, ip=172.16.0.197)[0m [Rank 0 | Local Rank 0] 2025-12-04 09:21:55,626 INFO [/cache/algo/verl/verl/workers/megatron_workers.py:103] => After MegatronPPOActor init, memory allocated (GB): 0.00, max memory allocated (GB): 0.00, memory reserved (GB): 0.00, max memory reserved (GB): 0.00, device memory used/total (GB): 1.43/60.96, cpu_memory: {'total': '1511.10 GB', 'used': '316.88 GB', 'free': '1139.69 GB', 'shared': '0.11 GB', 'buff/cache': '54.53 GB', 'available': '1188.31 GB'}
[36m(WorkerDict pid=393228, ip=172.16.0.197)[0m [Rank 0 | Local Rank 0] 2025-12-04 09:21:57,033 INFO [/cache/algo/verl/verl/workers/megatron_workers.py:103] => Before building vllm rollout, memory allocated (GB): 0.00, max memory allocated (GB): 0.00, memory reserved (GB): 0.00, max memory reserved (GB): 0.00, device memory used/total (GB): 1.43/60.96, cpu_memory: {'total': '1511.10 GB', 'used': '316.85 GB', 'free': '1139.71 GB', 'shared': '0.11 GB', 'buff/cache': '54.54 GB', 'available': '1188.34 GB'}
[36m(WorkerDict pid=393228, ip=172.16.0.197)[0m [Rank 0 | Local Rank 0] 2025-12-04 09:21:59,817 INFO [/cache/algo/verl/verl/workers/megatron_workers.py:103] => After building vllm rollout, memory allocated (GB): 0.00, max memory allocated (GB): 0.00, memory reserved (GB): 0.00, max memory reserved (GB): 0.00, device memory used/total (GB): 1.43/60.96, cpu_memory: {'total': '1511.10 GB', 'used': '316.85 GB', 'free': '1139.71 GB', 'shared': '0.11 GB', 'buff/cache': '54.54 GB', 'available': '1188.34 GB'}
[36m(WorkerDict pid=393228, ip=172.16.0.197)[0m [Rank 0 | Local Rank 0] 2025-12-04 09:21:59,819 INFO [/cache/algo/verl/verl/workers/megatron_workers.py:103] => After rollout init, memory allocated (GB): 0.00, max memory allocated (GB): 0.00, memory reserved (GB): 0.00, max memory reserved (GB): 0.00, device memory used/total (GB): 1.43/60.96, cpu_memory: {'total': '1511.10 GB', 'used': '316.85 GB', 'free': '1139.71 GB', 'shared': '0.11 GB', 'buff/cache': '54.54 GB', 'available': '1188.34 GB'}
[36m(WorkerDict pid=393228, ip=172.16.0.197)[0m [Rank 0 | Local Rank 0] 2025-12-04 09:21:59,844 INFO [/cache/algo/verl/verl/workers/megatron_workers.py:103] => After init_model finish, memory allocated (GB): 0.00, max memory allocated (GB): 0.00, memory reserved (GB): 0.00, max memory reserved (GB): 0.00, device memory used/total (GB): 1.43/60.96, cpu_memory: {'total': '1511.10 GB', 'used': '316.85 GB', 'free': '1139.71 GB', 'shared': '0.11 GB', 'buff/cache': '54.54 GB', 'available': '1188.34 GB'}
[36m(WorkerDict pid=393228, ip=172.16.0.197)[0m [Rank 0 | Local Rank 0] 2025-12-04 09:23:48,943 INFO [/cache/algo/verl/verl/workers/megatron_workers.py:103] => Before rollout offload, memory allocated (GB): 45.11, max memory allocated (GB): 45.20, memory reserved (GB): 46.05, max memory reserved (GB): 46.05, device memory used/total (GB): 47.96/60.96, cpu_memory: {'total': '1511.10 GB', 'used': '320.34 GB', 'free': '1135.61 GB', 'shared': '0.11 GB', 'buff/cache': '55.15 GB', 'available': '1184.83 GB'}
[36m(WorkerDict pid=393228, ip=172.16.0.197)[0m [Rank 0 | Local Rank 0] 2025-12-04 09:23:53,901 INFO [/cache/algo/verl/verl/workers/megatron_workers.py:103] => After rollout offload, memory allocated (GB): 45.11, max memory allocated (GB): 45.11, memory reserved (GB): 46.05, max memory reserved (GB): 46.05, device memory used/total (GB): 2.07/60.96, cpu_memory: {'total': '1511.10 GB', 'used': '620.85 GB', 'free': '835.10 GB', 'shared': '0.11 GB', 'buff/cache': '55.15 GB', 'available': '884.33 GB'}
[36m(WorkerDict pid=393228, ip=172.16.0.197)[0m [Rank 0 | Local Rank 0] 2025-12-04 09:23:56,554 INFO [/cache/algo/verl/verl/workers/megatron_workers.py:103] => After offload actor params and optimizer during load_checkpoint, memory allocated (GB): 45.11, max memory allocated (GB): 45.11, memory reserved (GB): 45.92, max memory reserved (GB): 46.05, device memory used/total (GB): 2.07/60.96, cpu_memory: {'total': '1511.10 GB', 'used': '620.95 GB', 'free': '835.00 GB', 'shared': '0.11 GB', 'buff/cache': '55.15 GB', 'available': '884.23 GB'}
[36m(WorkerDict pid=393228, ip=172.16.0.197)[0m [Rank 0 | Local Rank 0] 2025-12-04 09:24:00,499 INFO [/cache/algo/verl/verl/workers/megatron_workers.py:103] => After load actor params during rollout_mode, memory allocated (GB): 51.96, max memory allocated (GB): 51.96, memory reserved (GB): 52.77, max memory reserved (GB): 52.77, device memory used/total (GB): 8.92/60.96, cpu_memory: {'total': '1511.10 GB', 'used': '620.97 GB', 'free': '834.98 GB', 'shared': '0.11 GB', 'buff/cache': '55.15 GB', 'available': '884.20 GB'}
[36m(WorkerDict pid=393228, ip=172.16.0.197)[0m [Rank 0 | Local Rank 0] 2025-12-04 09:45:21,057 INFO [/cache/algo/verl/verl/workers/megatron_workers.py:103] => Before rollout offload, memory allocated (GB): 45.11, max memory allocated (GB): 54.59, memory reserved (GB): 46.31, max memory reserved (GB): 56.26, device memory used/total (GB): 49.68/60.96, cpu_memory: {'total': '1511.10 GB', 'used': '695.24 GB', 'free': '813.82 GB', 'shared': '0.12 GB', 'buff/cache': '2.04 GB', 'available': '810.98 GB'}
[36m(WorkerDict pid=393228, ip=172.16.0.197)[0m [Rank 0 | Local Rank 0] 2025-12-04 09:45:25,558 INFO [/cache/algo/verl/verl/workers/megatron_workers.py:103] => After rollout offload, memory allocated (GB): 45.11, max memory allocated (GB): 45.11, memory reserved (GB): 46.31, max memory reserved (GB): 46.31, device memory used/total (GB): 3.79/60.96, cpu_memory: {'total': '1511.10 GB', 'used': '695.23 GB', 'free': '813.82 GB', 'shared': '0.12 GB', 'buff/cache': '2.04 GB', 'available': '810.97 GB'}
[36m(WorkerDict pid=393228, ip=172.16.0.197)[0m [Rank 0 | Local Rank 0] 2025-12-04 09:45:27,467 INFO [/cache/algo/verl/verl/workers/megatron_workers.py:89] => compute_log_prob Before compute_log_prob, memory allocated (GB): 45.11, max memory allocated (GB): 45.11, memory reserved (GB): 45.92, max memory reserved (GB): 46.31, device memory used/total (GB): 3.37/60.96, cpu_memory: {'total': '1511.10 GB', 'used': '695.27 GB', 'free': '813.76 GB', 'shared': '0.13 GB', 'buff/cache': '2.06 GB', 'available': '810.92 GB'}
[36m(WorkerDict pid=393228, ip=172.16.0.197)[0m [Rank 0 | Local Rank 0] 2025-12-04 09:45:28,561 INFO [/cache/algo/verl/verl/workers/megatron_workers.py:103] => After load actor params and grad during compute_log_prob, memory allocated (GB): 51.96, max memory allocated (GB): 51.96, memory reserved (GB): 52.77, max memory reserved (GB): 52.77, device memory used/total (GB): 10.23/60.96, cpu_memory: {'total': '1511.10 GB', 'used': '695.36 GB', 'free': '813.67 GB', 'shared': '0.14 GB', 'buff/cache': '2.06 GB', 'available': '810.83 GB'}
[36m(WorkerDict pid=393228, ip=172.16.0.197)[0m [Rank 0 | Local Rank 0] 2025-12-04 09:45:28,563 INFO [/cache/algo/verl/verl/workers/actor/megatron_actor.py:89] => megatron actor Before compute_log_prob, memory allocated (GB): 51.96, max memory allocated (GB): 51.96, memory reserved (GB): 52.77, max memory reserved (GB): 52.77, device memory used/total (GB): 10.23/60.96, cpu_memory: {'total': '1511.10 GB', 'used': '695.36 GB', 'free': '813.67 GB', 'shared': '0.14 GB', 'buff/cache': '2.06 GB', 'available': '810.83 GB'}
[36m(WorkerDict pid=393228, ip=172.16.0.197)[0m [Rank 0 | Local Rank 0] 2025-12-04 09:46:10,222 INFO [/cache/algo/verl/verl/workers/actor/megatron_actor.py:89] => megatron actor After compute_log_prob, memory allocated (GB): 52.00, max memory allocated (GB): 52.23, memory reserved (GB): 52.82, max memory reserved (GB): 53.57, device memory used/total (GB): 10.92/60.96, cpu_memory: {'total': '1511.10 GB', 'used': '697.01 GB', 'free': '811.87 GB', 'shared': '0.14 GB', 'buff/cache': '2.21 GB', 'available': '809.10 GB'}
[36m(WorkerDict pid=393228, ip=172.16.0.197)[0m [Rank 0 | Local Rank 0] 2025-12-04 09:46:15,506 INFO [/cache/algo/verl/verl/workers/megatron_workers.py:103] => After offload actor params and grad during compute_log_prob, memory allocated (GB): 45.14, max memory allocated (GB): 52.00, memory reserved (GB): 52.16, max memory reserved (GB): 52.82, device memory used/total (GB): 10.25/60.96, cpu_memory: {'total': '1511.10 GB', 'used': '775.16 GB', 'free': '733.72 GB', 'shared': '0.14 GB', 'buff/cache': '2.21 GB', 'available': '730.95 GB'}
[36m(WorkerDict pid=393228, ip=172.16.0.197)[0m [Rank 0 | Local Rank 0] 2025-12-04 09:46:16,461 INFO [/cache/algo/verl/verl/workers/megatron_workers.py:89] => compute_log_prob After compute_log_prob, memory allocated (GB): 45.14, max memory allocated (GB): 45.14, memory reserved (GB): 52.16, max memory reserved (GB): 52.16, device memory used/total (GB): 10.25/60.96, cpu_memory: {'total': '1511.10 GB', 'used': '768.99 GB', 'free': '739.88 GB', 'shared': '0.14 GB', 'buff/cache': '2.22 GB', 'available': '737.11 GB'}
[36m(WorkerDict pid=393228, ip=172.16.0.197)[0m [Rank 0 | Local Rank 0] 2025-12-04 09:46:18,670 INFO [/cache/algo/verl/verl/workers/megatron_workers.py:89] => update_actor Before update_actor, memory allocated (GB): 45.14, max memory allocated (GB): 45.14, memory reserved (GB): 52.16, max memory reserved (GB): 52.16, device memory used/total (GB): 10.25/60.96, cpu_memory: {'total': '1511.10 GB', 'used': '769.07 GB', 'free': '739.78 GB', 'shared': '0.17 GB', 'buff/cache': '2.25 GB', 'available': '737.01 GB'}
[36m(WorkerDict pid=393228, ip=172.16.0.197)[0m [Rank 0 | Local Rank 0] 2025-12-04 09:46:19,701 INFO [/cache/algo/verl/verl/workers/megatron_workers.py:103] => After load actor params and grad during update_actor, memory allocated (GB): 65.70, max memory allocated (GB): 65.70, memory reserved (GB): 70.72, max memory reserved (GB): 70.72, device memory used/total (GB): 28.81/60.96, cpu_memory: {'total': '1511.10 GB', 'used': '769.08 GB', 'free': '739.76 GB', 'shared': '0.17 GB', 'buff/cache': '2.25 GB', 'available': '736.99 GB'}
[36m(WorkerDict pid=393228, ip=172.16.0.197)[0m [Rank 0 | Local Rank 0] 2025-12-04 09:46:19,704 INFO [/cache/algo/verl/verl/workers/actor/megatron_actor.py:89] => megatron actor Before update_policy, memory allocated (GB): 65.70, max memory allocated (GB): 65.70, memory reserved (GB): 70.72, max memory reserved (GB): 70.72, device memory used/total (GB): 28.81/60.96, cpu_memory: {'total': '1511.10 GB', 'used': '769.08 GB', 'free': '739.76 GB', 'shared': '0.17 GB', 'buff/cache': '2.25 GB', 'available': '736.99 GB'}
[36m(WorkerDict pid=393228, ip=172.16.0.197)[0m [Rank 0 | Local Rank 0] 2025-12-04 09:48:30,284 INFO [/cache/algo/verl/verl/workers/actor/megatron_actor.py:89] => megatron actor After update_policy, memory allocated (GB): 65.82, max memory allocated (GB): 67.15, memory reserved (GB): 70.78, max memory reserved (GB): 70.79, device memory used/total (GB): 28.90/60.96, cpu_memory: {'total': '1511.10 GB', 'used': '1091.12 GB', 'free': '417.32 GB', 'shared': '0.17 GB', 'buff/cache': '2.66 GB', 'available': '414.75 GB'}
[36m(WorkerDict pid=393228, ip=172.16.0.197)[0m [Rank 0 | Local Rank 0] 2025-12-04 09:48:36,756 INFO [/cache/algo/verl/verl/workers/megatron_workers.py:103] => After offload actor params and grad during update_actor, memory allocated (GB): 45.26, max memory allocated (GB): 65.82, memory reserved (GB): 52.21, max memory reserved (GB): 70.78, device memory used/total (GB): 10.33/60.96, cpu_memory: {'total': '1511.10 GB', 'used': '1145.75 GB', 'free': '362.68 GB', 'shared': '0.17 GB', 'buff/cache': '2.67 GB', 'available': '360.12 GB'}
[36m(WorkerDict pid=393228, ip=172.16.0.197)[0m [Rank 0 | Local Rank 0] 2025-12-04 09:48:37,710 INFO [/cache/algo/verl/verl/workers/megatron_workers.py:89] => update_actor After update_actor, memory allocated (GB): 45.26, max memory allocated (GB): 45.26, memory reserved (GB): 52.21, max memory reserved (GB): 52.21, device memory used/total (GB): 10.33/60.96, cpu_memory: {'total': '1511.10 GB', 'used': '1163.88 GB', 'free': '344.54 GB', 'shared': '0.17 GB', 'buff/cache': '2.67 GB', 'available': '341.98 GB'}
[36m(WorkerDict pid=393228, ip=172.16.0.197)[0m [Rank 0 | Local Rank 0] 2025-12-04 09:49:00,196 INFO [/cache/algo/verl/verl/workers/megatron_workers.py:103] => After load actor params during rollout_mode, memory allocated (GB): 52.11, max memory allocated (GB): 52.11, memory reserved (GB): 58.40, max memory reserved (GB): 58.40, device memory used/total (GB): 16.52/60.96, cpu_memory: {'total': '1511.10 GB', 'used': '1151.45 GB', 'free': '356.98 GB', 'shared': '0.17 GB', 'buff/cache': '2.67 GB', 'available': '354.42 GB'}
[36m(WorkerDict pid=393228, ip=172.16.0.197)[0m [Rank 0 | Local Rank 0] 2025-12-04 10:11:02,421 INFO [/cache/algo/verl/verl/workers/megatron_workers.py:103] => Before rollout offload, memory allocated (GB): 45.26, max memory allocated (GB): 54.75, memory reserved (GB): 52.21, max memory reserved (GB): 58.40, device memory used/total (GB): 56.30/60.96, cpu_memory: {'total': '1511.10 GB', 'used': '1151.36 GB', 'free': '357.56 GB', 'shared': '0.18 GB', 'buff/cache': '2.17 GB', 'available': '354.75 GB'}
[36m(WorkerDict pid=393228, ip=172.16.0.197)[0m [Rank 0 | Local Rank 0] 2025-12-04 10:11:07,549 INFO [/cache/algo/verl/verl/workers/megatron_workers.py:103] => After rollout offload, memory allocated (GB): 45.26, max memory allocated (GB): 45.26, memory reserved (GB): 52.21, max memory reserved (GB): 52.21, device memory used/total (GB): 10.41/60.96, cpu_memory: {'total': '1511.10 GB', 'used': '1151.37 GB', 'free': '357.56 GB', 'shared': '0.18 GB', 'buff/cache': '2.17 GB', 'available': '354.74 GB'}
[36m(WorkerDict pid=393228, ip=172.16.0.197)[0m [Rank 0 | Local Rank 0] 2025-12-04 10:11:09,067 INFO [/cache/algo/verl/verl/workers/megatron_workers.py:89] => compute_log_prob Before compute_log_prob, memory allocated (GB): 45.26, max memory allocated (GB): 45.26, memory reserved (GB): 52.21, max memory reserved (GB): 52.21, device memory used/total (GB): 10.33/60.96, cpu_memory: {'total': '1511.10 GB', 'used': '1151.38 GB', 'free': '357.55 GB', 'shared': '0.18 GB', 'buff/cache': '2.17 GB', 'available': '354.74 GB'}
[36m(WorkerDict pid=393228, ip=172.16.0.197)[0m [Rank 0 | Local Rank 0] 2025-12-04 10:11:10,090 INFO [/cache/algo/verl/verl/workers/megatron_workers.py:103] => After load actor params and grad during compute_log_prob, memory allocated (GB): 52.11, max memory allocated (GB): 52.11, memory reserved (GB): 58.40, max memory reserved (GB): 58.40, device memory used/total (GB): 16.52/60.96, cpu_memory: {'total': '1511.10 GB', 'used': '1151.41 GB', 'free': '357.52 GB', 'shared': '0.18 GB', 'buff/cache': '2.17 GB', 'available': '354.71 GB'}
[36m(WorkerDict pid=393228, ip=172.16.0.197)[0m [Rank 0 | Local Rank 0] 2025-12-04 10:11:10,091 INFO [/cache/algo/verl/verl/workers/actor/megatron_actor.py:89] => megatron actor Before compute_log_prob, memory allocated (GB): 52.11, max memory allocated (GB): 52.11, memory reserved (GB): 58.40, max memory reserved (GB): 58.40, device memory used/total (GB): 16.52/60.96, cpu_memory: {'total': '1511.10 GB', 'used': '1151.41 GB', 'free': '357.52 GB', 'shared': '0.18 GB', 'buff/cache': '2.17 GB', 'available': '354.71 GB'}
[36m(WorkerDict pid=393228, ip=172.16.0.197)[0m [Rank 0 | Local Rank 0] 2025-12-04 10:11:37,072 INFO [/cache/algo/verl/verl/workers/actor/megatron_actor.py:89] => megatron actor After compute_log_prob, memory allocated (GB): 52.00, max memory allocated (GB): 52.26, memory reserved (GB): 58.33, max memory reserved (GB): 58.40, device memory used/total (GB): 16.52/60.96, cpu_memory: {'total': '1511.10 GB', 'used': '1151.42 GB', 'free': '357.27 GB', 'shared': '0.18 GB', 'buff/cache': '2.41 GB', 'available': '354.58 GB'}
[36m(WorkerDict pid=393228, ip=172.16.0.197)[0m [Rank 0 | Local Rank 0] 2025-12-04 10:11:42,514 INFO [/cache/algo/verl/verl/workers/megatron_workers.py:103] => After offload actor params and grad during compute_log_prob, memory allocated (GB): 45.14, max memory allocated (GB): 52.00, memory reserved (GB): 52.14, max memory reserved (GB): 58.33, device memory used/total (GB): 10.33/60.96, cpu_memory: {'total': '1511.10 GB', 'used': '1215.55 GB', 'free': '293.14 GB', 'shared': '0.18 GB', 'buff/cache': '2.41 GB', 'available': '290.45 GB'}
[36m(WorkerDict pid=393228, ip=172.16.0.197)[0m [Rank 0 | Local Rank 0] 2025-12-04 10:11:43,469 INFO [/cache/algo/verl/verl/workers/megatron_workers.py:89] => compute_log_prob After compute_log_prob, memory allocated (GB): 45.14, max memory allocated (GB): 45.14, memory reserved (GB): 52.14, max memory reserved (GB): 52.14, device memory used/total (GB): 10.33/60.96, cpu_memory: {'total': '1511.10 GB', 'used': '1215.55 GB', 'free': '293.14 GB', 'shared': '0.18 GB', 'buff/cache': '2.41 GB', 'available': '290.45 GB'}
[36m(WorkerDict pid=393228, ip=172.16.0.197)[0m [Rank 0 | Local Rank 0] 2025-12-04 10:11:48,021 INFO [/cache/algo/verl/verl/workers/megatron_workers.py:89] => update_actor Before update_actor, memory allocated (GB): 45.14, max memory allocated (GB): 45.14, memory reserved (GB): 52.14, max memory reserved (GB): 52.14, device memory used/total (GB): 10.33/60.96, cpu_memory: {'total': '1511.10 GB', 'used': '1215.76 GB', 'free': '292.93 GB', 'shared': '0.18 GB', 'buff/cache': '2.41 GB', 'available': '290.24 GB'}
[36m(WorkerDict pid=393228, ip=172.16.0.197)[0m [Rank 0 | Local Rank 0] 2025-12-04 10:11:49,494 INFO [/cache/algo/verl/verl/workers/megatron_workers.py:103] => After load actor params and grad during update_actor, memory allocated (GB): 65.70, max memory allocated (GB): 65.70, memory reserved (GB): 70.70, max memory reserved (GB): 70.70, device memory used/total (GB): 28.90/60.96, cpu_memory: {'total': '1511.10 GB', 'used': '1215.84 GB', 'free': '292.85 GB', 'shared': '0.18 GB', 'buff/cache': '2.41 GB', 'available': '290.16 GB'}
[36m(WorkerDict pid=393228, ip=172.16.0.197)[0m [Rank 0 | Local Rank 0] 2025-12-04 10:11:49,496 INFO [/cache/algo/verl/verl/workers/actor/megatron_actor.py:89] => megatron actor Before update_policy, memory allocated (GB): 65.70, max memory allocated (GB): 65.70, memory reserved (GB): 70.70, max memory reserved (GB): 70.70, device memory used/total (GB): 28.90/60.96, cpu_memory: {'total': '1511.10 GB', 'used': '1215.84 GB', 'free': '292.85 GB', 'shared': '0.18 GB', 'buff/cache': '2.41 GB', 'available': '290.16 GB'}
[36m(WorkerDict pid=393228, ip=172.16.0.197)[0m [Rank 0 | Local Rank 0] 2025-12-04 10:14:10,047 INFO [/cache/algo/verl/verl/workers/actor/megatron_actor.py:89] => megatron actor After update_policy, memory allocated (GB): 65.82, max memory allocated (GB): 67.15, memory reserved (GB): 70.75, max memory reserved (GB): 70.76, device memory used/total (GB): 28.90/60.96, cpu_memory: {'total': '1511.10 GB', 'used': '1240.92 GB', 'free': '267.71 GB', 'shared': '0.18 GB', 'buff/cache': '2.47 GB', 'available': '265.05 GB'}
[36m(WorkerDict pid=393228, ip=172.16.0.197)[0m [Rank 0 | Local Rank 0] 2025-12-04 10:14:16,195 INFO [/cache/algo/verl/verl/workers/megatron_workers.py:103] => After offload actor params and grad during update_actor, memory allocated (GB): 45.26, max memory allocated (GB): 65.82, memory reserved (GB): 52.19, max memory reserved (GB): 70.75, device memory used/total (GB): 10.33/60.96, cpu_memory: {'total': '1511.10 GB', 'used': '1216.02 GB', 'free': '292.61 GB', 'shared': '0.18 GB', 'buff/cache': '2.47 GB', 'available': '289.95 GB'}
[36m(WorkerDict pid=393228, ip=172.16.0.197)[0m [Rank 0 | Local Rank 0] 2025-12-04 10:14:17,163 INFO [/cache/algo/verl/verl/workers/megatron_workers.py:89] => update_actor After update_actor, memory allocated (GB): 45.26, max memory allocated (GB): 45.26, memory reserved (GB): 52.19, max memory reserved (GB): 52.19, device memory used/total (GB): 10.33/60.96, cpu_memory: {'total': '1511.10 GB', 'used': '1217.36 GB', 'free': '291.27 GB', 'shared': '0.18 GB', 'buff/cache': '2.47 GB', 'available': '288.61 GB'}
[36m(WorkerDict pid=393228, ip=172.16.0.197)[0m [Rank 0 | Local Rank 0] 2025-12-04 10:14:38,223 INFO [/cache/algo/verl/verl/workers/megatron_workers.py:103] => After load actor params during rollout_mode, memory allocated (GB): 52.11, max memory allocated (GB): 52.11, memory reserved (GB): 58.38, max memory reserved (GB): 58.38, device memory used/total (GB): 16.52/60.96, cpu_memory: {'total': '1511.10 GB', 'used': '1216.05 GB', 'free': '292.58 GB', 'shared': '0.18 GB', 'buff/cache': '2.47 GB', 'available': '289.92 GB'}