glowwormX
@fzp0424 Hello, has this issue been resolved?
@Shangwei-Li @wlf-darkmatter Could you help take a look at this?
torch_npu was previously 2.7.1.dev20250724; after upgrading to 2.7.1.dev20250919 it works now. Many thanks.
> With moe + MindSpeed, the run fails at grouped_linear; the matching MindSpeed needs the 930 PTA package. The same problem exists on CI, and we have to wait for CI to update to the PTA 930 package. If you don't want to upgrade PTA, you can follow [@wlf-darkmatter](https://github.com/wlf-darkmatter)'s approach and modify the code yourself.

@tardis-key Does "pta" here mean pytorch-ascend, or some other library? I did not modify the MindSpeed code. With the torch_npu 2.7.1 release build, the run errored after 4 steps, running with `export HCCL_OP_EXPANSION_MODE=AIV`. verl log:

```
File "/cache/algo/Megatron-LM/megatron/core/transformer/moe/moe_layer.py", line 159, in forward
    output, mlp_bias = custom_forward(hidden_states)
File "/cache/algo/Megatron-LM/megatron/core/transformer/moe/moe_layer.py", line 146,...
```
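For reference, a minimal sketch (assuming `torch_npu` is importable in the training environment and exposes `__version__` as usual) to confirm which torch / torch_npu (PTA) build is actually loaded at runtime, and whether `HCCL_OP_EXPANSION_MODE` is set:

```python
# Minimal sketch: print the torch / torch_npu (PTA) builds actually loaded at runtime.
# Assumes torch_npu is installed; the env var check mirrors the `export` used above.
import os

import torch
import torch_npu  # Ascend PyTorch adapter ("PTA")

print("torch version:    ", torch.__version__)
print("torch_npu version:", torch_npu.__version__)
print("HCCL_OP_EXPANSION_MODE =", os.environ.get("HCCL_OP_EXPANSION_MODE"))
```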
@wlf-darkmatter Thanks a lot for the reply. The OOM error is now reported in a different place and I missed it; previously it would show up in the verl log. I also have a question about NPU memory I'd like to ask. I modified `_get_current_mem_info` in verl/utils/profiler/performance.py:

```python
def _get_current_mem_info(unit: str = "GB", precision: int = 2) -> tuple[str]:
    """Get current memory usage. Note that CPU device memory info is always 0....
```
Okay, I missed that; the comments explain it. So the `device memory used/total (GB): 9.79/60.96` metric is accurate, then.

```
mem_allocated = get_torch_device().memory_allocated()
max_memory_allocated = get_torch_device().max_memory_allocated()
mem_reserved = get_torch_device().memory_reserved()
max_memory_reserved = get_torch_device().max_memory_reserved()

# use get_torch_device().mem_get_info to profile device memory
# since vllm's sleep...
```
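To make the distinction concrete, here is a minimal sketch (assuming verl's `get_torch_device()` helper from `verl.utils.device`, which resolves to `torch.cuda` on GPU or `torch.npu` on Ascend; the function below is hypothetical): the allocator counters only see this process's caching allocator, while `mem_get_info()` reports the whole device, which is what the `device memory used/total` line is based on.

```python
# Sketch only: contrast per-process allocator stats with the device-wide view.
from verl.utils.device import get_torch_device  # assumed helper from verl

GB = 1024**3


def device_mem_report(precision: int = 2) -> str:
    device = get_torch_device()

    allocated = device.memory_allocated() / GB  # bytes held by live tensors in this process
    reserved = device.memory_reserved() / GB    # this process's caching-allocator pool
    free_b, total_b = device.mem_get_info()     # device-wide free/total, includes other processes
    used = (total_b - free_b) / GB

    return (
        f"allocated: {allocated:.{precision}f} GB, "
        f"reserved: {reserved:.{precision}f} GB, "
        f"device used/total: {used:.{precision}f}/{total_b / GB:.{precision}f} GB"
    )


if __name__ == "__main__":
    print(device_mem_report())
```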
@wlf-darkmatter I'd like to ask about device memory that is not released after MindSpeed training. In the first step, the memory at compute_log_prob's "Before compute_log_prob" is `device memory used/total (GB): 3.37/60.96`, but after one training step it becomes `device memory used/total (GB): 10.33/60.96`, about 7 GB more. I found that those ~7 GB remain resident somewhere between the megatron actor's "Before compute_log_prob" and update_actor's "After update_actor", while rollout behaves normally: comparing "Before rollout offload" and "After rollout offload" shows around 46 GB being released. In this run, the third rollout reported insufficient device memory when loading the model and KV cache. I could lower gpu_utilization, but I'd rather reserve more memory for the KV cache, so I'd like to understand why this memory is not released. Below is this worker's log:

```
(WorkerDict pid=393228, ip=172.16.0.197) [Rank 0 |...
```
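One way to narrow this down (a hedged sketch, not verl code; `report_phase` and `device_used_gb` are hypothetical helpers) is to snapshot device memory right after `update_actor` and then call `empty_cache()`: if the ~7 GB disappears, it was only cached by the allocator; if it stays, it is held by live tensors or by another process.

```python
# Hedged sketch: bracket a phase with device-memory snapshots and an empty_cache()
# call to tell allocator cache apart from memory that is genuinely still referenced.
import gc

from verl.utils.device import get_torch_device  # assumed helper; resolves to torch.cuda / torch.npu

GB = 1024**3


def device_used_gb() -> float:
    free_b, total_b = get_torch_device().mem_get_info()
    return (total_b - free_b) / GB


def report_phase(tag: str) -> None:
    device = get_torch_device()
    before_clear = device_used_gb()
    gc.collect()
    device.empty_cache()  # drops cached-but-unused blocks from this process's allocator pool
    after_clear = device_used_gb()
    live = device.memory_allocated() / GB
    print(
        f"[{tag}] device used: {before_clear:.2f} GB -> {after_clear:.2f} GB "
        f"(live tensors: {live:.2f} GB)"
    )


# usage sketch: call report_phase("after update_actor") right after the training step
```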
@zheliuyu Can you help me take a look?
@KK-277 Are you looking at this issue? Is there any progress?
@1k77 Okay, I'll give it a try. Did you manage to reproduce it on v0.11.0?