llp1992

Results 12 comments of llp1992

Training error: assert cu_seqlens_q == None and cu_seqlens_kv == None. Command: megatron sft \ --load ${CKPT_PATH}/QwQ-32B-mcore \ --dataset ${ROOT_PATH}/data/llm/${TRAIN_DATA} \ --tensor_model_parallel_size 2 \ --pipeline_model_parallel_size 4 \ --micro_batch_size 1 \ --global_batch_size 8 \ --recompute_granularity...

> Do you have a screenshot of the error? Let's see where it is raised.

Megatron-LM/megatron/core/transformer/attention.py", line 591, in forward [rank4]: assert cu_seqlens_q == None and cu_seqlens_kv == None [rank4]: AssertionError

Which version of Megatron-LM is required?

It may indeed be a Megatron-LM version issue; training Qwen3-30B-A3B raises the same error.
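When chasing a suspected version mismatch like this one, it helps to report exactly which versions are installed. Below is a minimal sketch (not from the thread) that prints them. The distribution names `megatron-core` and `ms-swift` are assumptions and may not match a Megatron-LM source checkout that was never pip-installed; in that case the function simply returns `None`.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if it is not installed."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Distribution names below are assumptions; adjust them to your environment.
    for name in ("megatron-core", "ms-swift", "torch"):
        print(name, installed_version(name))
```

Pasting this output alongside the stack trace makes it easier to tell whether the assertion comes from an incompatible Megatron-LM release.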

Why does it print "Downloading Model to directory: /mnt/workspace/.cache/modelscope/simon-stub-path"?

Qwen3-30B-A3B trains successfully, but megatron sft on Qwen3-32B fails: 2025-05-02T03:37:00.069008389Z [rank24]: raise RuntimeError( 2025-05-02T03:37:00.069009658Z [rank24]: torch._dynamo.exc.TorchRuntimeError: Failed running call_function (*(FakeTensor(..., device='cuda:0', size=(90880, 37984)), (FakeTensor(..., device='cuda:0', size=(90880,), dtype=torch.int64), FakeTensor(..., device='cuda:0', size=(90752,), dtype=torch.int64))), **{}): 2025-05-02T03:37:00.069011338Z [rank24]: Attempting to...

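> Could you check where it is raised? A more complete error message would help, ideally a screenshot.

Every Qwen3 dense model hits this error when trained with Megatron.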

> Is there a swift stack trace? Everything here is from torch.

How about you try running it yourselves?