
Megatron-SWIFT training: GPU out-of-memory error when exporting a 32B model

Open zhangtianhong-1998 opened this issue 8 months ago • 10 comments

Describe the bug

CUDA_VISIBLE_DEVICES=0 \
swift export \
    --model Qwen/QwQ-32B \
    --to_mcore true \
    --torch_dtype bfloat16 \
    --test_convert_precision true \
    --output_dir Qwen/QwQ-32B-mcore

Your hardware and system info: H100*80

Additional context: Is multi-GPU export possible, or do other parameters need to be configured? I couldn't find any related documentation.

zhangtianhong-1998 · Apr 04 '25

At the risk of being greedy: could you share a Megatron-SWIFT script for training a 32B model on 8*H100? I'm stuck at the moment.

zhangtianhong-1998 · Apr 04 '25

Just remove the line --test_convert_precision true and it will work.

Jintao-Huang · Apr 05 '25
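
For reference, a minimal sketch of the export command with that flag dropped (identical to the command in the original report, only --test_convert_precision removed):

CUDA_VISIBLE_DEVICES=0 \
swift export \
    --model Qwen/QwQ-32B \
    --to_mcore true \
    --torch_dtype bfloat16 \
    --output_dir Qwen/QwQ-32B-mcore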

After the conversion, training stalls when run with the command below. What could be the problem? It stops at this log line: [before the start of training step] datetime: 2025-04-05 10:10:38

NPROC_PER_NODE=8 \
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
megatron sft \
    --load Qwen/QwQ-32B-mcore \
    --dataset 'swift/self-cognition#500' \
    --tensor_model_parallel_size 2 \
    --pipeline_model_parallel_size 4 \
    --micro_batch_size 1 \
    --global_batch_size 16 \
    --recompute_granularity selective \
    --train_iters 1000 \
    --eval_iters 10 \
    --finetune true \
    --cross_entropy_loss_fusion true \
    --lr 1e-5 \
    --lr_warmup_iters 50 \
    --min_lr 1e-6 \
    --save megatron_output/QwQ-32B-mcore \
    --save_interval 100 \
    --max_length 32768 \
    --system 'You are a helpful assistant.' \
    --num_workers 8 \
    --no_save_optim true \
    --no_save_rng true \
    --dataset_num_proc 8 \
    --model_author swift \
    --model_name swift-robot

zhangtianhong-1998 · Apr 05 '25

pip install py-spy

Then run py-spy dump --pid xxx to see where it is stuck.

Jintao-Huang · Apr 06 '25
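
A minimal sketch of that debugging step across all ranks (the pgrep pattern is an assumption; adjust it to match your actual training processes):

pip install py-spy
# dump the Python stack of every rank of the stalled run
for pid in $(pgrep -f "megatron sft"); do
    echo "=== PID $pid ==="
    py-spy dump --pid "$pid"
done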

Training error: assert cu_seqlens_q == None and cu_seqlens_kv == None

megatron sft \
    --load ${CKPT_PATH}/QwQ-32B-mcore \
    --dataset ${ROOT_PATH}/data/llm/${TRAIN_DATA} \
    --tensor_model_parallel_size 2 \
    --pipeline_model_parallel_size 4 \
    --micro_batch_size 1 \
    --global_batch_size 8 \
    --recompute_granularity full \
    --recompute_method uniform \
    --recompute_num_layers 8 \
    --train_iters 100 \
    --eval_iters 50 \
    --finetune true \
    --cross_entropy_loss_fusion true \
    --lr 1e-5 \
    --lr_warmup_iters 10 \
    --min_lr 1e-6 \
    --save ${ROOT_PATH}/megatron_output/${MODEL}/${RUN_NAME} \
    --save_interval 100 \
    --max_length 16000 \
    --system 'You are a helpful assistant.' \
    --num_workers 4 \
    --no_save_optim true \
    --no_save_rng true \
    --dataset_num_proc 4 \
    --model_author swift \
    --model_name swift-robot \
    --attn_impl flash_attn \
    --packing true \
    --sequence_parallel true \
    --use_flash_attn true

llp1992 · Apr 30 '25

Do you have a screenshot of the error? I'd like to see where it is thrown from.

Jintao-Huang · Apr 30 '25

Check your Megatron-LM version.

I recommend using the swift Docker image directly.

Jintao-Huang · Apr 30 '25
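
A quick sketch for checking which Megatron-LM / megatron-core installation is actually being picked up (assumes a pip install; for a source checkout, inspect the repository's branch and commit instead):

# version of the pip package, if installed that way
pip show megatron-core | grep -i version
# path of the module that Python actually imports
python -c "import megatron.core; print(megatron.core.__file__)"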

"Do you have a screenshot of the error? I'd like to see where it is thrown from."

It is thrown from here:

Megatron-LM/megatron/core/transformer/attention.py", line 591, in forward
[rank4]: assert cu_seqlens_q == None and cu_seqlens_kv == None
[rank4]: AssertionError

Which version of Megatron-LM is required?

llp1992 · Apr 30 '25

It probably is indeed a Megatron-LM version issue; training Qwen3-30B-A3B reports the same error.

llp1992 · Apr 30 '25

This is covered in the documentation:

https://swift.readthedocs.io/zh-cn/latest/Instruction/Megatron-SWIFT%E8%AE%AD%E7%BB%83.html

Jintao-Huang · Apr 30 '25