CPM-Bee

Model loading hangs on a single machine with multiple GPUs

Open juexingyezuile opened this issue 1 year ago • 3 comments

torch 1.13 cuda 11.7

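The inference code runs normally.

Training on four RTX 4090 GPUs hangs while loading the model, with CPU usage at 100% and GPU utilization at 100%.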

torchrun --nnodes=1 --nproc_per_node=4 --rdzv_id=1 --rdzv_backend=c10d --rdzv_endpoint=localhost:12345 \
    finetune_cpm_bee.py \
    --use-delta \
    --model-config config/cpm-bee-10b.json \
    --dataset datasets/eprstmt/binary/dev \
    --eval_dataset datasets/eprstmt/binary/eval_dev \
    --epoch 100 \
    --batch-size 4 \
    --train-iters 100 \
    --save-name cpm_bee_finetune \
    --max-length 2048 \
    --save results/ \
    --lr 0.0001 \
    --inspect-iters 100 \
    --warmup-iters 1 \
    --eval-interval 1000 \
    --early-stop-patience 5 \
    --lr-decay-style noam \
    --weight-decay 0.01 \
    --clip-grad 1.0 \
    --loss-scale 32768 \
    --start-step 0 \
    --load path/pytorch_model_10b.bin

====================== Initialization ======================
rank : 0
local_rank : 0
world_size : 4
local_size : 4
master : star-SYS-420GP-TNR:37257
device : 0
cpus : [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37]

juexingyezuile, Jun 07 '23 02:06

How are you all so rich?

xiaoguaishoubaobao, Jun 10 '23 02:06

Same question: model loading hangs on a single machine with multiple GPUs.

Universe-Sun, Jun 15 '23 08:06

Same question: on a single machine with multiple GPUs, model loading hangs at bmt.init_distributed.

wyl7, Jul 20 '23 04:07
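
No fix is confirmed in this thread. Since the last report places the hang at bmt.init_distributed, which is presumably where communication between the GPUs is set up, one way to narrow it down might be to rerun the same command with NCCL diagnostics enabled. The sketch below is only an assumption about how to do that: NCCL_DEBUG and NCCL_P2P_DISABLE are standard NCCL environment variables (not CPM-Bee or BMTrain options), and the helper names build_debug_env and run_with_debug_env are hypothetical. Whether disabling peer-to-peer transfers has any effect on this particular hang is unverified.

```python
import os
import subprocess

# Assumption: standard NCCL environment variables, not options of CPM-Bee or BMTrain.
DEBUG_ENV = {
    "NCCL_DEBUG": "INFO",      # ask NCCL to log its setup and transport selection
    "NCCL_P2P_DISABLE": "1",   # disable direct GPU-to-GPU peer-to-peer transfers
}


def build_debug_env(base_env=None):
    """Return a copy of the environment with the NCCL diagnostic variables set.

    Hypothetical helper; the caller's environment is left unmodified.
    """
    env = dict(os.environ if base_env is None else base_env)
    env.update(DEBUG_ENV)
    return env


def run_with_debug_env(command):
    """Run `command` (a list of arguments) under the diagnostic environment.

    Returns the exit code of the child process.
    """
    return subprocess.run(command, env=build_debug_env(), check=False).returncode
```

To try it, pass the torchrun command from the original post to run_with_debug_env as a list of arguments, then check whether the NCCL log lines stop at a particular rank or transport.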