Results: 2 comments by Lan_se&shanzhu
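> Training on 2 GPUs works fine; with more than 2, the model is only loaded twice, then it hangs and never starts training.

Hi, could you tell me what causes the error when training on 2 GPUs? My command is `NPROC_PER_NODE=2 xtuner train test_myllama_train.py --deepspeed deepspeed_zero2`, and it reports `torch.distributed.elastic.multiprocessing.errors.ChildFailedError:`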
'''
# whether to resume training from the loaded checkpoint
resume = False
'''
Change 'False' to 'True'.
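For reference, a minimal sketch of how that part of the config file might look after the change. It assumes an MMEngine-style Python config such as the ones XTuner uses; the `load_from` line is an assumption about the surrounding config and is not quoted in the comment above.

```python
# Sketch of the relevant part of the training config (e.g. test_myllama_train.py).
# Assumption: the config also defines `load_from`, as MMEngine-style configs usually do.
load_from = None  # assumed: no explicit checkpoint path is given

# whether to resume training from the loaded checkpoint
resume = True  # changed from False
```

As far as I know, with `resume = True` and `load_from = None` the MMEngine runner looks for the latest checkpoint in the work directory and resumes from it if one is found.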