
Multi-GPU distributed training

Open houdawang opened this issue 1 year ago • 0 comments

Hello. I configured the Trainer with the following parameters:

```python
trainer = Trainer(
    driver="torch",
    train_dataloader=dl["train"],
    evaluate_dataloaders=dl["dev"],
    device=[4, 7],
    callbacks=callback,
    optimizers=optimizer,
    n_epochs=args.epoch,
    accumulation_steps=args.accumulation_steps,
    torch_kwargs={'ddp_kwargs': {'find_unused_parameters': True}},
)
trainer.run()
```

Training does indeed start on both GPUs, but the loss printed during training is NaN, and every metric printed each epoch shows the same unchanging value. Where could the problem be?
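One thing worth checking before digging into the DDP setup: once a single training step produces a NaN loss, it poisons every running average and gradient that touches it, so all subsequently reported metrics become identical, meaningless values, which matches the symptom above. The sketch below is a generic, framework-independent illustration of that poisoning effect (the `safe_accumulate` helper is hypothetical, not part of the fastNLP API); in practice you would add a similar finiteness check on the loss inside a callback or the training loop to find the first bad step.

```python
import math

def safe_accumulate(losses):
    """Sum per-step loss values, flagging any non-finite (NaN/Inf) steps.

    Without the math.isfinite guard, a single NaN at any step makes the
    running total (and hence the epoch-average loss and metrics) NaN from
    that point on.
    """
    total = 0.0
    bad_steps = []
    for step, loss in enumerate(losses):
        if not math.isfinite(loss):
            bad_steps.append(step)  # record where the NaN first appeared
            continue                # skip it instead of propagating NaN
        total += loss
    return total, bad_steps

# A NaN at step 2 (e.g. from log(0), division by zero, or an exploding
# learning rate) would otherwise contaminate everything after it:
total, bad = safe_accumulate([0.7, 0.5, float("nan"), 0.4])
```

Locating the first non-finite step this way usually narrows the cause to the data fed at that step or to a numerically unstable operation in the loss itself.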

houdawang avatar Apr 28 '24 08:04 houdawang