cryodrgn Stop training on nan

cryodrgn train_vae should stop as soon as the loss becomes NaN instead of silently continuing for the specified number of epochs.
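A minimal sketch of the requested behavior, not the actual train_vae code. It assumes the per-epoch loss is available as a plain float (for example via loss.item()). The names run_epochs and train_one_epoch are hypothetical and only illustrate the early exit.

```python
import math
from typing import Callable, List


def run_epochs(num_epochs: int, train_one_epoch: Callable[[int], float]) -> List[float]:
    """Run up to num_epochs epochs, aborting on the first non-finite loss.

    train_one_epoch is a hypothetical callback that trains for one epoch
    and returns the epoch loss as a float.
    """
    history = []
    for epoch in range(num_epochs):
        loss = float(train_one_epoch(epoch))
        # math.isfinite is False for NaN and for +/-inf
        if not math.isfinite(loss):
            raise RuntimeError(
                f"Non-finite loss ({loss}) at epoch {epoch}; stopping training."
            )
        history.append(loss)
    return history
```

Raising an exception rather than breaking out of the loop makes the failure visible in logs and in the exit code, so a batch job does not look as if it finished normally.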

Sep 07 '22 02:09 zhonge

@vineetbansal could you add this feature? Thanks!

Sep 20 '22 18:09 zhonge

The --do-pose-sgd option can frequently cause pose parameters to become NaN, because pose_optimizer.step() is not called through the gradient scaler in automatic mixed precision training. I fixed my code as below:

            if do_pose_sgd and epoch >= args.pretrain:
                if args.amp:
                    optim.zero_grad()
                    # Step through the GradScaler so inf/NaN gradients are skipped
                    scaler.step(pose_optimizer)
                    scaler.update()
                else:
                    pose_optimizer.step()
                # Fail fast if the poses for this batch were corrupted anyway
                if (torch.any(torch.isnan(posetracker.rots_emb.weight[ind]))
                        or torch.any(torch.isnan(posetracker.trans_emb.weight[ind]))):
                    raise RuntimeError("NaN Found in Pose.")

scaler.step() detects inf/NaN gradients and skips the optimizer step in that case, which avoids corrupting the parameters. Note that this approach is incompatible with apex.amp.
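For reference, here is a self-contained sketch of the general pattern of sharing one torch.cuda.amp.GradScaler between two optimizers. It is not the cryodrgn training loop: the toy model, the dense pose_emb embedding, and the helper names train_step and demo are assumptions made for illustration. Both optimizers are stepped through the scaler, and scaler.update() is called once per iteration.

```python
import math

import torch
import torch.nn as nn


def train_step(model, pose_emb, optim, pose_optimizer, scaler, x, ind, use_amp):
    """One iteration with a single GradScaler shared by two optimizers."""
    optim.zero_grad()
    pose_optimizer.zero_grad()
    with torch.autocast(device_type=x.device.type, enabled=use_amp):
        pred = model(x + pose_emb(ind))
        loss = pred.float().pow(2).mean()
    # Scale once; gradients for both parameter groups come from this backward
    scaler.scale(loss).backward()
    # Each step is skipped by the scaler if that optimizer's grads hold inf/NaN
    scaler.step(optim)
    scaler.step(pose_optimizer)
    # Update the scale factor once, after all optimizers have stepped
    scaler.update()
    return loss.item()


def demo():
    """Toy setup (hypothetical): AMP is enabled only when CUDA is available."""
    torch.manual_seed(0)
    use_amp = torch.cuda.is_available()
    device = "cuda" if use_amp else "cpu"
    model = nn.Linear(4, 1).to(device)
    pose_emb = nn.Embedding(8, 4).to(device)  # stand-in for per-image pose params
    optim = torch.optim.Adam(model.parameters(), lr=1e-3)
    pose_optimizer = torch.optim.Adam(pose_emb.parameters(), lr=1e-3)
    scaler = torch.cuda.amp.GradScaler(enabled=use_amp)
    x = torch.randn(3, 4, device=device)
    ind = torch.tensor([0, 2, 5], device=device)
    loss = train_step(model, pose_emb, optim, pose_optimizer, scaler, x, ind, use_amp)
    poses_ok = bool(torch.isfinite(pose_emb.weight[ind]).all())
    return loss, poses_ok
```

With the scaler disabled (no CUDA), scaler.step() simply calls the optimizer's step(), so the same code path works with and without AMP.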

Sep 30 '22 07:09 lipan6461188