minimal-llama
Fine-tuning with Naive Pipeline Parallel: NaN after optimizer step
Your model does not seem to compute the gradients of the layers correctly. When I run finetune_pp.py and print the loss during training, the loss becomes NaN after the first optimizer step:
tensor(nan, device='cuda:1', dtype=torch.float16, grad_fn=<NllLossBackward0>)
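For reference, this is roughly the kind of check I added around the training loop to see where the NaNs first show up (loss, gradients, or weights). It is only a sketch: the `model`, `optimizer`, and `dataloader` names are placeholders, not the actual objects from finetune_pp.py.

```python
import torch

def report_nans(model, loss, step, tag):
    # Print which quantities contain NaNs at this point in the step.
    if torch.isnan(loss).any():
        print(f"step {step} [{tag}]: loss is NaN")
    for name, param in model.named_parameters():
        if param.grad is not None and torch.isnan(param.grad).any():
            print(f"step {step} [{tag}]: NaN gradient in {name}")
        if torch.isnan(param).any():
            print(f"step {step} [{tag}]: NaN weight in {name}")

for step, batch in enumerate(dataloader):
    optimizer.zero_grad()
    loss = model(**batch).loss
    loss.backward()
    report_nans(model, loss, step, tag="before optimizer.step")
    optimizer.step()
    report_nans(model, loss, step, tag="after optimizer.step")
```

In my run the gradients already look fine before the step, and the NaNs only appear after the weights are updated, which is why I suspect the optimizer step rather than the forward/backward pass.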
Can you reproduce this on your machine? If not, would you be willing to share your pip freeze, so that I can check whether there is a package mismatch?