minimal-llama
Fine-tuning with Naive Pipeline Parallel: NaN after optimizer step
Your model does not seem to compute the gradients of the layers correctly. When I run finetune_pp.py and print the loss during training, the loss becomes NaN after the first optimizer step:
tensor(nan, device='cuda:1', dtype=torch.float16, grad_fn=<NllLossBackward0>)
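For reference, this is roughly the kind of check I added around the training loop to see where the NaNs first show up (loss, gradients, or weights). It is only a sketch: the `model`, `optimizer`, and `dataloader` names are placeholders, not the actual objects from finetune_pp.py.

```python
import torch

def report_nans(model, loss, step, tag):
    # Print which quantities contain NaNs at this point in the step.
    if torch.isnan(loss).any():
        print(f"step {step} [{tag}]: loss is NaN")
    for name, param in model.named_parameters():
        if param.grad is not None and torch.isnan(param.grad).any():
            print(f"step {step} [{tag}]: NaN gradient in {name}")
        if torch.isnan(param).any():
            print(f"step {step} [{tag}]: NaN weight in {name}")

for step, batch in enumerate(dataloader):
    optimizer.zero_grad()
    loss = model(**batch).loss
    loss.backward()
    report_nans(model, loss, step, tag="before optimizer.step")
    optimizer.step()
    report_nans(model, loss, step, tag="after optimizer.step")
```

In my run the gradients already look fine before the step, and the NaNs only appear after the weights are updated, which is why I suspect the optimizer step rather than the forward/backward pass.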
Can you reproduce this on your machine? If not, would you be willing to share your pip freeze, so that I can check whether there is a package mismatch?