SFT training getting NaN loss when using PP=4, TP=4 and model params > 7B
Describe the bug
I am trying to finetune the OPT-13B model and get NaN loss at step 4 with the following configuration: PP=4, TP=4, MBS=4, global batch size=128.
Running models with 7B parameters or fewer under the same configuration works fine.
What could be the issue, and how can I fix it?
Steps/Code to reproduce bug
Following the Llama SFT training guide: https://docs.nvidia.com/nemo-framework/user-guide/latest/playbooks/llama2sft.html (see the command sketch below).
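For reference, the run has roughly the shape below. This is a sketch reconstructed from the playbook, not the exact command: the script path inside the container, the checkpoint and dataset paths, the node/GPU layout, and the `peft_scheme=null` override for full SFT are assumptions or placeholders.

```bash
# Rough shape of the SFT run from the llama2sft playbook with the settings
# reported above (TP=4, PP=4, MBS=4, global batch size=128).
# Assumptions/placeholders: 2 nodes x 8 GPUs (TP*PP = 16 total), the script
# path inside the container, the .nemo checkpoint and dataset paths, and
# peft_scheme=null to select full SFT rather than PEFT.
torchrun --nproc_per_node=8 \
  /opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_finetuning.py \
  trainer.num_nodes=2 \
  trainer.devices=8 \
  trainer.precision=bf16 \
  model.restore_from_path=/workspace/opt-13b.nemo \
  model.tensor_model_parallel_size=4 \
  model.pipeline_model_parallel_size=4 \
  model.micro_batch_size=4 \
  model.global_batch_size=128 \
  model.peft.peft_scheme=null \
  'model.data.train_ds.file_names=[/workspace/data/train.jsonl]' \
  'model.data.validation_ds.file_names=[/workspace/data/val.jsonl]'
```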
Expected behavior
Training loss should remain finite (no NaN) during SFT, as it does with the same configuration for models of 7B parameters or fewer.
Environment overview (please complete the following information)
- Environment location: Docker (nemo:24.03.01 container)
- Method of NeMo install: pre-installed in the NeMo Docker container (see the commands below).
- If method of install is Docker, provide the `docker pull` & `docker run` commands used (see the sketch below).
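For reference, a typical pull/run for this container looks roughly like the following; the exact NGC registry path/tag and the workspace mount are assumptions based on the image name given above.

```bash
# Assumed NGC registry path for the nemo:24.03.01 image named above;
# the workspace mount is a placeholder.
docker pull nvcr.io/nvidia/nemo:24.03.01
docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
  -v /path/to/workspace:/workspace \
  -it --rm nvcr.io/nvidia/nemo:24.03.01 bash
```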
Environment details
If an NVIDIA Docker image is used, you don't need to specify these. Otherwise, please provide:
- OS version
- PyTorch version
- Python version
Since this repo doesn't seem very active, here is my workaround. My case is Llama 3 8B Instruct: I just changed the learning rate to as small as possible (e.g. 1e-8) and set the global batch size to 256 in order to keep training stable. Hope this helps.
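Expressed as extra Hydra overrides on a playbook-style SFT command (a sketch; `model.optim.lr` and `model.global_batch_size` are the standard NeMo config keys as I understand them):

```bash
# Workaround from the comment above: very small learning rate plus a larger
# global batch size, appended to the SFT command sketched earlier in the thread.
WORKAROUND_OVERRIDES=(
  model.optim.lr=1e-8
  model.global_batch_size=256
)
# usage (sketch): torchrun ... megatron_gpt_finetuning.py <base overrides> "${WORKAROUND_OVERRIDES[@]}"
```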
This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.
This issue was closed because it has been inactive for 7 days since being marked as stale.