SFT training getting NaN loss when using PP=4, TP=4 and model params > 7B
Describe the bug
I am trying to finetune the OPT-13B model and get NaN loss at step 4 with the following configuration: PP=4, TP=4, MBS=4, global batch size=128.
Running models with 7B parameters or fewer under the same configuration works fine.
What could be the issue, and how can I fix it?
Steps/Code to reproduce bug
Following the Llama SFT training guide: https://docs.nvidia.com/nemo-framework/user-guide/latest/playbooks/llama2sft.html (see the command sketch below).
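For reference, the run has roughly the shape below. This is a sketch reconstructed from the playbook, not the exact command: the script path inside the container, the checkpoint and dataset paths, the node/GPU layout, and the `peft_scheme=null` override for full SFT are assumptions or placeholders.

```bash
# Rough shape of the SFT run from the llama2sft playbook with the settings
# reported above (TP=4, PP=4, MBS=4, global batch size=128).
# Assumptions/placeholders: 2 nodes x 8 GPUs (TP*PP = 16 total), the script
# path inside the container, the .nemo checkpoint and dataset paths, and
# peft_scheme=null to select full SFT rather than PEFT.
torchrun --nproc_per_node=8 \
  /opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_finetuning.py \
  trainer.num_nodes=2 \
  trainer.devices=8 \
  trainer.precision=bf16 \
  model.restore_from_path=/workspace/opt-13b.nemo \
  model.tensor_model_parallel_size=4 \
  model.pipeline_model_parallel_size=4 \
  model.micro_batch_size=4 \
  model.global_batch_size=128 \
  model.peft.peft_scheme=null \
  'model.data.train_ds.file_names=[/workspace/data/train.jsonl]' \
  'model.data.validation_ds.file_names=[/workspace/data/val.jsonl]'
```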
Expected behavior
Training loss should remain finite (no NaN) during SFT, as it does with the same configuration for models of 7B parameters or fewer.
Environment overview (please complete the following information)
- Environment location: Docker (nemo:24.03.01 container)
- Method of NeMo install: pre-installed in the NeMo Docker container (see the commands below).
- If method of install is Docker, provide the `docker pull` & `docker run` commands used (see the sketch below).
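For reference, a typical pull/run for this container looks roughly like the following; the exact NGC registry path/tag and the workspace mount are assumptions based on the image name given above.

```bash
# Assumed NGC registry path for the nemo:24.03.01 image named above;
# the workspace mount is a placeholder.
docker pull nvcr.io/nvidia/nemo:24.03.01
docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
  -v /path/to/workspace:/workspace \
  -it --rm nvcr.io/nvidia/nemo:24.03.01 bash
```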
Environment details
If an NVIDIA Docker image is used, you don't need to specify these. Otherwise, please provide:
- OS version
- PyTorch version
- Python version
Since this repo doesn't seem very active, here is my workaround. My case is Llama 3 8B Instruct: I just changed the learning rate to as small as possible (e.g. 1e-8) and set the global batch size to 256 in order to keep training stable. Hope this helps.
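Expressed as extra Hydra overrides on a playbook-style SFT command (a sketch; `model.optim.lr` and `model.global_batch_size` are the standard NeMo config keys as I understand them):

```bash
# Workaround from the comment above: very small learning rate plus a larger
# global batch size, appended to the SFT command sketched earlier in the thread.
WORKAROUND_OVERRIDES=(
  model.optim.lr=1e-8
  model.global_batch_size=256
)
# usage (sketch): torchrun ... megatron_gpt_finetuning.py <base overrides> "${WORKAROUND_OVERRIDES[@]}"
```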
This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.
This issue was closed because it has been inactive for 7 days since being marked as stale.