NanoCode012
Are you sure it's due to CPU offload? Can you try using the regular ds3_bf16 config?
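For future readers, a minimal sketch of what that switch looks like, assuming the stock deepspeed JSON configs shipped with axolotl (the exact filenames and paths are an assumption and may differ between versions of the repo):

```bash
# In the axolotl YAML, point `deepspeed` at the plain ZeRO-3 bf16 config instead of the
# cpu-offload variant (hypothetical filenames; check the repo's deepspeed config directory):
#   deepspeed: deepspeed/zero3_bf16.json    # instead of the *_cpuoffload_* variant
# then relaunch training with the usual command:
accelerate launch -m axolotl.cli.train your_config.yml
```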
Could you see if there are any earlier errors?
Is this on RunPod? Could you try the NCCL doc? https://github.com/OpenAccess-AI-Collective/axolotl/blob/main/docs/nccl.md
Could you give that doc a try either way?
Hello @Ki6an @yuleiqin, have you tried the linked NCCL doc, and did it solve the issue?
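For reference, the linked doc covers NCCL-level debugging and workarounds; below is a rough sketch of the kind of environment settings involved. The exact recommendations, and whether they apply to your hardware, are in the doc itself.

```bash
# Generic NCCL debugging/workaround environment variables of the kind the linked doc
# discusses; export them in the shell before launching the distributed run.
export NCCL_DEBUG=INFO        # verbose NCCL logging, to surface the underlying error
export NCCL_P2P_DISABLE=1     # disable peer-to-peer transport if it hangs on your GPUs/host
accelerate launch -m axolotl.cli.train your_config.yml
```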
I know this is 3 years too late, but I would like to add this for future readers. I implemented this feature in my fork but did not create a...
@ElleLeonne, thank you for answering. I also see your loss going to 0. Isn't that incorrect? I don't think it should go that low, right? I attached a sample training...
Hello @ElleLeonne, thanks for the reply. > when switching to a new dataset I noticed this issue originally with a custom dataset but was also able to reproduce it...
> Yes, the original cleaned version worked fine. After fixing the problem, loss appears to stay steady for a single epoch. @ElleLeonne, may I clarify which model size you...
> 7bn works with the cleaned alpaca dataset, and another dataset of mine that uses a similar, yet not identical, format, with different key names. > […] @ElleLeonne, have...