NanoCode012
Are you sure it's due to CPU offload? Can you try using the regular ds3_bf16 config?
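For future readers, a minimal sketch of what that switch looks like, assuming the stock deepspeed JSON configs shipped with axolotl (the exact filenames and paths are an assumption and may differ between versions of the repo):

```bash
# In the axolotl YAML, point `deepspeed` at the plain ZeRO-3 bf16 config instead of the
# cpu-offload variant (hypothetical filenames; check the repo's deepspeed config directory):
#   deepspeed: deepspeed/zero3_bf16.json    # instead of the *_cpuoffload_* variant
# then relaunch training with the usual command:
accelerate launch -m axolotl.cli.train your_config.yml
```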
Could you see if there are any earlier errors?
Is this on RunPod? Could you try the NCCL doc? https://github.com/OpenAccess-AI-Collective/axolotl/blob/main/docs/nccl.md
Could you give that doc a try either way?
Hello @Ki6an @yuleiqin, have you tried the linked NCCL doc, and did it solve the issue?
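For reference, the linked doc covers NCCL-level debugging and workarounds; below is a rough sketch of the kind of environment settings involved. The exact recommendations, and whether they apply to your hardware, are in the doc itself.

```bash
# Generic NCCL debugging/workaround environment variables of the kind the linked doc
# discusses; export them in the shell before launching the distributed run.
export NCCL_DEBUG=INFO        # verbose NCCL logging, to surface the underlying error
export NCCL_P2P_DISABLE=1     # disable peer-to-peer transport if it hangs on your GPUs/host
accelerate launch -m axolotl.cli.train your_config.yml
```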
I know this is 3 years too late, but I would like to add this for future readers. I implemented this feature in my fork but did not create a...
@ElleLeonne, thank you for answering. I also see your loss going to 0. Isn't that incorrect? I don't think it should go that low, right? I attached a sample training...
Hello @ElleLeonne, thanks for the reply. > when switching to a new dataset I noticed this issue originally with a custom dataset but was also able to reproduce it...
> Yes, the original cleaned version worked fine. After fixing the problem, loss appears to stay steady for a single epoch. @ElleLeonne, may I clarify which model size you...
> 7bn works with the cleaned alpaca dataset, and another dataset of mine that uses a similar, yet not identical, format, with different key names. > […] @ElleLeonne, have...