@thegregyang, I trained a model with muP. How can I convert my muP model weights to SP so that I can load them with Hugging Face?
Nope, I didn't find out why. If I set the number of steps to 20000, it works. But if I set it a bit longer, e.g., 200000 steps, then it will...
@awaelchli Hey, thanks for taking a close look at that. Honestly, I just switched machines. The new server provider uses Docker. I doubt it's a hardware issue.
I also tried extending the timeout from 30 minutes to 8 hours, but still had no luck getting it to run properly. I am not sure if the extend...
The issue is gone. I am still holding on to the machine that had the issue, though. If I just train from the start, it won't hit any issue.
@awaelchli If you want to look into that, I can provide the info you need. Just close the ticket for now.
> thon failed to use multi

Just wondering what the final results are with srun and without srun. Does srun give worse results?