@thegregyang, I trained a model with muP. How can I convert my muP model weights to SP so that I can load them with Hugging Face?
Nope, I didn't find out why. If I set the number of steps to 20000, it works. But if I set it longer, e.g., 200000 steps, then it will...
@awaelchli Hey, thanks for taking a close look at that. Honestly, I just switched to another machine. The new server provider uses Docker. I doubt it's a hardware issue.
Also, I did try extending the timeout from 30 minutes to 8 hours, but still had no luck getting it to run properly. I am not sure if the extend...
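For anyone hitting the same thing, here is a rough sketch of what extending the timeout looks like. It assumes the process group is created through `torch.distributed` directly, which may not match the setup in this thread; `OLD_TIMEOUT`, `NEW_TIMEOUT`, and `init_with_long_timeout` are hypothetical names used only for illustration.

```python
from datetime import timedelta

# Assumption: the values mirror the ones mentioned above (30 minutes -> 8 hours).
OLD_TIMEOUT = timedelta(minutes=30)
NEW_TIMEOUT = timedelta(hours=8)


def init_with_long_timeout(backend: str = "nccl") -> None:
    """Initialize the default process group with the extended timeout."""
    # Imported lazily so this file can be read or imported without torch installed.
    import torch.distributed as dist

    # Expects the usual environment variables (MASTER_ADDR, MASTER_PORT,
    # RANK, WORLD_SIZE), e.g. as set by torchrun.
    dist.init_process_group(backend=backend, timeout=NEW_TIMEOUT)
```

If the run goes through Lightning instead, `DDPStrategy` also accepts a `timeout` argument that takes a `timedelta`, so the same value could be passed there. A longer timeout only delays the failure if a rank is genuinely stuck, so it is a diagnostic aid more than a fix.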
The issue is gone. I am still holding on to the machine that had the issue, though. If I just train from scratch, it doesn't hit any issue.
@awaelchli If you want to look into that, I can provide the info you need. Just close the ticket for now.
> thon failed to use multi

Just wondering what the final results are with srun and without srun. Does srun give worse results?
Have you ever tried loading an HF model and continuing to train it?
@lmcafee-nvidia I tried using --saver mcore, but it hits another error. `Traceback (most recent call last): File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap self.run() File "/usr/lib/python3.10/multiprocessing/process.py", line 108,...
I did use --saver megatron, and the conversion completed without a problem.