But I hit the same problem when training the legacy model.
@lmcafee-nvidia During conversion.
@lmcafee-nvidia Megatron conversion works, but during training we hit the exact same error as in this post. So we changed the conversion type to `--saver mcore`, but that conversion couldn't finish. We...
@lmcafee-nvidia Just another update: we also tried these two flags, `--use-legacy-models --ckpt-format torch`. Neither of the solutions you provided works for us; it still hits the state_dict error:
```
[rank4]:...
```
> [@TeddLi](https://github.com/TeddLi) You should use `spawn` as the start method for torch multiprocessing; otherwise the CUDA context cannot be properly set up. A simple way to fix it is to just add...
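A minimal sketch of what that fix typically looks like, assuming a simple per-GPU worker (the `worker` function and its body are illustrative, not from this thread):

```python
import torch
import torch.multiprocessing as mp

def worker(rank: int) -> None:
    # Each spawned process sets up its own CUDA context on its GPU.
    torch.cuda.set_device(rank)
    print(f"rank {rank} running on device {torch.cuda.current_device()}")

if __name__ == "__main__":
    # "spawn" launches fresh interpreter processes, so no CUDA state is
    # inherited from the parent (unlike the default "fork" on Linux).
    mp.set_start_method("spawn", force=True)
    mp.spawn(worker, nprocs=torch.cuda.device_count())
```

Note that `mp.spawn` already uses the spawn start method internally, so replacing a raw `mp.Process` launch with `mp.spawn` has the same effect.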
> Let's keep the discussion on GitHub for now. Did you consider making a reproducible example? If you set up a script based on a public checkpoint, I can try to...