lmcafee-nvidia
We do not currently support ZeRO-2/3, but it is possible that we will support this in the future.
@mxjmtxrm , our instructions could be clearer in these docs regarding the compatibility between the converter's `--saver` arg and the training model format. There are two model formats, `legacy` (a.k.a.,...
This error occurred because the default value for `--ckpt-format` is `torch_dist`, i.e., the distributed checkpoint format, which is incompatible with the legacy model format. Again, there...
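For reference, here is a minimal sketch of what the checkpoint-format flags might look like when training from a legacy-format checkpoint. The script name, paths, and trailing arguments are placeholders; the point is only that `--ckpt-format torch` selects the legacy (non-distributed) format, matching a checkpoint saved in the legacy model format:

```sh
# Hypothetical invocation; only the checkpoint flags are the point here.
# --ckpt-format torch selects the legacy (non-distributed) checkpoint format,
# overriding the torch_dist default that caused the incompatibility above.
python pretrain_gpt.py \
    --load /path/to/converted_checkpoint \
    --ckpt-format torch \
    ...  # remaining model/training args
```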
Just to be clear, did your error above happen during conversion or during training? The extra lines at the bottom showing `sending transformer layer ...` indicate that this is a...
@zshCuanNi , thanks for your suggestion regarding `spawn`, though in internal testing we have not encountered any issues related to it. Perhaps something is different in our environment setups, but...
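For anyone following along, the `spawn` suggestion refers to Python's multiprocessing start method. A minimal, self-contained sketch of forcing `spawn` (the worker function and values here are placeholders, not part of the converter):

```python
import multiprocessing as mp

def work(x):
    # Placeholder worker; in the real converter this would be per-rank work.
    return x * x

def main():
    # Request the "spawn" start method: each worker re-imports the module
    # instead of inheriting state via fork, which can matter when CUDA
    # contexts are initialized in the parent process.
    ctx = mp.get_context("spawn")
    with ctx.Pool(2) as pool:
        results = pool.map(work, [1, 2, 3])
    print(results)

if __name__ == "__main__":
    main()
```

Whether this changes anything likely depends on the environment; as noted above, we have not reproduced the issue internally.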
Let's keep the discussion on GitHub for now. Did you consider making a reproducible example? If you set up a script based on a public checkpoint, I can try to debug...
Thanks @zshCuanNi for your script. I'll run it and get back to you.
@zshCuanNi , I didn't see any errors when I ran your conversion command above on the Llama 3 8B model. I tested your conversion script for a few different NGC...