lmcafee-nvidia
We do not currently support ZeRO-2/3, but it is possible that we will support this in the future.
@mxjmtxrm , our instructions could be clearer in these docs regarding the compatibility between the converter's `--saver` arg and the training model format. There are two model formats, `legacy` (a.k.a.,...
This error occurred because the default value for `--ckpt-format` is `torch_dist`, i.e., the distributed checkpoint format, which is incompatible with the legacy model format. Again, there...
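For reference, here is a minimal sketch of what the checkpoint-format flags might look like when training from a legacy-format checkpoint. The script name, paths, and trailing arguments are placeholders; the point is only that `--ckpt-format torch` selects the legacy (non-distributed) format, matching a checkpoint saved in the legacy model format:

```sh
# Hypothetical invocation; only the checkpoint flags are the point here.
# --ckpt-format torch selects the legacy (non-distributed) checkpoint format,
# overriding the torch_dist default that caused the incompatibility above.
python pretrain_gpt.py \
    --load /path/to/converted_checkpoint \
    --ckpt-format torch \
    ...  # remaining model/training args
```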
Just to be clear, did your error above happen during conversion or during training? The extra lines at the bottom showing `sending transformer layer ...` indicate that this is a...
@zshCuanNi , thanks for your suggestion regarding `spawn`, though in internal testing we have not encountered any issues related to it. Perhaps something is different in our environment setups, but...
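For anyone following along, the `spawn` suggestion refers to Python's multiprocessing start method. A minimal, self-contained sketch of forcing `spawn` (the worker function and values here are placeholders, not part of the converter):

```python
import multiprocessing as mp

def work(x):
    # Placeholder worker; in the real converter this would be per-rank work.
    return x * x

def main():
    # Request the "spawn" start method: each worker re-imports the module
    # instead of inheriting state via fork, which can matter when CUDA
    # contexts are initialized in the parent process.
    ctx = mp.get_context("spawn")
    with ctx.Pool(2) as pool:
        results = pool.map(work, [1, 2, 3])
    print(results)

if __name__ == "__main__":
    main()
```

Whether this changes anything likely depends on the environment; as noted above, we have not reproduced the issue internally.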
Let's keep the discussion on GitHub for now. Did you consider making a reproducible example? If you set up a script based on a public checkpoint, I can try to debug...
Thanks @zshCuanNi for your script. I'll run it and get back to you.
@zshCuanNi , I didn't see any errors when I ran your conversion command above on the Llama 3 8B model. I tested your conversion script for a few different NGC...