add megatron_mcore saver/loader
add a saver/loader for mcore checkpoint conversion
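For reviewers, a hypothetical invocation showing how the new saver/loader would plug into the checkpoint converter; the script path, the `mcore` loader/saver names, and the flags are assumptions based on this PR's title and the `tools/checkpoint` converter, so verify them against the revision you are running:

```bash
# Hypothetical example: re-partition a checkpoint via the new mcore loader/saver.
# Script path and flag names are assumptions; check tools/checkpoint in your checkout.
python tools/checkpoint/convert.py \
    --model-type GPT \
    --loader mcore \
    --saver mcore \
    --load-dir /path/to/source/checkpoint \
    --save-dir /path/to/converted/checkpoint \
    --target-tensor-parallel-size 2 \
    --target-pipeline-parallel-size 2
```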
Hi, I think your conversion code has some potential bugs:
https://github.com/NVIDIA/Megatron-LM/issues/703
I tried it with the vicuna 7b model and the inference is OK.
You can add a conversation template for inference.
I changed the model from llama2-7b to llama2-7b-chat. Now it works! Thank you!
However, when I try llama-2-70b-chat, it doesn't work anymore... (I run with tensor parallel = 2 and pipeline parallel = 2.)
Is there a problem with the conversion or with Megatron-LM?
The problem is that the numbers are never shown in the terminal:
Enter prompt: What is the answer of 1+1=?
Enter number of tokens to generate: 100
Megatron Response:
[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.
If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
<</SYS>>
What is the answer of 1+1=? [/INST] The answer to the equation is "The answer to the equation is two (or "the result is two").
The calculation works like this because when you add one plus one (or +), the result is two (or =).
Therefore the answer is two."
Is there anything else I can help you with ?
I find that there is something wrong with tensor parallelism in the conversion. Here are the experiments:
- TP=4, PP=1 ❌
- TP=2, PP=2 ❌
- TP=1, PP=4 ✅
So... how to fix it?
Finally, I figured out that I have to set `parallel_output=False` to make inference work with tensor parallel.
Thanks for your tests. `text_generation_server` on the main branch does not support mcore inference; maybe we can fix it:
https://github.com/NVIDIA/Megatron-LM/blob/ad53b1e38689a0ceed75ade7821f4e6c7554abb4/tools/run_text_generation_server.py#L22
I just use this model provider, setting `parallel_output=False`:
https://github.com/NVIDIA/Megatron-LM/blob/ad53b1e38689a0ceed75ade7821f4e6c7554abb4/pretrain_gpt.py#L54-L66
It works fine in `text_generation_server`.
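For reference, a minimal sketch of such a provider with gathered logits; it assumes the legacy `megatron.model.GPTModel` path and imports, which may need adjusting for your Megatron-LM revision (the mcore `GPTModel` additionally takes a transformer layer spec):

```python
# Sketch only: model provider that gathers logits across TP ranks for inference.
# Import paths and the GPTModel signature follow the legacy (non-mcore) code path
# and may differ in newer Megatron-LM revisions.
from megatron import get_args, print_rank_0
from megatron.arguments import core_transformer_config_from_args
from megatron.model import GPTModel


def model_provider(pre_process=True, post_process=True):
    """Build a GPT model whose output logits are gathered (not TP-sharded)."""
    print_rank_0('building GPT model ...')
    config = core_transformer_config_from_args(get_args())
    model = GPTModel(
        config,
        num_tokentypes=0,
        parallel_output=False,  # gather logits so sampling sees the full vocab on every TP rank
        pre_process=pre_process,
        post_process=post_process,
    )
    return model
```

With `parallel_output=True`, each tensor-parallel rank only holds its shard of the vocabulary logits, which is what the training loss expects but not what a generation loop that samples from full logits expects; gathering them is why this setting fixes inference.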
Moreover, I find that padding the vocab is necessary to finetune llama-70b with tensor parallel >= 4.
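For context, Megatron pads the vocabulary so it divides evenly across tensor-parallel ranks; a hedged sketch of the arithmetic (the helper name is illustrative, not the library API):

```python
# Illustrative helper mirroring Megatron-style vocab padding; not the library API.
def padded_vocab_size(orig_vocab_size: int,
                      make_vocab_size_divisible_by: int = 128,
                      tensor_model_parallel_size: int = 1) -> int:
    """Round the vocab size up so every TP rank gets an equal, aligned shard."""
    multiple = make_vocab_size_divisible_by * tensor_model_parallel_size
    return ((orig_vocab_size + multiple - 1) // multiple) * multiple


# Llama's 32000-token vocab with TP=4 and the default divisor of 128:
print(padded_vocab_size(32000, 128, 4))  # 32256
```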
@TING2938 Is there a tool available for converting checkpoints that include optimizer states while changing the tensor parallelism (TP) and pipeline parallelism (PP) configurations? Thank you.
Marking as stale. No activity in 60 days.
> @TING2938 Is there a tool available for converting checkpoints that include optimizer states while changing the tensor parallelism (TP) and pipeline parallelism (PP) configurations? Thank you.
The repo already includes tools to produce partitioned weights (TP/PP mode); you can include optimizer states when doing SFT/continued-training jobs.
Marking as stale. No activity in 60 days.
@TING2938 Thanks for the contributions! Since this PR was opened, checkpoint conversion tools have evolved significantly. Megatron-LM has the functionality you proposed around HF Llama conversion.
In addition, Megatron-Bridge is the primary location for this functionality and provides comprehensive bidirectional Hugging Face ↔ Megatron conversion.
If there's any functionality not covered by the existing checkpoint conversion tools, we'd welcome contributions to Megatron-Bridge going forward.