
Llama-3 Inference and Uploading to Huggingface

fabriceyhc opened this issue 1 year ago · 7 comments

I'm trying to fine-tune Llama-3-8B and 70B with LoRA on a custom drug-detection dataset and upload them to Hugging Face so that they fit nicely into an existing zero-shot evaluation pipeline. My current challenge lies in converting the different checkpoint formats: 8B uses FullModelMetaCheckpointer, which outputs meta_model_{i}.pt, whereas 70B uses FullModelHFCheckpointer, which outputs hf_model_{idx}_{i}.pt, where i is the epoch index (over 5 epochs).

Question 1: Is it safe to assume that we only need the checkpoint with the highest i index and can delete the intermediate ones? If so, it would be handy to have a config option to conserve disk space by keeping only the most recent checkpoint.

As discussed in #832, we need to convert the 8B model from Llama (Meta) format to HF format. To do this, I've had to move a lot of the contents from the original folder into the output_dir (e.g. tokenizer.model, etc.) and then run a script from transformers (here).

python src/transformers/models/llama/convert_llama_weights_to_hf.py \
--input_dir <checkpoint_dir> \
--llama_version 3 \
--model_size 8B \
--output_dir <hf_staging_dir> 

This script specifically looks for model weights in a file called consolidated.00.pth (L161), which is the original, untrained 8B Llama-3. It's not clear to me how to have it use the LoRA-merged meta_model_5.pt instead. When I tried to follow the instructions from the e2e example and just upload <checkpoint_dir> directly, HF errors out saying it can't find a file in a suitable format (e.g. pytorch_model.bin, etc.).
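
For reference, a rough sketch of the staging described above. The exact layout the script expects may vary by transformers version, and renaming the merged checkpoint to consolidated.00.pth is just one untested way of pointing the script at the fine-tuned weights instead of the base model; all paths are placeholders:

# Hypothetical staging layout -- adjust all paths to your setup.
mkdir -p <checkpoint_dir>
cp <original_llama3_dir>/params.json     <checkpoint_dir>/
cp <original_llama3_dir>/tokenizer.model <checkpoint_dir>/
# Put the LoRA-merged weights where the script expects the base checkpoint:
cp <torchtune_output_dir>/meta_model_5.pt <checkpoint_dir>/consolidated.00.pth

python src/transformers/models/llama/convert_llama_weights_to_hf.py \
    --input_dir <checkpoint_dir> \
    --llama_version 3 \
    --model_size 8B \
    --output_dir <hf_staging_dir>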

Question 2: How can we convert both the 8B and 70B versions of the LoRA fine-tuned Llama-3 models so that they are suitable for inference via HF?

fabriceyhc avatar May 03 '24 19:05 fabriceyhc

@fabriceyhc You have to change the file name in the script to

        loaded = [
            torch.load(os.path.join(input_base_path, f"hf_model_{i:04d}_0.pt"), map_location="cpu")
            for i in range(1, num_shards + 1)  # start from 1 and go up to num_shards
        ]

+1, because the naming convention starts at 0001 for the 70B checkpoints.

Hi @joecummings, could you please take a look at the script I've been working on? I made some changes to the file names, but now I'm encountering new errors when I try to convert the 70B checkpoint to HF weights; the conversion works fine for 8B, though. At the moment, the issue appears to be specific to torchtune. Once we fine-tune a model, it would be great to have a command-line option to convert the .pt weights to HF format and upload them to Hugging Face. From there, we can proceed with other tasks, since the entire ecosystem is built around Hugging Face.

https://gist.github.com/monk1337/925a5a44c431ed1f1d3927141f31b6d2

monk1337 avatar May 06 '24 07:05 monk1337

Once we fine-tune a model, it would be great to have a command-line option to convert the .pt weights to HF format and upload them to Hugging Face. From there, we can proceed with other tasks, since the entire ecosystem is built around Hugging Face.

totally agree that we need this!

optimass avatar May 10 '24 15:05 optimass

https://gist.github.com/monk1337/925a5a44c431ed1f1d3927141f31b6d2 I tried this with Llama-3-8B and got the following error:

File "/home/toolkit/ui-copilot/finetuning/utils/convert_llama_weights_to_hf2.py", line 447, in main
    write_model(
  File "/home/toolkit/ui-copilot/finetuning/utils/convert_llama_weights_to_hf2.py", line 195, in write_model
    f"model.layers.{layer_i}.input_layernorm.weight": loaded[0][
    KeyError: 'layers.0.attention_norm.weight'

optimass avatar May 10 '24 16:05 optimass

Hey! Sorry you're running into issues here. I didn't realize there are differences in how the 8B and 70B models are converted. Let me look into this in a bit.

kartikayk avatar May 10 '24 16:05 kartikayk

We have this function in the checkpointer, but it seems like it isn't getting the job done, so we need to figure out why that is.

kartikayk avatar May 10 '24 16:05 kartikayk

@fabriceyhc Some thoughts on the questions you asked above:

My current challenge lies in converting the different checkpoint formats: 8B uses FullModelMetaCheckpointer, which outputs meta_model_{i}.pt, whereas 70B uses FullModelHFCheckpointer, which outputs hf_model_{idx}_{i}.pt, where i is the epoch index (over 5 epochs).

The checkpointer used depends on the input checkpoint format. The 8B config makes use of the consolidated.00.pth file, which is in Meta format. But you can update the config to use the safetensors checkpoints with the HFCheckpointer instead. This should address the discrepancy in configs between 8B and 70B.
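
For example, the checkpointer section of the 8B config could be pointed at the HF safetensors download. The fields below follow the stock torchtune Llama-3 configs, but the local paths and shard file names are assumptions based on the standard 4-shard Meta-Llama-3-8B repo, so adjust them to your layout:

checkpointer:
  _component_: torchtune.utils.FullModelHFCheckpointer
  checkpoint_dir: /tmp/Meta-Llama-3-8B/
  checkpoint_files: [
    model-00001-of-00004.safetensors,
    model-00002-of-00004.safetensors,
    model-00003-of-00004.safetensors,
    model-00004-of-00004.safetensors
  ]
  recipe_checkpoint: null
  output_dir: /tmp/Meta-Llama-3-8B/
  model_type: LLAMA3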

Is it safe to assume that we only need the checkpoint with the highest i index and can delete the intermediate ones? If so, it would be handy to have a config option to conserve disk space by keeping only the most recent checkpoint.

Yes, this is the right understanding. Adding this flag has been on our TODO list for a while now. If you'd be open to contributing this as a PR, I'd be happy to collaborate with you on the review.

How can we convert both the 8B and 70B versions of the LoRA fine-tuned Llama-3 models so that they are suitable for inference via HF?

As I commented above, the FullModelHFCheckpointer does this conversion for you. But it seems like you're still running into issues?

kartikayk avatar May 10 '24 16:05 kartikayk

As I commented above, the FullModelHFCheckpointer does this conversion for you. But it seems like you're still running into issues?

Yes, it's still unclear to me how to plug the FullModelHFCheckpointer's outputs into HF's APIs, in particular Text Generation Inference (TGI).

optimass avatar May 10 '24 17:05 optimass

@kartikayk I'm having this same issue, but with a full fine-tuned checkpoint. I can't go back and re-train the model with a new checkpointer (I used Meta's checkpointer, as it was the default in the configs); it would be very costly for me. Now I have a .pt file I can't seem to use. Any suggestions on what I have to do here?

I need it in the safetensors format. I used the converter provided by Hugging Face, but it only supports .pth, so I changed that part in the script. It converts, but the performance seems to be a lot worse than using the .pt directly with tune run generate.

Anyway, I need your help here. Thanks.

Respaired avatar Jun 10 '24 16:06 Respaired

Thanks @SoshyHayami for providing additional details here, @joecummings is gonna take a look at this

ebsmothers avatar Jun 10 '24 16:06 ebsmothers

Hey @SoshyHayami, I figured out a workaround based on #878. Inside your training folder, you should have a file called model.safetensors.index.json. All you need to do is edit the weight_map values to point to your .pt files; Hugging Face will still be able to read them. I've confirmed the upload to HF and can download the model onto other servers just like any other, without issues.

Here's what my file looks like after manually editing it: model.safetensors.index.json

Since I fine-tuned my model for 16 epochs, my idx is 15 (e.g. hf_model_0030_15.pt). You'll just have to change yours to whatever the index is on your end.
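
A small sketch of that edit. The paths are placeholders, and it assumes torchtune's output shards preserve the original parameter-to-shard assignment (i.e. hf_model_0001_* corresponds to model-00001-of-*), so double-check against your own index file:

import json

index_path = "/path/to/output_dir/model.safetensors.index.json"  # placeholder path
epoch = 15  # final epoch index in my case; change to yours

with open(index_path) as f:
    index = json.load(f)

def to_torchtune_name(shard_name: str) -> str:
    # "model-00001-of-00030.safetensors" -> "hf_model_0001_15.pt"
    shard_id = int(shard_name.split("-")[1])
    return f"hf_model_{shard_id:04d}_{epoch}.pt"

# Repoint every parameter at the torchtune-produced .pt shard.
index["weight_map"] = {param: to_torchtune_name(shard) for param, shard in index["weight_map"].items()}

with open(index_path, "w") as f:
    json.dump(index, f, indent=2)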

fabriceyhc avatar Jun 10 '24 17:06 fabriceyhc

@fabriceyhc Thanks for the tip. torchtune uses the original Llama checkpoint to initiate a training session, and I didn't use the HF checkpointer to save the model, so I'm not sure if it'll work.

Respaired avatar Jun 10 '24 18:06 Respaired

@kartikayk I'm having this same issue, but with a full fine-tuned checkpoint. I can't go back and re-train the model with a new checkpointer (I used Meta's checkpointer, as it was the default in the configs); it would be very costly for me. Now I have a .pt file I can't seem to use. Any suggestions on what I have to do here?

I need it in the safetensors format. I used the converter provided by Hugging Face, but it only supports .pth, so I changed that part in the script. It converts, but the performance seems to be a lot worse than using the .pt directly with tune run generate.

Anyway, I need your help here. Thanks.

Can you confirm that you have a single output file for your fine-tuned model, named something like meta_model_X.pt?

Also, what convert script did you use exactly to convert to safetensors? I see this one but it would need to be modified to use a local file instead of pulling from the HF Hub.

Lastly, what do you mean by "performance seems to be a lot worse"? What are you using to evaluate performance in this case?
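
For what it's worth, here is a minimal local-file sketch of the .pt to .safetensors re-serialization step; the file names are placeholders, and it assumes the state dict is already in HF key naming (which is exactly what's in question for a Meta-format meta_model_X.pt):

import torch
from safetensors.torch import save_file

# Placeholder paths; this only re-serializes the weights, it does not rename
# Meta-format keys (layers.0.attention.wq.weight, ...) to HF-format keys.
state_dict = torch.load("/path/to/hf_model_0001_0.pt", map_location="cpu")
state_dict = {k: v.contiguous() for k, v in state_dict.items()}  # safetensors requires contiguous tensors
save_file(state_dict, "/path/to/model-00001-of-00004.safetensors", metadata={"format": "pt"})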

joecummings avatar Jun 10 '24 18:06 joecummings

Can you confirm that you have a single output file for your fine-tuned model, named something like meta_model_X.pt?

Also, what convert script did you use exactly to convert to safetensors? I see this one but it would need to be modified to use a local file instead of pulling from the HF Hub.

Lastly, what do you mean by "performance seems to be a lot worse"? What are you using to evaluate performance in this case?

1- Yes: meta_model_1.pt.

2- No, I used the one from the transformers repo, the one the OP mentioned using in his first post. It supports Llama 3 conversion, but only with the .pth format. I just hard-coded the line of code where the model is loaded to use my own checkpoint's absolute path.

3- Just eyeballing it, I'm not particularly sure about that, but it does seem so. There's a lot of repetition, the model hallucinates really badly even in English, and Llama's prompt template can be seen in the output. There's less of that when directly using tune generate. But regardless of whether I'm right or wrong on this third point, I need a sure way to convert this .pt to .safetensors.

Respaired avatar Jun 10 '24 19:06 Respaired

3- Just eyeballing it, I'm not particularly sure about that, but it does seem so. There's a lot of repetition, the model hallucinates really badly even in English, and Llama's prompt template can be seen in the output. There's less of that when directly using tune generate. But regardless of whether I'm right or wrong on this third point, I need a sure way to convert this .pt to .safetensors.

I have a similar experience with this issue: I'm using the exact same prompt, but the output from the .safetensors conversion is really off.

JonasQN avatar Jun 10 '24 21:06 JonasQN

I have a similar experience with this issue: I'm using the exact same prompt, but the output from the .safetensors conversion is really off.

It's either the conversion or there's something wrong with the training process (not necessarily with torchtune, but rather the data or template, using the instruct model instead of the base model, etc.). I no longer have the time or the compute to test it, but if you can, try axolotl and see if that gives you better results; if you do, I'd appreciate it if you let me know.

Respaired avatar Jun 10 '24 22:06 Respaired

I have a similar experience with this issue: I'm using the exact same prompt, but the output from the .safetensors conversion is really off.

Can you expand on "really off"? Also, while I'm investigating this issue, I'm using the following code snippet to check the converted safetensors files on the HF from_pretrained side:

from transformers import LlamaForCausalLM, PreTrainedTokenizerFast
from torchtune.utils import set_seed
import torch

set_seed(1234)

with torch.device("cuda"):
    model = LlamaForCausalLM.from_pretrained("/path/to/converted/model")
    tokenizer = PreTrainedTokenizerFast.from_pretrained("/path/to/converted/model")

    tokens = tokenizer("Tell me a joke", return_tensors="pt")
    outputs = model.generate(**tokens, top_k=300, max_new_tokens=300, do_sample=True, temperature=0.6)

    tokenizer.batch_decode(outputs, skip_special_tokens=True)

If this does not resemble your workflow, please let me know.

joecummings avatar Jun 11 '24 14:06 joecummings

Can you expand on "really off"? [...] If this does not resemble your workflow, please let me know.

I'm using similar code. The result I got for my prompt was just my prompt repeated back, and then it started generating this (I don't have any Russian data in my fine-tuning dataset): прикладприкладприкладприклад... Whenever I use torchtune's generation script, it works completely fine.

JonasQN avatar Jun 11 '24 19:06 JonasQN

It's either the conversion or there's something wrong with the training process (not necessarily with torchtune, but rather the data or template, using the instruct model instead of the base model, etc.). I no longer have the time or the compute to test it, but if you can, try axolotl and see if that gives you better results; if you do, I'd appreciate it if you let me know.

I was able to generate outputs for my test data by modifying the generate.py from the recipes folder; maybe you could try doing that too.

JonasQN avatar Jun 11 '24 20:06 JonasQN

@joecummings @ebsmothers In my PR, I added the option to save the model checkpoints directly in safetensors format. This allows models fine-tuned with torchtune to be used directly with various Hugging Face libraries like Text Generation Inference (TGI), and also improves overall compatibility with Hugging Face, removing some of the problems mentioned above. Hope it helps! Let me know if you have any considerations regarding this. Thank you!
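
If that lands, usage would presumably look something like the sketch below; the flag name and its placement in the checkpointer config are assumptions until the PR is merged, and the paths and shard names are placeholders:

checkpointer:
  _component_: torchtune.utils.FullModelHFCheckpointer
  checkpoint_dir: /tmp/Meta-Llama-3-8B/
  checkpoint_files: [
    model-00001-of-00004.safetensors,
    model-00002-of-00004.safetensors,
    model-00003-of-00004.safetensors,
    model-00004-of-00004.safetensors
  ]
  output_dir: /tmp/Meta-Llama-3-8B/
  model_type: LLAMA3
  safe_serialization: True  # hypothetical flag added by the PR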

jeffrey-fong avatar Jun 20 '24 00:06 jeffrey-fong