torchtune
How can I find all the checkpoints and merge them manually? (LoRA)
Great job guys for this awesome tool. I have just started using this and am loving it already. I have one question: I am fine-tuning for 6 epochs and want to store each checkpoint separately. Later I would like to evaluate each checkpoint; how can I do that?
Hi @monk1337 thanks for the issue! Glad to hear you're finding the library useful. To clarify, are you interested in storing just the LoRA weights from the end of each epoch so that you can compare evaluations across different epochs?
To give a bit more info: we output two checkpoints at the end of each epoch to your output directory. For epoch i these would be adapter_i.pt and {prefix}_model_i.pt (the value of prefix depends on which checkpoint format you're loading in). adapter_i.pt is a smaller checkpoint containing only the LoRA weights, while {prefix}_model_i.pt contains the LoRA weights merged back into the original model. So if you want to evaluate how your fine-tuned checkpoints are doing after each epoch, you can use the latter, as it will contain updated versions of the original model's params based on your fine-tune.
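To make the per-epoch layout concrete, here is a small sketch that groups an output directory's checkpoints by epoch so you can iterate over them for evaluation. The directory contents and the `hf_` prefix are assumptions for illustration, matching the naming scheme described above:

```python
import re
import tempfile
from pathlib import Path

# Stand-in output directory; the filenames mimic the scheme described above:
# adapter_<epoch>.pt (LoRA-only) and <prefix>_model_<epoch>.pt (merged).
out_dir = Path(tempfile.mkdtemp())
for name in ["adapter_0.pt", "adapter_1.pt", "hf_model_0.pt", "hf_model_1.pt"]:
    (out_dir / name).touch()

def checkpoints_by_epoch(output_dir: Path) -> dict[int, dict[str, Path]]:
    """Group each epoch's adapter-only and merged checkpoints together."""
    epochs: dict[int, dict[str, Path]] = {}
    for f in output_dir.glob("*.pt"):
        m = re.fullmatch(r"(?:adapter|(?P<prefix>.+)_model)_(?P<epoch>\d+)\.pt", f.name)
        if m is None:
            continue
        kind = "adapter" if f.name.startswith("adapter_") else "merged"
        epochs.setdefault(int(m.group("epoch")), {})[kind] = f
    return epochs

ckpts = checkpoints_by_epoch(out_dir)
print(sorted(ckpts))            # [0, 1]
print(ckpts[1]["merged"].name)  # hf_model_1.pt
```

For per-epoch evaluation you'd then point your eval tooling at `ckpts[i]["merged"]`, since that file is a full, standalone model.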
For evaluation, we also have an integration with EleutherAI's eval harness so you are welcome to use that if you like. If you want more details on how to do this you can check out this section of our end-to-end tutorial.
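If you go the eval-harness route, the gist is pointing the eval recipe's checkpointer at the merged checkpoint for the epoch you want. A hypothetical config override is sketched below; the field names and file name are assumptions, so confirm them against the Eleuther evaluation config shipped with torchtune (and the tutorial linked above) before running:

```yaml
# Hypothetical overrides for torchtune's Eleuther eval config; all names
# below are assumptions -- confirm them against your local config file.
checkpointer:
  checkpoint_dir: /path/to/output_dir
  # evaluate the merged checkpoint from epoch 2
  checkpoint_files: ["{prefix}_model_2.pt"]
```

Re-running the eval recipe once per epoch's merged checkpoint gives you the cross-epoch comparison you're after.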
Let me know if this makes sense or if there's something else you're looking for here, happy to address any follow-ups you may have.
@ebsmothers Thank you for your detailed reply. I have one follow-up question: how can I convert this merged-model folder, which contains multiple {prefix}_model_i.pt files, to Hugging Face format and upload it? I am trying to use the native lm harness, but it's giving errors due to the .pt format.
- How to convert the `.pt` format of torchtune to HF so I can use other tools easily
- How to use distributed GPU during inference time of the lm harness using torchtune?
Hi @monk1337 this is a good question. The checkpoints we output should generally adhere to the same format as the inputs (i.e. the logic for distributing weights across files should line up exactly), so in this case the output should still match the HF format.
The main difference would be usage of safetensors (as you pointed out, we do write out to .pt format). Can you share a stack trace on (1) so I can see where it's coming from? I'll need to figure out if the issue is that we aren't writing out to safetensors format, or if there's something else happening here.
@monk1337, I am closing this since it has been inactive for 2 months. Please feel free to reopen it if you still have questions! Thanks :)