
adding model builders for code-llama2 7b, 13b, and 70b

SalmanMohammadi opened this pull request 10 months ago • 9 comments

Context

What is the purpose of this PR? Is it to

  • [x] add a new feature
  • [ ] fix a bug
  • [ ] update tests and/or documentation
  • [ ] other (please add here)

See https://github.com/pytorch/torchtune/issues/826

Changelog

Added model builders for Code-Llama2 7B, 13B, and 70B, based on the base Llama2 params but with Code-Llama2's extended vocab size and sequence length. Also capitalised the 'b' in the model size suffixes (e.g. '7b' → '7B') throughout torchtune/models/llama2/_model_builders.py.
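
For reviewers: each builder is just the existing Llama2 component builder invoked with Code-Llama2's hyperparameters. A rough sketch of the shape of the 7B builder (the component-builder signature is assumed from the existing llama2_7b builder, and the exact vocab size / sequence length should be checked against the Code-Llama2 reference config, so treat the numbers as placeholders):

from torchtune.models.llama2._component_builders import llama2
from torchtune.modules import TransformerDecoder


def code_llama2_7b() -> TransformerDecoder:
    """Code-Llama2 7B: same architecture as Llama2 7B, but with the
    extended vocab and longer max sequence length used by Code-Llama2."""
    return llama2(
        vocab_size=32_016,   # base Llama2 uses 32_000; Code-Llama2 extends the vocab
        num_layers=32,
        num_heads=32,
        num_kv_heads=32,
        embed_dim=4096,
        max_seq_len=16_384,  # Code-Llama2 is trained on longer sequences
        attn_dropout=0.0,
        norm_eps=1e-5,
    )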

Test plan

Please make sure to do each of the following if applicable to your PR. (If you're not sure about any one of these just ask and we will happily help.)

  • [x] run pre-commit hooks and linters (make sure you've first installed via pre-commit install)
  • [ ] add unit tests for any new functionality
  • [x] update docstrings for any new or updated methods or classes
  • [x] run unit tests via pytest tests
  • [ ] run recipe tests via pytest tests -m integration_test
  • [x] manually run any new or modified recipes with sufficient proof of correctness
  • [x] include relevant commands and any other artifacts in this summary (pastes of loss curves, eval results, etc.)

SalmanMohammadi avatar Apr 23 '24 20:04 SalmanMohammadi

:link: Helpful Links

:test_tube: See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/847

Note: Links to docs will display an error until the docs builds have been completed.

:white_check_mark: No Failures

As of commit 98ef8b8d734bca8512b66c3526675f8d2ea6ccea with merge base bec7babec9c924a0ee7ad27e3f6582bc5bd1fef5: :green_heart: Looks good so far! There are no failures yet. :green_heart:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-bot[bot] avatar Apr 23 '24 20:04 pytorch-bot[bot]

Full fine tuning using the low memory config runs fine in colab. See the wandb run here. I'll let it run for ~30 minutes for now, unless you need information from later in training.

I can't test the LoRA fine-tuning since it hasn't been implemented yet. I can try giving full fine-tuning (without low_memory) a go if I can fit it on the GPU there.

One thing that springs to mind for testing, @rohan-varma, would be ensuring that specific reference weights load correctly for models like these (e.g. codellama/CodeLlama-7b-Instruct-hf). In my example, the weights were in a slightly unexpected format for tune run and I manually specified the checkpoints.

A quick note: bitsandbytes wasn't installed by default, but it's needed to run with the low_memory config. I saw in a previous issue that you were debating including it in the requirements.

SalmanMohammadi avatar Apr 23 '24 22:04 SalmanMohammadi

@SalmanMohammadi I can't open the colab notebook you've shared (the error says "Ask the notebook's author to reshare the notebook with download permissions enabled and try loading it again"), do you mind checking that on your end? Other than that, great to see that we're able to run on colab!

rohan-varma avatar Apr 23 '24 22:04 rohan-varma

Try now? The wandb link should work too. It was pretty straightforward! Unfortunately, none of the models can fit on the free GPU since bf16 isn't supported there, but otherwise super neat.

SalmanMohammadi avatar Apr 23 '24 22:04 SalmanMohammadi

Lots of things to reply to here 😃

what's our overall approach for continuing to guarantee correctness here?

@rohan-varma this is a good question. In this case my mental model is that this is just a variant of an existing model (Llama2) and so the individual components should already be well-tested. Some E2E test is still helpful to verify that (a) checkpoints load correctly and (b) no regressions due to other changes (e.g. I think the tokenizer has a slightly different vocab size). Admittedly this is subjective and depends on the level of granularity we define models at though (e.g. most of our models are instances of TransformerDecoder but I don't think it's sufficient to just have one test for that class and claim all new models we add are covered).
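
For concreteness, a lightweight structural check on the builder (separate from actually loading reference checkpoints) could look something like the sketch below. This assumes the builder is exposed as torchtune.models.llama2.code_llama2_7b and that the extended vocab size is 32,016; the attribute names would need double-checking against TransformerDecoder:

import torch

from torchtune.models.llama2 import code_llama2_7b


def test_code_llama2_7b_vocab_size():
    # Build on the meta device so the test doesn't materialize 7B params.
    with torch.device("meta"):
        model = code_llama2_7b()
    # Code-Llama2 extends the base Llama2 vocab (32_000 -> 32_016), so both
    # the token embedding and the output projection should reflect that.
    assert model.tok_embeddings.num_embeddings == 32_016
    assert model.output.out_features == 32_016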

I can't test the LoRA fine-tuning since it hasn't been implemented yet.

@SalmanMohammadi what does this mean? Can't you just plug the code-llama2 models into our existing LoRA recipes? (Apologies if I'm missing something obvious here though)

In my example, the weights were in a slightly unexpected format for tune run and I manually specified the checkpoints.

Can you elaborate on this? Did you need to make any changes to the checkpoints themselves or just the file paths? (Feel free to just paste your CLI command or config file if that's easiest)

Unfortunately, none of the models can fit in the free GPU since bf16 isn't supported on the free GPU - but otherwise super neat.

Actually we can do QLoRA for Llama2-7B now in the free tier (though we do still OOM on checkpoint save). We also have smaller Gemma models; I haven't tested them myself, but I think they should be OK in fp32 on the free tier too.

ebsmothers avatar Apr 23 '24 22:04 ebsmothers

Can't you just plug the code-llama2 models into our existing LoRA recipes? (Apologies if I'm missing something obvious here though)

Sorry, by "not implemented" I just mean that the QLoRA and LoRA recipes use the llama2.lora_llama2_ and llama2.qlora_llama2_ model builders, and I need to create llama2.lora_code_llama2_ etc. I'll add that now and run it on colab.
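
Each new LoRA builder should just mirror the existing lora_llama2_7b, pointed at Code-Llama2's hyperparameters. A rough sketch of what I have in mind for the 7B one (the signature and values are assumed from the existing builders and the Code-Llama2 config, so treat them as placeholders):

from typing import List

from torchtune.models.llama2._component_builders import lora_llama2
from torchtune.modules import TransformerDecoder


def lora_code_llama2_7b(
    lora_attn_modules: List[str],  # in the repo this is the LORA_ATTN_MODULES literal type
    apply_lora_to_mlp: bool = False,
    apply_lora_to_output: bool = False,
    lora_rank: int = 8,
    lora_alpha: float = 16,
    quantize_base: bool = False,
) -> TransformerDecoder:
    """Code-Llama2 7B with LoRA applied to the requested attention modules."""
    return lora_llama2(
        lora_attn_modules=lora_attn_modules,
        apply_lora_to_mlp=apply_lora_to_mlp,
        apply_lora_to_output=apply_lora_to_output,
        vocab_size=32_016,
        num_layers=32,
        num_heads=32,
        num_kv_heads=32,
        embed_dim=4096,
        max_seq_len=16_384,
        attn_dropout=0.0,
        norm_eps=1e-5,
        lora_rank=lora_rank,
        lora_alpha=lora_alpha,
        quantize_base=quantize_base,
    )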

Can you elaborate on this? Did you need to make any changes to the checkpoints themselves or just the file paths? (Feel free to just paste your CLI command or config file if that's easiest)

Just the files! My CLI command was:

!tune run full_finetune_single_device \
--config llama2/7B_full_low_memory \
checkpointer.checkpoint_dir=/tmp/CodeLlama-7b-Instruct-hf \
checkpointer.checkpoint_files=['pytorch_model-00001-of-00003.bin','pytorch_model-00002-of-00003.bin','pytorch_model-00003-of-00003.bin'] \
tokenizer.path=/tmp/CodeLlama-7b-Instruct-hf/tokenizer.model \
metric_logger=torchtune.utils.metric_logging.WandBLogger \
metric_logger.project=torchtune_codellama_testing \
model=torchtune.models.llama2.code_llama2_7b

This is because the model checkpoints I'd grabbed were in this format:

!ls /tmp/CodeLlama-7b-Instruct-hf
...
pytorch_model-00001-of-00003.bin
pytorch_model-00002-of-00003.bin
pytorch_model-00003-of-00003.bin

But tune was expecting pytorch_model-00001-of-00002.bin for the first.

SalmanMohammadi avatar Apr 24 '24 09:04 SalmanMohammadi

I've added lora_ and qlora_ code_llama2_{}b models, and also added a qlora_llama2_70b while I was at it. torchtune/models/llama2/_model_builders.py is getting pretty chunky. Do you care about this? Would you want to move the lora model builders or the code_llama2 builders into a separate file?
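
(For reference, the qlora_ variants add very little to the file, since they're just the lora_ builders with quantize_base=True pre-set, roughly along these lines; the exact module path is an assumption and depends on where the lora builders end up living:)

from functools import partial

# Assuming lora_code_llama2_7b lives alongside the other llama2 builders;
# the QLoRA variant just pins quantize_base=True so the frozen base weights
# are quantized during fine-tuning.
from torchtune.models.llama2._model_builders import lora_code_llama2_7b

qlora_code_llama2_7b = partial(lora_code_llama2_7b, quantize_base=True)
qlora_code_llama2_7b.__doc__ = (
    "Builder for Code-Llama2 7B with QLoRA: identical to lora_code_llama2_7b "
    "but with the base model weights quantized to save memory."
)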

I've completed the following training tests using my colab above, and added some memory usage info for reference. You can see all the runs here:

  • code_llama2_7b with full_finetune_single_device with _low_memory - wandb run. Peak memory usage 14.5GB
  • lora_code_llama2_7b with lora_finetune_single_device - wandb run. Peak memory usage 14.1GB.
  • qlora_code_llama2_13b with lora_finetune_single_device - wandb run. Peak memory usage 9.4GB.

Let me know if there's anything else I can do :)

SalmanMohammadi avatar Apr 24 '24 13:04 SalmanMohammadi

Thanks so much for the kind feedback @kartikayk :) I've always wanted to contribute to the pytorch ecosystem - it's really nice to get the opportunity to work with such a welcoming open-source community.

Sorry for so many commits. Lots of different components, lots of docs, and I couldn't test locally.

I've updated with the refactor. I also added recipe configs and registered them in torchtune/_recipe_registry.py for ease of discovery. Confirmed tests/test_import_recipes.py runs OK. I took the liberty of updating the README.md model support table too, I hope you don't mind :) Hopefully this helps people get started fine-tuning Code-Llama2 models quickly. tune ls now outputs:

RECIPE                                   CONFIG                                  
full_finetune_single_device              ...               
                                         code_llama2/7B_full_low_memory          
                                         ...                                     
lora_finetune_single_device              ...           
                                         code_llama2/7B_lora_single_device       
                                         code_llama2/7B_qlora_single_device      
                                         ...          
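
(In case it's useful for review, the registry change is just new Config entries under the existing Recipe entries in torchtune/_recipe_registry.py, roughly as sketched below; the Recipe/Config field names are assumed from the dataclasses already in that file:)

from torchtune._recipe_registry import Config, Recipe

# Hypothetical excerpt: the new configs slot into the existing
# lora_finetune_single_device entry so `tune ls` can discover them.
lora_single_device = Recipe(
    name="lora_finetune_single_device",
    file_path="lora_finetune_single_device.py",
    configs=[
        # ... existing llama2 / mistral / gemma configs ...
        Config(
            name="code_llama2/7B_lora_single_device",
            file_path="code_llama2/7B_lora_single_device.yaml",
        ),
        Config(
            name="code_llama2/7B_qlora_single_device",
            file_path="code_llama2/7B_qlora_single_device.yaml",
        ),
    ],
    supports_distributed=False,
)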

I've confirmed all the recipes I added work nicely on my colab without any additional config specifications:

tune download codellama/CodeLlama-7b-hf --output-dir /tmp/CodeLlama-7b-hf
tune run full_finetune_single_device --config code_llama2/7B_full_low_memory 
tune run lora_finetune_single_device --config code_llama2/7B_lora_single_device
tune run lora_finetune_single_device --config code_llama2/7B_qlora_single_device

Note: My initial colab tests in https://github.com/pytorch/torchtune/pull/847#issuecomment-2074894947 were for Code-Llama2-7b-Instruct, but I've generalized the recipes to just Code-Llama2-7b. It's hopefully trivial for users to use the instruct models instead.

SalmanMohammadi avatar Apr 25 '24 12:04 SalmanMohammadi

Sorry for so many commits. Lots of different components, lots of docs, and I couldn't test locally.

@SalmanMohammadi I'm curious about this comment. Is this just due to particulars of your dev setup? Mainly I am wondering if there's anything we can be doing on our end to make contribution smoother (whether it be ease of testing, clearer documentation, anything like that). If you have any feedback on this front do let me know!

ebsmothers avatar Apr 25 '24 17:04 ebsmothers

I updated the docs, and I've just taken out the QLoRA 70B models since they're out of scope ATM, particularly for this PR.

Is this just due to particulars of your dev setup?

I think it's partly my current dev setup being unable to test full-scale model trainings right now, partly learning the codebase by iterating on "trying something out on colab to see what breaks", and partly that I could have been a bit more careful updating each of the docs and recipes. So mostly on my side!

One thing I was thinking of was adding a rough workflow for common contributions for which there's a high standard. In the example of https://github.com/pytorch/torchtune/pull/840, I've been thinking of updating tests/torchtune/models/llama2/scripts/README.md to add some of your comments from https://github.com/pytorch/torchtune/pull/840#discussion_r1579801172 (or other useful insights you provide when we start writing mistral tests).

(Note: the commit message should read "removing qlora 70b code_llama2 and llama2 models". Some weird formatting issue.)

SalmanMohammadi avatar Apr 26 '24 14:04 SalmanMohammadi

OK just a few more small comments. Home stretch here! After those are addressed I think this is good to merge

ebsmothers avatar Apr 26 '24 17:04 ebsmothers

OK just a few more small comments. Home stretch here! After those are addressed I think this is good to merge

Hopefully all done! Thanks for your patience :)

SalmanMohammadi avatar Apr 26 '24 18:04 SalmanMohammadi

OK just a few more small comments. Home stretch here! After those are addressed I think this is good to merge

Hopefully all done! Thanks for your patience :)

Great! Just kicked off one more CI run now, once that is green I think this is good to merge.

ebsmothers avatar Apr 26 '24 19:04 ebsmothers