[Model] LoRA with lm_head fully trained
FIX #4186 #2816
Support lm_head and embed_tokens fully trained in LoRA.
We found that the quality of our adapters drops significantly without either a fully trained lm_head or an lm_head trained LoRA-style. This corresponds to PEFT's modules_to_save=["lm_head", "embed_tokens"] functionality: https://huggingface.co/docs/peft/v0.12.0/en/package_reference/#peft.LoraConfig.modules_to_save
The idea is to replace the base model's VocabParallelEmbedding and ParallelLMHead with the layers loaded from modules_to_save when running LoRA inference.
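For context, this is roughly what the training side looks like with PEFT (the base model name, hyperparameters, and paths below are placeholders, not taken from this PR): listing lm_head and embed_tokens in modules_to_save makes the saved adapter contain full copies of those layers alongside the LoRA matrices, which is the kind of adapter this PR teaches vLLM to load. A corresponding inference-side sketch follows the checklist below.

```python
# Sketch of producing such an adapter with PEFT; model name and hyperparameters
# are illustrative placeholders only.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    modules_to_save=["lm_head", "embed_tokens"],  # fully trained, not LoRA-decomposed
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
# ... train as usual ...
model.save_pretrained("my_adapter")  # adapter dir now also stores full lm_head/embed_tokens weights
```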
- [x] dirty implementation
- [x] tests for new functionality
- [ ] checking old functionality is working
- [x] inference with fully trained lm_head performance measurement
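As a rough illustration of the loading path described above (model name, adapter name, and path are placeholder assumptions, and this relies on the functionality added in this PR), such an adapter would be passed to vLLM like any other LoRA:

```python
# Hypothetical usage sketch: model, adapter name, and path are placeholders.
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_lora=True)
outputs = llm.generate(
    ["Hello, my name is"],
    SamplingParams(max_tokens=32),
    # The adapter at this path also carries fully trained lm_head/embed_tokens,
    # which get swapped in for ParallelLMHead/VocabParallelEmbedding.
    lora_request=LoRARequest("my_adapter", 1, "/path/to/my_adapter"),
)
print(outputs[0].outputs[0].text)
```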
👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which consists of a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of the default ones by unblocking the steps in your fast-check build in the Buildkite UI.
Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge).
To run full CI, you can do one of these:
- Comment `/ready` on the PR
- Add the `ready` label to the PR
- Enable auto-merge.
🚀
/ready
Should it be unmarked as Draft?
This pull request has merge conflicts that must be resolved before it can be merged. @sergeykochetkov please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork
/ready
> Should it be unmarked as Draft?

Yes, I am waiting for review.
This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @sergeykochetkov.
https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork
Great work! Some CI jobs show a namespace inconsistency for the newly added symbols. I think it is time to fix that and merge once those CI jobs pass. @youkaichao
Just wanted to throw out that this is something I am looking forward to.
I am attempting to use Qwen/Qwen2.5-14B as a base model and load two LoRAs with the OpenAI API. One of the LoRAs is just the Instruct model extracted as a LoRA from the base. The other is a fine-tune I did off the base: I used MergeKit to do a TIES merge of it with the base and Instruct models, and then extracted an adapter from that merge.
This worked great when I was testing with HF Transformers, but I was surprised to get errors when trying to use these adapters with vLLM.
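For reference, the request flow being described looks roughly like this (the server command, adapter names, and paths below are placeholder assumptions): the adapters are registered when starting vLLM's OpenAI-compatible server and then selected per request via the model field.

```python
# Hypothetical sketch, assuming the server was started along the lines of:
#   vllm serve Qwen/Qwen2.5-14B --enable-lora \
#       --lora-modules instruct=/path/to/instruct_lora merged=/path/to/merged_lora
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

for adapter in ("instruct", "merged"):  # names registered via --lora-modules
    completion = client.completions.create(
        model=adapter,  # select the LoRA adapter by its registered name
        prompt="Write a haiku about autumn.",
        max_tokens=32,
    )
    print(adapter, "->", completion.choices[0].text.strip())
```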
PR is recreated here
This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!
This pull request has been automatically closed due to inactivity. Please feel free to reopen if you intend to continue working on it. Thank you!