[Model] LoRA with lm_head fully trained
FIX #4186 #2816
Support lm_head and embed_tokens fully trained in LoRA.
We found that the quality of our adapters drops significantly without either a fully trained lm_head or an lm_head trained LoRA-style. This corresponds to PEFT's modules_to_save=["lm_head", "embed_tokens"] functionality: https://huggingface.co/docs/peft/v0.12.0/en/package_reference/#peft.LoraConfig.modules_to_save
The idea is to replace the base model's VocabParallelEmbedding and ParallelLMHead with the layers loaded from modules_to_save when running LoRA inference.
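For context, this is roughly what the training side looks like with PEFT (the base model name, hyperparameters, and paths below are placeholders, not taken from this PR): listing lm_head and embed_tokens in modules_to_save makes the saved adapter contain full copies of those layers alongside the LoRA matrices, which is the kind of adapter this PR teaches vLLM to load. A corresponding inference-side sketch follows the checklist below.

```python
# Sketch of producing such an adapter with PEFT; model name and hyperparameters
# are illustrative placeholders only.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    modules_to_save=["lm_head", "embed_tokens"],  # fully trained, not LoRA-decomposed
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
# ... train as usual ...
model.save_pretrained("my_adapter")  # adapter dir now also stores full lm_head/embed_tokens weights
```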
- [x] dirty implementation
- [x] tests for new functionality
- [ ] checking old functionality is working
- [x] inference with fully trained lm_head performance measurement
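As a rough illustration of the loading path described above (model name, adapter name, and path are placeholder assumptions, and this relies on the functionality added in this PR), such an adapter would be passed to vLLM like any other LoRA:

```python
# Hypothetical usage sketch: model, adapter name, and path are placeholders.
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_lora=True)
outputs = llm.generate(
    ["Hello, my name is"],
    SamplingParams(max_tokens=32),
    # The adapter at this path also carries fully trained lm_head/embed_tokens,
    # which get swapped in for ParallelLMHead/VocabParallelEmbedding.
    lora_request=LoRARequest("my_adapter", 1, "/path/to/my_adapter"),
)
print(outputs[0].outputs[0].text)
```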
👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which consists of a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of the default ones by unblocking the steps in your fast-check build in the Buildkite UI.
Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge).
To run full CI, you can do one of these:
- Comment `/ready` on the PR
- Add the `ready` label to the PR
- Enable auto-merge.
🚀
/ready
Should it be unmarked as Draft?
This pull request has merge conflicts that must be resolved before it can be merged. @sergeykochetkov please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork
/ready
> Should it be unmarked as Draft?

Yes, I am waiting for review.
This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @sergeykochetkov.
https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork
Great work! Some CI jobs show a namespace inconsistency for the newly added symbols. I think it is time to fix that and merge once those CI jobs pass. @youkaichao
Just wanted to throw out that this is something I am looking forward to.
I am attempting to use Qwen/Qwen2.5-14B as a base model and load two LoRAs with the OpenAI API. One of the LoRAs is just the Instruct model extracted as a LoRA from the base. The other is a fine-tune I did off the base: I used MergeKit to do a TIES merge of it with the base and Instruct models, and then extracted an adapter from that merge.
This worked great when I was testing with HF Transformers, but I was surprised to get errors when trying to use these adapters with vLLM.
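For reference, the request flow being described looks roughly like this (the server command, adapter names, and paths below are placeholder assumptions): the adapters are registered when starting vLLM's OpenAI-compatible server and then selected per request via the model field.

```python
# Hypothetical sketch, assuming the server was started along the lines of:
#   vllm serve Qwen/Qwen2.5-14B --enable-lora \
#       --lora-modules instruct=/path/to/instruct_lora merged=/path/to/merged_lora
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

for adapter in ("instruct", "merged"):  # names registered via --lora-modules
    completion = client.completions.create(
        model=adapter,  # select the LoRA adapter by its registered name
        prompt="Write a haiku about autumn.",
        max_tokens=32,
    )
    print(adapter, "->", completion.choices[0].text.strip())
```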
PR is recreated here
This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!
This pull request has been automatically closed due to inactivity. Please feel free to reopen if you intend to continue working on it. Thank you!