[CORE] Adding support for insertion of soft-tuned prompts
This PR adds support for inserting soft-tuned prompts into the input embeddings (trained using PEFT).
This functionality is required by the IBM team.
Summary of Changes:
- New `prompt_adapter` folder, similar to the `lora` folder, to create an LRU cache management system for multiple prompt adapters
- New `prompt_adapter_config` engine parameter, for easy extension to more sophisticated prompt-tuning techniques in the future
- New `prompt_adapter_request` parameter added to the generate functionality of the `LLMEngine` (a rough usage sketch follows this list)
- Current support for several models like `bloom`, `llama`, and `mistral`; easily extensible to others
- Simple test demonstrating that prompt adapters work for `bloom`
- Some parameter documentation is still pending; will complete post successful code review
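For context, a rough usage sketch of the new request path (the names follow the shape this PR converges on later in the thread, e.g. `enable_prompt_adapter` and `PromptAdapterRequest`; the model, adapter, and path are placeholders, so treat this as illustrative rather than the PR's documented API):

```python
from vllm import LLM, SamplingParams
from vllm.prompt_adapter.request import PromptAdapterRequest

# Enable prompt-adapter support when constructing the engine (illustrative).
llm = LLM(model="bigscience/bloomz-560m", enable_prompt_adapter=True)

# Point the request at a locally saved PEFT prompt-tuning checkpoint
# (placeholder path; the last field is the adapter's num_virtual_tokens).
pa_request = PromptAdapterRequest(
    prompt_adapter_name="twitter_complaints",
    prompt_adapter_id=1,
    prompt_adapter_local_path="/path/to/peft/prompt_adapter",
    prompt_adapter_num_virtual_tokens=8,
)

outputs = llm.generate(
    ["Tweet text: I hate the new update. Label: "],
    SamplingParams(temperature=0.0, max_tokens=5),
    prompt_adapter_request=pa_request,
)
print(outputs[0].outputs[0].text)
```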
Hey @Yard1, I have addressed your comments on this soft prompt tuning PR. Some updates:
- New `adapter_commons` folder with all the common code between LoRA and Prompt Adapters abstracted out
- Reduced the redundancy of adding adapter code to every model class by extending the `VocabParallelEmbedding` layer (roughly the idea sketched below)
- New parameter `enable_prompt_adapter` for more consistency with LoRA
- Added testing for multi-adapter inference
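For reference, a minimal sketch of the embedding-layer idea; this is not the PR's actual code, and the function and mask names are illustrative:

```python
import torch

def apply_soft_prompt(hidden_states: torch.Tensor,
                      soft_prompt: torch.Tensor,
                      virtual_token_mask: torch.Tensor) -> torch.Tensor:
    # hidden_states: [num_tokens, hidden_size] output of the embedding lookup
    # soft_prompt:   [num_virtual_tokens, hidden_size] trained prompt embedding
    # virtual_token_mask: [num_tokens] bool mask marking the virtual-token slots
    # (the number of True entries must equal num_virtual_tokens)
    hidden_states = hidden_states.clone()
    hidden_states[virtual_token_mask] = soft_prompt.to(hidden_states.dtype)
    return hidden_states
```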
After your review, happy to add more extensive tests + docs.
@SwapnilDreams100 thanks! let me take a look
Hi @Yard1 just a friendly reminder to review this PR when you get a chance, thanks! Once this design is approved, happy to update this with support for prefix tuning as well, which should be similar in design!
Hi @SwapnilDreams100, I have an initial implementation of adapter support for the OpenAI entry points based on https://github.com/SwapnilDreams100/vllm/tree/main. Would you be open to me contributing to your PR?
@SwapnilDreams100 Sorry for the delay here, I should be able to review this before the end of week.
One question - is using both LoRA and prompt adapters for a single request a supported use case?
@Yard1 no worries, thank you! Yes, it should work: you can provide both `PromptAdapterRequest` and `LoRARequest` parameters. I just tested a tiny example of this (sketched below); happy to add a test for it.
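Something along these lines (model, adapter names, and paths are placeholders):

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest
from vllm.prompt_adapter.request import PromptAdapterRequest

# Enable both adapter types on the same engine.
llm = LLM(model="meta-llama/Llama-2-7b-hf",
          enable_lora=True,
          enable_prompt_adapter=True)

# Attach a LoRA adapter and a prompt adapter to the same request.
outputs = llm.generate(
    ["Tweet text: the service was great. Label: "],
    SamplingParams(temperature=0.0, max_tokens=5),
    lora_request=LoRARequest("my_lora", 1, "/path/to/lora_adapter"),
    prompt_adapter_request=PromptAdapterRequest(
        "my_soft_prompt", 1, "/path/to/prompt_adapter", 8),
)
print(outputs[0].outputs[0].text)
```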
got it!
@SwapnilDreams100 openai entrypoint PR (https://github.com/SwapnilDreams100/vllm/pull/2) is ready for an initial review.
@SwapnilDreams100 @Yard1 if prompt adapters and lora modules can be used at the same time, should the model list returned by the api server list every possible adapter/module combination? If not, how should the user go about selecting adapter + module + model combos?
Thanks!
@SwapnilDreams100 @Yard1 I am very interested in this pull request working successfully. I have been testing Swapnil's fork, installed from source. While the tester code does work with the bloomz-560m twitter classifier example, I can verify that it does not work with modern implementations of PEFT and newer models like llama3. I have trained soft prompts myself with llama3-8B and llama3-70B and verified the outputs using PEFT and huggingface transformers; they work as expected. When I run the same soft prompt using the configuration in the bloomz example, the code executes and there is some difference between the +/- prompt adapter outputs, but the prompt is clearly not being applied correctly and the outputs are pretty random. Please let me know if I can help more. I hope this can get merged soon, but it is important to verify it works with the models people are using today (llama, qwen, mistral, gemma, nemotron, etc.).
Some extra information: the newer versions of PEFT have a very different JSON structure and naming conventions compared to the early implementation of prompt adapters in PEFT that the bloomz example uses.
@g-eoj thanks for the OpenAI entrypoints PR! Will look at it once I finish the refactor requested by Antoni.
@Rallio67 thanks for testing the PR out. I have tested it with bloom, llama-2, and gptbigcode, and it seemed to work for me. Can you please share the llama-3 adapter you are using that is not working?
Interesting detail on the newer JSON and naming conventions; I wrote this a while back, so it may have changed. Let me check.
@SwapnilDreams100 I have created a softprompt and some tester code that you can use to verify the method is working with llama-3 compatible models. I don't know what the issue might be, but llama-3 does use GQA instead of MHA for the attention for all model sizes. I can verify that the softprompt (uploaded as a .zip) and my python code using huggingface transformers and PEFT is working and generating the correct output. If you have your repo verified on your end as running the code equivalently I can do another verification from a fresh source install.
https://github.com/Rallio67/Softprompts-vllm
Thank you for working on this!
Hey @Rallio67, the prompt works, but there is a change pending that will cast the prompt to the model's dtype so it works out of the box. Currently the prompts are cast to float16, which is how my own prompts were trained.
So if you set dtype to float16 in the vLLM call, like so: `LLM(model="unsloth/llama-3-8b", enable_prompt_adapter=True, dtype='half')`, the prompt will work in the current version of the PR.
The dtype-agnostic prompt change will be made in the next day or so, and then this will not be necessary.
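The planned change would look roughly like this; the checkpoint filename and the `prompt_embeddings` key are assumptions based on how older PEFT prompt-tuning saves look, not code from this PR:

```python
import torch

def load_soft_prompt(adapter_path: str, model_dtype: torch.dtype,
                     device: str = "cuda") -> torch.Tensor:
    # Older PEFT prompt-tuning checkpoints store the learned embedding under
    # the "prompt_embeddings" key of adapter_model.bin (assumption for this sketch).
    state = torch.load(f"{adapter_path}/adapter_model.bin", map_location="cpu")
    # Cast to the serving model's dtype instead of hard-coding float16.
    return state["prompt_embeddings"].to(dtype=model_dtype, device=device)
```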
I will test it out with that parameter when I get home in a few hours. If you were taking the bfloat16 values (which is the format the prompt was saved in, I think) and treating them directly as float16, that would explain why the outputs were quite messed up. I will test it out and report back. I think vLLM supports both bfloat16 and float16, so hopefully the code will be able to handle whatever data type the user works with.
I ran the bloom test code with llama-3-8b and my tester softprompt and can verify that the code works when:
`llm = vllm.LLM(MODEL_PATH, enable_prompt_adapter=True, dtype="half")`
And the number of virtual tokens is changed from 8 to 20 (my adapter uses 20 virtual tokens). I think it would be good if the user could specify the precision; for now, though, fp16 should be fine for most use cases as long as people are aware of this limitation. bfloat16 is used pretty commonly as well.
float16 is generally preferred for inference when possible - the better precision means there's less output variability caused by different batching.
Hey @Yard1 I have addressed most of your comments in the latest version. A couple of points need some more discussion, please clarify in their topics. Thanks!
Highlights:
- Automatic Dtype casting to model dtype instead of float16 casting
- Added test for both LoRA and PA used together in the same request - they can be used simultaneously, and cuda graphs work
- Changes to the `adapter_commons`, `lora`, and `prompt_adapter` APIs to be cleaner. The `lora` functions needed to be renamed, but it only impacts the model_runner class and one lora test file.
Hey @Yard1, thank you for your feedback! Please review this version with your comments addressed. Highlights:
- Converted all `adapter_commons` class methods to abstract; pushed common functions to `utils`
- Embedding tensor is now preallocated. Added a new config param `max_prompt_adapter_tokens` for the max adapter size, used to allocate that tensor (roughly the idea sketched below)
- `test_pa_lora` updated; it now compares 2 LoRA requests, one with a prompt adapter and one without
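Illustrative sketch only of the preallocation idea (the names, shapes, and buffer layout here are assumptions, not the PR's code):

```python
import torch

# One buffer sized by the configured maximum; each adapter's embedding is
# copied into its first num_virtual_tokens rows at registration time.
max_prompt_adapter_tokens, hidden_size = 128, 4096
prompt_embedding_buffer = torch.zeros(
    max_prompt_adapter_tokens, hidden_size, dtype=torch.float16)

def register_prompt_adapter(embedding: torch.Tensor) -> None:
    num_virtual_tokens = embedding.shape[0]
    # This copy cannot succeed if num_virtual_tokens exceeds the preallocated size.
    prompt_embedding_buffer[:num_virtual_tokens].copy_(
        embedding.to(prompt_embedding_buffer.dtype))
```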
If `max_prompt_adapter_tokens` is less than `num_virtual_tokens`, there is an error like:
`RuntimeError: The expanded size of the tensor (10) must match the existing size (128) at non-singleton dimension 0.`
This message isn't intuitive, but the condition is likely to be hit when using prompt adapters.
- Can the case of `max_prompt_adapter_tokens` being less than `num_virtual_tokens` be caught and a descriptive error message shown (e.g. something along the lines of the sketch below)?
- Should `max_prompt_adapter_tokens` have a default value? The user is more likely to set it correctly if they are forced to set it every time prompt adapters are enabled.
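For the first point, the check could be as simple as the following (sketch only; where it should live in the engine is up to you):

```python
def validate_prompt_adapter(num_virtual_tokens: int,
                            max_prompt_adapter_tokens: int) -> None:
    # Raise a descriptive error instead of letting the tensor copy fail later.
    if num_virtual_tokens > max_prompt_adapter_tokens:
        raise ValueError(
            f"Prompt adapter has {num_virtual_tokens} virtual tokens, but the "
            f"engine was started with max_prompt_adapter_tokens="
            f"{max_prompt_adapter_tokens}. Increase max_prompt_adapter_tokens "
            "to at least the adapter's num_virtual_tokens.")
```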
@SwapnilDreams100 I'm hitting this error now with an adapter setup that worked before the refactor.
ERROR 07-03 19:42:16 async_llm_engine.py:54] Engine background task failed
ERROR 07-03 19:42:16 async_llm_engine.py:54] Traceback (most recent call last):
ERROR 07-03 19:42:16 async_llm_engine.py:54] File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 44, in _log_task_completion
ERROR 07-03 19:42:16 async_llm_engine.py:54] return_value = task.result()
ERROR 07-03 19:42:16 async_llm_engine.py:54] File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 601, in run_engine_loop
ERROR 07-03 19:42:16 async_llm_engine.py:54] result = task.result()
ERROR 07-03 19:42:16 async_llm_engine.py:54] File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 546, in engine_step
ERROR 07-03 19:42:16 async_llm_engine.py:54] request_outputs = await self.engine.step_async(virtual_engine)
ERROR 07-03 19:42:16 async_llm_engine.py:54] File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 240, in step_async
ERROR 07-03 19:42:16 async_llm_engine.py:54] output = await self.model_executor.execute_model_async(
ERROR 07-03 19:42:16 async_llm_engine.py:54] File "/usr/local/lib/python3.10/dist-packages/vllm/executor/distributed_gpu_executor.py", line 173, in execute_model_async
ERROR 07-03 19:42:16 async_llm_engine.py:54] return await self._driver_execute_model_async(execute_model_req)
ERROR 07-03 19:42:16 async_llm_engine.py:54] File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_gpu_executor.py", line 367, in _driver_execute_model_async
ERROR 07-03 19:42:16 async_llm_engine.py:54] results = await asyncio.gather(*tasks)
ERROR 07-03 19:42:16 async_llm_engine.py:54] File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_gpu_executor.py", line 352, in _run_task_with_lock
ERROR 07-03 19:42:16 async_llm_engine.py:54] return await task(*args, **kwargs)
ERROR 07-03 19:42:16 async_llm_engine.py:54] File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
ERROR 07-03 19:42:16 async_llm_engine.py:54] result = self.fn(*self.args, **self.kwargs)
ERROR 07-03 19:42:16 async_llm_engine.py:54] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 348, in execute_method
ERROR 07-03 19:42:16 async_llm_engine.py:54] raise e
ERROR 07-03 19:42:16 async_llm_engine.py:54] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 339, in execute_method
ERROR 07-03 19:42:16 async_llm_engine.py:54] return executor(*args, **kwargs)
ERROR 07-03 19:42:16 async_llm_engine.py:54] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 270, in execute_model
ERROR 07-03 19:42:16 async_llm_engine.py:54] output = self.model_runner.execute_model(
ERROR 07-03 19:42:16 async_llm_engine.py:54] File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
ERROR 07-03 19:42:16 async_llm_engine.py:54] return func(*args, **kwargs)
ERROR 07-03 19:42:16 async_llm_engine.py:54] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1276, in execute_model
ERROR 07-03 19:42:16 async_llm_engine.py:54] hidden_or_intermediate_states = model_executable(
ERROR 07-03 19:42:16 async_llm_engine.py:54] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
ERROR 07-03 19:42:16 async_llm_engine.py:54] return self._call_impl(*args, **kwargs)
ERROR 07-03 19:42:16 async_llm_engine.py:54] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
ERROR 07-03 19:42:16 async_llm_engine.py:54] return forward_call(*args, **kwargs)
ERROR 07-03 19:42:16 async_llm_engine.py:54] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 400, in forward
ERROR 07-03 19:42:16 async_llm_engine.py:54] model_output = self.model(input_ids, positions, kv_caches,
ERROR 07-03 19:42:16 async_llm_engine.py:54] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
ERROR 07-03 19:42:16 async_llm_engine.py:54] return self._call_impl(*args, **kwargs)
ERROR 07-03 19:42:16 async_llm_engine.py:54] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
ERROR 07-03 19:42:16 async_llm_engine.py:54] return forward_call(*args, **kwargs)
ERROR 07-03 19:42:16 async_llm_engine.py:54] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 297, in forward
ERROR 07-03 19:42:16 async_llm_engine.py:54] hidden_states = self.get_input_embeddings(input_ids)
ERROR 07-03 19:42:16 async_llm_engine.py:54] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 282, in get_input_embeddings
ERROR 07-03 19:42:16 async_llm_engine.py:54] return self.embed_tokens(input_ids)
ERROR 07-03 19:42:16 async_llm_engine.py:54] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
ERROR 07-03 19:42:16 async_llm_engine.py:54] return self._call_impl(*args, **kwargs)
ERROR 07-03 19:42:16 async_llm_engine.py:54] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
ERROR 07-03 19:42:16 async_llm_engine.py:54] return forward_call(*args, **kwargs)
ERROR 07-03 19:42:16 async_llm_engine.py:54] File "/usr/local/lib/python3.10/dist-packages/vllm/prompt_adapter/layers.py", line 80, in forward
ERROR 07-03 19:42:16 async_llm_engine.py:54] hidden_states[valid_mask] = gathered_embeddings
ERROR 07-03 19:42:16 async_llm_engine.py:54] IndexError: The shape of the mask [128] at index 0 does not match the shape of the indexed tensor [1, 5120] at index 0
Hmmm, @g-eoj can you share your test code?
For https://github.com/vllm-project/vllm/pull/4645#issuecomment-2207110714, removing the enforce eager flag causes the error to go away.
@SwapnilDreams100 I built from https://github.com/SwapnilDreams100/vllm/pull/2 and sent requests to the OpenAI completions endpoint of a server started with a command like:
python -m vllm.entrypoints.openai.api_server --model "TheBloke/Llama-2-13B-chat-AWQ" --enable-prompt-adapter --max-prompt-adapter-token=128 --prompt-adapters adapter=adapter --enforce-eager
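The requests themselves were ordinary OpenAI-style completion calls, roughly like this (assuming the prompt adapter is selected by its served name in the model field, the same way --lora-modules works; treat this as a sketch, not the PR's documented API):

```python
import requests

resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "adapter",  # name given to --prompt-adapters above
        "prompt": "Tweet text: I hate the new update. Label: ",
        "max_tokens": 5,
        "temperature": 0,
    },
)
print(resp.json()["choices"][0]["text"])
```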
Yeah, the fact that it is not working with `enforce_eager=True` (which means no CUDA graphs) shows that there is a serious issue. We should test both cases.
Hey @g-eoj and @Yard1, the eager issue has been fixed; it was unrelated to the forward graph code.
@SwapnilDreams100 first of all congrats!
I'd like to add tests for the OpenAI server; would it be best to wait for this to merge so I can open my own PR against main?
Sounds good @g-eoj, big thank you for your help!
Hey @Yard1 are we good to merge?
Thanks for this epic effort @SwapnilDreams100!! And big thanks to @Yard1 for the many detailed reviews.
I'll merge it before any new conflicts pop up!
Big thank you to @Yard1 for your guidance on this, this was a great learning experience!