[CORE] Adding support for insertion of soft-tuned prompts
This PR adds support for inserting soft-tuned prompts into the input embeddings (trained using PEFT).
This functionality is required by the IBM team.
Summary of Changes:
- New `prompt_adapter` folder, similar to the `lora` folder, to create an LRU cache management system for multiple prompt adapters
- New `prompt_adapter_config` engine parameter, for easy extension to more sophisticated prompt-tuning techniques in the future
- New `prompt_adapter_request` parameter added to the generate functionality of the `LLMEngine` (a rough usage sketch follows this list)
- Current support for several models like `bloom`, `llama`, and `mistral`; easily extensible to others
- Simple test demonstrating that prompt adapters work for `bloom`
- Some parameter documentation is still pending; will complete post successful code review
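For context, a rough usage sketch of the new request path (the names follow the shape this PR converges on later in the thread, e.g. `enable_prompt_adapter` and `PromptAdapterRequest`; the model, adapter, and path are placeholders, so treat this as illustrative rather than the PR's documented API):

```python
from vllm import LLM, SamplingParams
from vllm.prompt_adapter.request import PromptAdapterRequest

# Enable prompt-adapter support when constructing the engine (illustrative).
llm = LLM(model="bigscience/bloomz-560m", enable_prompt_adapter=True)

# Point the request at a locally saved PEFT prompt-tuning checkpoint
# (placeholder path; the last field is the adapter's num_virtual_tokens).
pa_request = PromptAdapterRequest(
    prompt_adapter_name="twitter_complaints",
    prompt_adapter_id=1,
    prompt_adapter_local_path="/path/to/peft/prompt_adapter",
    prompt_adapter_num_virtual_tokens=8,
)

outputs = llm.generate(
    ["Tweet text: I hate the new update. Label: "],
    SamplingParams(temperature=0.0, max_tokens=5),
    prompt_adapter_request=pa_request,
)
print(outputs[0].outputs[0].text)
```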
Hey @Yard1, I have addressed your comments on this soft prompt tuning PR. Some updates:
- New `adapter_commons` folder with all the common code between LoRA and Prompt Adapters abstracted out
- Reduced the redundancy of adding adapter code to every model class by extending the `VocabParallelEmbedding` layer (roughly the idea sketched below)
- New parameter `enable_prompt_adapter` for more consistency with LoRA
- Added testing for multi-adapter inference
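For reference, a minimal sketch of the embedding-layer idea; this is not the PR's actual code, and the function and mask names are illustrative:

```python
import torch

def apply_soft_prompt(hidden_states: torch.Tensor,
                      soft_prompt: torch.Tensor,
                      virtual_token_mask: torch.Tensor) -> torch.Tensor:
    # hidden_states: [num_tokens, hidden_size] output of the embedding lookup
    # soft_prompt:   [num_virtual_tokens, hidden_size] trained prompt embedding
    # virtual_token_mask: [num_tokens] bool mask marking the virtual-token slots
    # (the number of True entries must equal num_virtual_tokens)
    hidden_states = hidden_states.clone()
    hidden_states[virtual_token_mask] = soft_prompt.to(hidden_states.dtype)
    return hidden_states
```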
After your review, happy to add more extensive tests + docs.
@SwapnilDreams100 thanks! let me take a look
Hi @Yard1 just a friendly reminder to review this PR when you get a chance, thanks! Once this design is approved, happy to update this with support for prefix tuning as well, which should be similar in design!
Hi @SwapnilDreams100, I have an initial implementation of adapter support for the OpenAI entry points based on https://github.com/SwapnilDreams100/vllm/tree/main. Would you be open to me contributing to your PR?
@SwapnilDreams100 Sorry for the delay here, I should be able to review this before the end of week.
One question - is using both LoRA and prompt adapters for a single request a supported use case?
@Yard1 no worries, thank you! Yes, it should work: you can provide both `PromptAdapterRequest` and `LoRARequest` parameters. I just tested a tiny example of this (sketched below); happy to add a test for it.
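Something along these lines (model, adapter names, and paths are placeholders):

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest
from vllm.prompt_adapter.request import PromptAdapterRequest

# Enable both adapter types on the same engine.
llm = LLM(model="meta-llama/Llama-2-7b-hf",
          enable_lora=True,
          enable_prompt_adapter=True)

# Attach a LoRA adapter and a prompt adapter to the same request.
outputs = llm.generate(
    ["Tweet text: the service was great. Label: "],
    SamplingParams(temperature=0.0, max_tokens=5),
    lora_request=LoRARequest("my_lora", 1, "/path/to/lora_adapter"),
    prompt_adapter_request=PromptAdapterRequest(
        "my_soft_prompt", 1, "/path/to/prompt_adapter", 8),
)
print(outputs[0].outputs[0].text)
```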
got it!
@SwapnilDreams100 openai entrypoint PR (https://github.com/SwapnilDreams100/vllm/pull/2) is ready for an initial review.
@SwapnilDreams100 @Yard1 if prompt adapters and lora modules can be used at the same time, should the model list returned by the api server list every possible adapter/module combination? If not, how should the user go about selecting adapter + module + model combos?
Thanks!
@SwapnilDreams100 @Yard1 I am very interested in this pull request working successfully. I have been testing Swapnil's fork, installed from source. While the tester code does work with the bloomz-560m twitter classifier example, I can verify that it does not work with modern implementations of PEFT and newer models like llama3. I have trained soft prompts myself with llama3-8B and llama3-70B and verified the outputs using PEFT and huggingface transformers; they work as expected. When I run the same soft prompt using the configuration in the bloomz example, the code executes and there is some difference between the +/- prompt adapter outputs, but the prompt is clearly not being applied correctly and the outputs are pretty random. Please let me know if I can help more. I hope this can get merged soon, but it is important to verify it works with the models people are using today (llama, qwen, mistral, gemma, nemotron, etc.).
Some extra information: the newer versions of PEFT have a very different JSON structure and naming conventions compared to the early implementation of prompt adapters in PEFT that the bloomz example uses.
@g-eoj thanks for the OpenAI entrypoints PR! Will look at it once I finish the refactor requested by Antoni.
@Rallio67 thanks for testing the PR out. I have tested it with bloom, llama-2, and gptbigcode, and it seemed to work for me. Can you please share the llama-3 adapter you are using that is not working?
Interesting detail on the newer JSON and naming conventions; I wrote this a while back, so it may have changed. Let me check.
@SwapnilDreams100 I have created a softprompt and some tester code that you can use to verify the method is working with llama-3 compatible models. I don't know what the issue might be, but llama-3 does use GQA instead of MHA for the attention for all model sizes. I can verify that the softprompt (uploaded as a .zip) and my python code using huggingface transformers and PEFT is working and generating the correct output. If you have your repo verified on your end as running the code equivalently I can do another verification from a fresh source install.
https://github.com/Rallio67/Softprompts-vllm
Thank you for working on this!
Hey @Rallio67, the prompt works, but there is a change pending that will cast the prompt to the model's dtype so it works out of the box. Currently the prompts are cast to float16, which is how my own prompts were trained.
So if you set dtype to float16 in the vLLM call, like so: `LLM(model="unsloth/llama-3-8b", enable_prompt_adapter=True, dtype='half')`, the prompt will work in the current version of the PR.
The dtype-agnostic prompt change will be made in the next day or so, and then this will not be necessary.
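The planned change would look roughly like this; the checkpoint filename and the `prompt_embeddings` key are assumptions based on how older PEFT prompt-tuning saves look, not code from this PR:

```python
import torch

def load_soft_prompt(adapter_path: str, model_dtype: torch.dtype,
                     device: str = "cuda") -> torch.Tensor:
    # Older PEFT prompt-tuning checkpoints store the learned embedding under
    # the "prompt_embeddings" key of adapter_model.bin (assumption for this sketch).
    state = torch.load(f"{adapter_path}/adapter_model.bin", map_location="cpu")
    # Cast to the serving model's dtype instead of hard-coding float16.
    return state["prompt_embeddings"].to(dtype=model_dtype, device=device)
```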
I will test it out with that parameter when I get home in a few hours. If you were taking the bfloat16 values (which is the format the prompt was saved in, I think) and treating them directly as float16, that would explain why the outputs were quite messed up. I will test it out and report back. I think vLLM supports both bfloat16 and float16, so hopefully the code will be able to handle whatever data type the user works with.
I ran the bloom test code with llama-3-8b and my tester softprompt and can verify that the code works when:
`llm = vllm.LLM(MODEL_PATH, enable_prompt_adapter=True, dtype="half")`
And the number of virtual tokens is changed from 8 to 20 (my adapter uses 20 virtual tokens). I think it would be good if the user could specify the precision; for now, though, fp16 should be fine for most use cases as long as people are aware of this limitation. bfloat16 is used pretty commonly as well.
float16 is generally preferred for inference when possible - the better precision means there's less output variability caused by different batching.
Hey @Yard1 I have addressed most of your comments in the latest version. A couple of points need some more discussion, please clarify in their topics. Thanks!
Highlights:
- Automatic Dtype casting to model dtype instead of float16 casting
- Added test for both LoRA and PA used together in the same request - they can be used simultaneously, and cuda graphs work
- Changes to the `adapter_commons`, `lora`, and `prompt_adapter` APIs to be cleaner. The `lora` functions needed to be renamed, but it only impacts the model_runner class and one lora test file.
Hey @Yard1, thank you for your feedback! Please review this version with your comments addressed. Highlights:
- Converted all `adapter_commons` class methods to abstract; pushed common functions to `utils`
- Embedding tensor is now preallocated. Added a new config param `max_prompt_adapter_tokens` for the max adapter size, used to allocate that tensor (roughly the idea sketched below)
- `test_pa_lora` updated; it now compares 2 LoRA requests, one with a prompt adapter and one without
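Illustrative sketch only of the preallocation idea (the names, shapes, and buffer layout here are assumptions, not the PR's code):

```python
import torch

# One buffer sized by the configured maximum; each adapter's embedding is
# copied into its first num_virtual_tokens rows at registration time.
max_prompt_adapter_tokens, hidden_size = 128, 4096
prompt_embedding_buffer = torch.zeros(
    max_prompt_adapter_tokens, hidden_size, dtype=torch.float16)

def register_prompt_adapter(embedding: torch.Tensor) -> None:
    num_virtual_tokens = embedding.shape[0]
    # This copy cannot succeed if num_virtual_tokens exceeds the preallocated size.
    prompt_embedding_buffer[:num_virtual_tokens].copy_(
        embedding.to(prompt_embedding_buffer.dtype))
```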
If `max_prompt_adapter_tokens` is less than `num_virtual_tokens`, there is an error like:
`RuntimeError: The expanded size of the tensor (10) must match the existing size (128) at non-singleton dimension 0.`
This message isn't intuitive, but the condition is likely to be hit when using prompt adapters.
- Can the case of `max_prompt_adapter_tokens` being less than `num_virtual_tokens` be caught and a descriptive error message shown (e.g. something along the lines of the sketch below)?
- Should `max_prompt_adapter_tokens` have a default value? The user is more likely to set it correctly if they are forced to set it every time prompt adapters are enabled.
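For the first point, the check could be as simple as the following (sketch only; where it should live in the engine is up to you):

```python
def validate_prompt_adapter(num_virtual_tokens: int,
                            max_prompt_adapter_tokens: int) -> None:
    # Raise a descriptive error instead of letting the tensor copy fail later.
    if num_virtual_tokens > max_prompt_adapter_tokens:
        raise ValueError(
            f"Prompt adapter has {num_virtual_tokens} virtual tokens, but the "
            f"engine was started with max_prompt_adapter_tokens="
            f"{max_prompt_adapter_tokens}. Increase max_prompt_adapter_tokens "
            "to at least the adapter's num_virtual_tokens.")
```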
@SwapnilDreams100 I'm hitting this error now with an adapter setup that worked before the refactor.
ERROR 07-03 19:42:16 async_llm_engine.py:54] Engine background task failed
ERROR 07-03 19:42:16 async_llm_engine.py:54] Traceback (most recent call last):
ERROR 07-03 19:42:16 async_llm_engine.py:54] File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 44, in _log_task_completion
ERROR 07-03 19:42:16 async_llm_engine.py:54] return_value = task.result()
ERROR 07-03 19:42:16 async_llm_engine.py:54] File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 601, in run_engine_loop
ERROR 07-03 19:42:16 async_llm_engine.py:54] result = task.result()
ERROR 07-03 19:42:16 async_llm_engine.py:54] File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 546, in engine_step
ERROR 07-03 19:42:16 async_llm_engine.py:54] request_outputs = await self.engine.step_async(virtual_engine)
ERROR 07-03 19:42:16 async_llm_engine.py:54] File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 240, in step_async
ERROR 07-03 19:42:16 async_llm_engine.py:54] output = await self.model_executor.execute_model_async(
ERROR 07-03 19:42:16 async_llm_engine.py:54] File "/usr/local/lib/python3.10/dist-packages/vllm/executor/distributed_gpu_executor.py", line 173, in execute_model_async
ERROR 07-03 19:42:16 async_llm_engine.py:54] return await self._driver_execute_model_async(execute_model_req)
ERROR 07-03 19:42:16 async_llm_engine.py:54] File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_gpu_executor.py", line 367, in _driver_execute_model_async
ERROR 07-03 19:42:16 async_llm_engine.py:54] results = await asyncio.gather(*tasks)
ERROR 07-03 19:42:16 async_llm_engine.py:54] File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_gpu_executor.py", line 352, in _run_task_with_lock
ERROR 07-03 19:42:16 async_llm_engine.py:54] return await task(*args, **kwargs)
ERROR 07-03 19:42:16 async_llm_engine.py:54] File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
ERROR 07-03 19:42:16 async_llm_engine.py:54] result = self.fn(*self.args, **self.kwargs)
ERROR 07-03 19:42:16 async_llm_engine.py:54] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 348, in execute_method
ERROR 07-03 19:42:16 async_llm_engine.py:54] raise e
ERROR 07-03 19:42:16 async_llm_engine.py:54] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 339, in execute_method
ERROR 07-03 19:42:16 async_llm_engine.py:54] return executor(*args, **kwargs)
ERROR 07-03 19:42:16 async_llm_engine.py:54] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 270, in execute_model
ERROR 07-03 19:42:16 async_llm_engine.py:54] output = self.model_runner.execute_model(
ERROR 07-03 19:42:16 async_llm_engine.py:54] File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
ERROR 07-03 19:42:16 async_llm_engine.py:54] return func(*args, **kwargs)
ERROR 07-03 19:42:16 async_llm_engine.py:54] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1276, in execute_model
ERROR 07-03 19:42:16 async_llm_engine.py:54] hidden_or_intermediate_states = model_executable(
ERROR 07-03 19:42:16 async_llm_engine.py:54] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
ERROR 07-03 19:42:16 async_llm_engine.py:54] return self._call_impl(*args, **kwargs)
ERROR 07-03 19:42:16 async_llm_engine.py:54] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
ERROR 07-03 19:42:16 async_llm_engine.py:54] return forward_call(*args, **kwargs)
ERROR 07-03 19:42:16 async_llm_engine.py:54] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 400, in forward
ERROR 07-03 19:42:16 async_llm_engine.py:54] model_output = self.model(input_ids, positions, kv_caches,
ERROR 07-03 19:42:16 async_llm_engine.py:54] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
ERROR 07-03 19:42:16 async_llm_engine.py:54] return self._call_impl(*args, **kwargs)
ERROR 07-03 19:42:16 async_llm_engine.py:54] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
ERROR 07-03 19:42:16 async_llm_engine.py:54] return forward_call(*args, **kwargs)
ERROR 07-03 19:42:16 async_llm_engine.py:54] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 297, in forward
ERROR 07-03 19:42:16 async_llm_engine.py:54] hidden_states = self.get_input_embeddings(input_ids)
ERROR 07-03 19:42:16 async_llm_engine.py:54] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 282, in get_input_embeddings
ERROR 07-03 19:42:16 async_llm_engine.py:54] return self.embed_tokens(input_ids)
ERROR 07-03 19:42:16 async_llm_engine.py:54] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
ERROR 07-03 19:42:16 async_llm_engine.py:54] return self._call_impl(*args, **kwargs)
ERROR 07-03 19:42:16 async_llm_engine.py:54] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
ERROR 07-03 19:42:16 async_llm_engine.py:54] return forward_call(*args, **kwargs)
ERROR 07-03 19:42:16 async_llm_engine.py:54] File "/usr/local/lib/python3.10/dist-packages/vllm/prompt_adapter/layers.py", line 80, in forward
ERROR 07-03 19:42:16 async_llm_engine.py:54] hidden_states[valid_mask] = gathered_embeddings
ERROR 07-03 19:42:16 async_llm_engine.py:54] IndexError: The shape of the mask [128] at index 0 does not match the shape of the indexed tensor [1, 5120] at index 0
Hmmm, @g-eoj can you share your test code?
For https://github.com/vllm-project/vllm/pull/4645#issuecomment-2207110714, removing the enforce eager flag causes the error to go away.
@SwapnilDreams100 I built from https://github.com/SwapnilDreams100/vllm/pull/2 and sent requests to the OpenAI completions endpoint of a server started with a command like:
python -m vllm.entrypoints.openai.api_server --model "TheBloke/Llama-2-13B-chat-AWQ" --enable-prompt-adapter --max-prompt-adapter-token=128 --prompt-adapters adapter=adapter --enforce-eager
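The requests themselves were ordinary OpenAI-style completion calls, roughly like this (assuming the prompt adapter is selected by its served name in the model field, the same way --lora-modules works; treat this as a sketch, not the PR's documented API):

```python
import requests

resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "adapter",  # name given to --prompt-adapters above
        "prompt": "Tweet text: I hate the new update. Label: ",
        "max_tokens": 5,
        "temperature": 0,
    },
)
print(resp.json()["choices"][0]["text"])
```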
Yeah, the fact that it is not working with `enforce_eager=True` (which means no CUDA graphs) shows that there is a serious issue. We should test both cases.
Hey @g-eoj and @Yard1, the eager issue has been fixed; it was unrelated to the forward graph code.
@SwapnilDreams100 first of all congrats!
I'd like to add tests for the OpenAI server; would it be best to wait for this to merge so I can open my own PR against main?
Sounds good @g-eoj, big thank you for your help!
Hey @Yard1 are we good to merge?
Thanks for this epic effort @SwapnilDreams100!! And big thanks to @Yard1 for the many detailed reviews.
I'll merge it before any new conflicts pop up!
Big thank you to @Yard1 for your guidance on this, this was a great learning experience!