[Frontend] Add option for LLMEngine to return model hidden states.
Add a `return_hidden_states` option in `ModelConfig` and allow hidden states to be passed up from `SamplerOutput` to `RequestOutput` for use by the client.
FIX #4435
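For reference, a minimal usage sketch of the proposed API (`return_hidden_states` and the `hidden_states` field on `RequestOutput` are this PR's additions, not part of released vLLM):

```python
from vllm import LLM, SamplingParams

# Sketch only: `return_hidden_states` is the option proposed by this PR.
llm = LLM(model="facebook/opt-125m", return_hidden_states=True)
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=8))

for output in outputs:
    print(output.outputs[0].text)
    # Expected under this PR: a tensor of last-layer hidden states,
    # one row per token.
    print(output.hidden_states.shape)
```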
👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs will not trigger a full CI run by default. Instead, they only run the fastcheck CI, which consists of a small, essential subset of CI tests to quickly catch errors. You can run other CI tests on top of the default ones by unblocking the steps in your fastcheck build in the Buildkite UI.
Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge).
To run full CI, you can do one of these:
- Comment `/ready` on the PR
- Add the `ready` label to the PR
- Enable auto-merge.
🚀
Thanks a lot for your contribution. Could you provide sample snippets for using the hidden states? Specifically, what does the returned `hidden_states` vector contain?
Just figured it out: the returned hidden states vector is a concatenation of all the last-layer hidden states. From a functional standpoint, I would strongly vote for merging this PR ASAP.
Yes, exactly. Though you made me realise a mistake in how I was concatenating the hidden states. I have now fixed this and made the tests a bit more exacting.
Admittedly, I am not very happy with having to piggyback on an unrelated helper function for some of this logic. My goal was to modify as little core code as is necessary.
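To make the implied layout concrete (an assumption based on the discussion above, not a documented contract):

```python
import torch

# Stand-in for RequestOutput.hidden_states under this PR: one final-layer
# state per token, concatenated along dim 0 in sequence order.
num_tokens, hidden_size = 5, 768
hidden_states = torch.randn(num_tokens, hidden_size)

final_token_state = hidden_states[-1]  # last-layer state of the final token
assert final_token_state.shape == (hidden_size,)
```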
Thanks for the fix. I also find that `return_hidden_states=True` makes GPU memory usage keep going up when using your patch with `llm.generate`. I guess it can be solved by moving the tensors to the CPU?
Ah yes, I can see accumulating the tensors on the GPU being a bit of a problem for memory handling. Good catch! Just pushed a fix.
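For anyone reading later, the gist of the fix is to detach each step's tensor and move it off the GPU as soon as it is produced, rather than accumulating CUDA tensors across decode steps. A rough sketch of the pattern (not the exact patch):

```python
import torch

accumulated: list[torch.Tensor] = []  # lives on the host, not the GPU

def on_step(hidden_states: torch.Tensor) -> None:
    # Detach and copy to CPU so the CUDA allocator can reuse the memory;
    # otherwise GPU usage grows with every generated token.
    accumulated.append(hidden_states.detach().cpu())

# Later, stitch the per-step states back together on the host:
# all_states = torch.cat(accumulated, dim=0)
```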
Confirmed that this fix solved the problem.
/ready
@BiEchi Do you know if there is anything I need to do to be assigned a reviewer? I would love to get this merged.
@jdvin I think maybe you should @ one of the authors of this repo. I used your patch and it works nicely overall.
One issue is that the output representations do not exactly match those extracted directly from the transformers lib, but they're close enough that I didn't encounter any major problems. It would be nice if you could look into this issue and explain the reason to the authors. My conjecture is that vLLM's optimizations tweak the inner representations slightly.
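One way to quantify the gap (a sketch; the model, prompt, and the `vllm_last` tensor from this PR's output are placeholders): run the same prompt through `transformers` and compare with a loose tolerance, since vLLM's fused kernels and different reduction orders can plausibly shift values slightly.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("facebook/opt-125m")
hf_model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")

inputs = tok("Hello, my name is", return_tensors="pt")
with torch.no_grad():
    hf_out = hf_model(**inputs, output_hidden_states=True)
hf_last = hf_out.hidden_states[-1][0]  # (seq_len, hidden_size)

# `vllm_last` would come from this PR's RequestOutput.hidden_states.
# Expect agreement only to a loose tolerance, not bit-exactness:
# torch.allclose(hf_last, vllm_last, atol=1e-2, rtol=1e-2)
```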
@jdvin thanks for your patch! +1 this would be a great feature to merge! perhaps @liangfu could help review these?
Sorry, I have no idea why merging main resulted in requesting review from everyone. I am presently working to make my tests pass with the new changes from main.
Edit: All passing except for AMD engine.
I appreciate this feature. I conducted some tests on your fork and noticed that `return_hidden_states` does not seem to work with the sampling parameter n>1. I haven't gone over your code and am not a reviewer, but I would still like to ask whether it would be possible to implement it within `SamplingParams` rather than on the `LLM` class.
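To illustrate the suggestion (purely hypothetical; `SamplingParams` has no such field in this PR or in vLLM), a per-request version might look like:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")
# Hypothetical per-request flag, as suggested above; not a real parameter.
params = SamplingParams(n=2, max_tokens=8, return_hidden_states=True)
outputs = llm.generate(["Hello, my name is"], params)
```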
@jdvin The PR has been stale for a few months and I just wonder if there is any progress on this problem?
@jdvin do you know if this will work out of the box with the async LLM engine in vLLM?
> Add a `return_hidden_states` option in `ModelConfig` and allow hidden states to be passed up from `SamplerOutput` to `RequestOutput` for use by the client. FIX #4435

@jdvin Does it support returning hidden states of all tokens when prefix caching is enabled?
I have the same need and would be happy to see this merged ASAP.
Happy to help take this PR forward if needed, would be really helpful @jdvin
Hi @rajasbansal - let me know if you want to collab. We also have an interest in getting this PR merged and used as a basis for returning hidden_states through the API layer.
Hey all,
Apologies for disappearing. I had written this for a particular use case at my last company, and it had served its purpose. I initially wanted to get it merged but lost a bit of motivation when I did not receive any feedback from the vLLM maintainers. I am now taking a bit of time off from work (starting next week) and am happy to dedicate some of this time to bringing this branch up to speed with main and including any missing features (e.g., sampling with n > 1). Hopefully we can get it merged.
No worries! I'm working on a PR for this and hope to post in the next few days. Two aspects I'm also tackling are `multi_step.py` support (for speculative decoding) as well as the v1 engine.
Great! Thank you for your input. Feel free to tag me as a reviewer 😄
Please merge it, thanks, we really need it!
Returning hidden states should be an engine-level feature, used the way embedding models are; it should not be a per-request parameter. I think @DarkLight1337 is working on embedding model support in v1?
Yes, see #12249
This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @jdvin.
https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork
@jdvin yeah, I think it's better to just close this PR, as the system has modification restrictions. Very grateful for your contribution here; it helped many of us get the hidden states! For anyone using this feature in the future, please react with a like on the first commit to make people aware of this PR.