[Frontend] Add option for LLMEngine to return model hidden states.
Add a `return_hidden_states` option in `ModelConfig` and allow hidden states to be passed up from `SamplerOutput` to `RequestOutput` for use by the client.
FIX #4435
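For reference, a minimal usage sketch of the proposed API (`return_hidden_states` and the `hidden_states` field on `RequestOutput` are this PR's additions, not part of released vLLM):

```python
from vllm import LLM, SamplingParams

# Sketch only: `return_hidden_states` is the option proposed by this PR.
llm = LLM(model="facebook/opt-125m", return_hidden_states=True)
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=8))

for output in outputs:
    print(output.outputs[0].text)
    # Expected under this PR: a tensor of last-layer hidden states,
    # one row per token.
    print(output.hidden_states.shape)
```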
👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs will not trigger a full CI run by default. Instead, they only run the fastcheck CI, which consists of a small, essential subset of CI tests to quickly catch errors. You can run other CI tests on top of the default ones by unblocking the steps in your fastcheck build in the Buildkite UI.
Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge).
To run full CI, you can do one of these:
- Comment `/ready` on the PR
- Add the `ready` label to the PR
- Enable auto-merge.
🚀
Thanks a lot for your contribution. Could you provide sample snippets for using the hidden states? Specifically, what does the returned `hidden_states` vector contain?
Just figured it out: the returned hidden states vector is a concatenation of all the last-layer hidden states. From a functional standpoint, I would strongly vote for merging this PR ASAP.
Yes, exactly. Though you made me realise a mistake in how I was concatenating the hidden states. I have now fixed this and made the tests a bit more exacting.
Admittedly, I am not very happy with having to piggyback on an unrelated helper function for some of this logic. My goal was to modify as little core code as is necessary.
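To make the implied layout concrete (an assumption based on the discussion above, not a documented contract):

```python
import torch

# Stand-in for RequestOutput.hidden_states under this PR: one final-layer
# state per token, concatenated along dim 0 in sequence order.
num_tokens, hidden_size = 5, 768
hidden_states = torch.randn(num_tokens, hidden_size)

final_token_state = hidden_states[-1]  # last-layer state of the final token
assert final_token_state.shape == (hidden_size,)
```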
Thanks for the fix. I also find that `return_hidden_states=True` makes GPU memory usage keep going up when using your patch with `llm.generate`. I guess it can be solved by moving the tensors to the CPU?
Ah yes, I can see accumulating the tensors on the GPU being a bit of a problem for memory handling. Good catch! Just pushed a fix.
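For anyone reading later, the gist of the fix is to detach each step's tensor and move it off the GPU as soon as it is produced, rather than accumulating CUDA tensors across decode steps. A rough sketch of the pattern (not the exact patch):

```python
import torch

accumulated: list[torch.Tensor] = []  # lives on the host, not the GPU

def on_step(hidden_states: torch.Tensor) -> None:
    # Detach and copy to CPU so the CUDA allocator can reuse the memory;
    # otherwise GPU usage grows with every generated token.
    accumulated.append(hidden_states.detach().cpu())

# Later, stitch the per-step states back together on the host:
# all_states = torch.cat(accumulated, dim=0)
```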
Confirmed that this fix solved the problem.
/ready
@BiEchi Do you know if there is anything I need to do to be assigned a reviewer? I would love to get this merged.
@jdvin I think maybe you should @ one of the authors of this repo. I used your patch and it works nicely overall.
One issue is that the output representations do not exactly match those extracted directly from the transformers lib, but they're close enough that I didn't encounter any major problems. It would be nice if you could look into this issue and explain the reason to the authors. My conjecture is that vLLM's optimizations tweak the inner representations slightly.
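One way to quantify the gap (a sketch; the model, prompt, and the `vllm_last` tensor from this PR's output are placeholders): run the same prompt through `transformers` and compare with a loose tolerance, since vLLM's fused kernels and different reduction orders can plausibly shift values slightly.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("facebook/opt-125m")
hf_model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")

inputs = tok("Hello, my name is", return_tensors="pt")
with torch.no_grad():
    hf_out = hf_model(**inputs, output_hidden_states=True)
hf_last = hf_out.hidden_states[-1][0]  # (seq_len, hidden_size)

# `vllm_last` would come from this PR's RequestOutput.hidden_states.
# Expect agreement only to a loose tolerance, not bit-exactness:
# torch.allclose(hf_last, vllm_last, atol=1e-2, rtol=1e-2)
```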
@jdvin thanks for your patch! +1 this would be a great feature to merge! perhaps @liangfu could help review these?
Sorry, I have no idea why merging main resulted in requesting review from everyone. I am presently working to make my tests pass with the new changes from main.
Edit: All passing except for AMD engine.
I appreciate this feature. I conducted some tests on your fork and noticed that `return_hidden_states` does not seem to work with the sampling parameter n>1. I haven't gone over your code and am not a reviewer, but I would still like to ask whether it would be possible to implement it within `SamplingParams` rather than on the `LLM` class.
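To illustrate the suggestion (purely hypothetical; `SamplingParams` has no such field in this PR or in vLLM), a per-request version might look like:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")
# Hypothetical per-request flag, as suggested above; not a real parameter.
params = SamplingParams(n=2, max_tokens=8, return_hidden_states=True)
outputs = llm.generate(["Hello, my name is"], params)
```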
@jdvin The PR has been stale for a few months and I just wonder if there is any progress on this problem?
@jdvin do you know if this will work out of the box with the async LLM engine in vLLM?
> Add a `return_hidden_states` option in `ModelConfig` and allow hidden states to be passed up from `SamplerOutput` to `RequestOutput` for use by the client. FIX #4435

@jdvin Does it support returning hidden states of all tokens when prefix caching is enabled?
I have the same need and would be happy to see this merged ASAP.
Happy to help take this PR forward if needed, would be really helpful @jdvin
Hi @rajasbansal - let me know if you want to collab. We also have an interest in getting this PR merged and used as a basis for returning hidden_states through the API layer.
Hey all,
Apologies for disappearing. I had written this for a particular use case at my last company, and it had served its purpose. I initially wanted to get it merged but lost a bit of motivation when I did not receive any feedback from the vLLM maintainers. I am now taking a bit of time off from work (starting next week) and am happy to dedicate some of this time to bringing this branch up to speed with main and including any missing features (e.g., sampling with n > 1). Hopefully we can get it merged.
No worries! I'm working on a PR for this and hope to post in the next few days. Two aspects I'm also tackling are `multi_step.py` support (for speculative decoding) as well as the v1 engine.
Great! Thank you for your input. Feel free to tag me as a reviewer 😄
Please merge it, thanks, we really need it!
Returning hidden states should be an engine-level feature, used the way embedding models are; it should not be a per-request parameter. I think @DarkLight1337 is working on embedding model support in v1?
Yes, see #12249
This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @jdvin.
https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork
@jdvin yeah, I think it's better to just close this PR, as the system has modification restrictions. Very grateful for your contribution here; it helped many of us get the hidden states! For anyone using this feature in the future, please react with a like on the first commit to make people aware of this PR.