v1: Add Request.block_hashes

Open orozery opened this pull request 5 months ago • 4 comments

This PR moves the request block hashes from the KVCacheManager to the Request object itself. In particular, this allows connectors to access a request's block hashes.
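
At a high level, the intended shape looks something like the sketch below. This is hypothetical, not the actual diff: Request and the connector method name come from this discussion, while ExampleConnector, its backend, and the hashing details are made up for illustration.

```python
# Hypothetical sketch of the intent of this PR (not the actual diff).
from dataclasses import dataclass, field


@dataclass
class Request:
    request_id: str
    prompt_token_ids: list[int]
    # Previously tracked by the KVCacheManager; now carried on the request
    # itself so that KV connectors can read it directly.
    block_hashes: list[bytes] = field(default_factory=list)


class ExampleConnector:
    """Illustrative connector: the method name follows the thread, the body is made up."""

    def __init__(self, backend, block_size: int):
        self.backend = backend        # hypothetical external KV store
        self.block_size = block_size

    def get_num_new_matched_tokens(self, request: Request,
                                   num_computed_tokens: int) -> int:
        # Count leading blocks whose hashes hit in the external store.
        matched = 0
        for h in request.block_hashes:
            if not self.backend.contains(h):
                break
            matched += 1
        return max(matched * self.block_size - num_computed_tokens, 0)
```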

orozery avatar Jun 17 '25 06:06 orozery

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which exercises a small and essential subset of tests to quickly catch errors. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

github-actions[bot] avatar Jun 17 '25 06:06 github-actions[bot]

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @orozery.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify[bot] avatar Jun 17 '25 06:06 mergify[bot]

My major concern is that the block_hash list is updated here

https://github.com/vllm-project/vllm/blob/d3f05c9248d79dc900c79d090db16cc2e5d96ee3/vllm/v1/core/block_pool.py#L177

This is quite far from the Request class. It makes the number of elements in the list appear "magic" to connectors and hurts the modularity of the code. Therefore, if you want to put it into the Request class, can you move this logic from "update when a block hash is needed" to "update when the token ids of the request are updated"?
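
For reference, a simplified paraphrase of that call site (not verbatim; the hash computation below is illustrative):

```python
# Simplified paraphrase of BlockPool.cache_full_blocks (not verbatim).
import hashlib


class BlockPool:
    def cache_full_blocks(self, request, block_hashes, num_cached_blocks,
                          num_full_blocks, block_size):
        # Hashes for newly full blocks are appended here, as a side effect of
        # caching and far away from the Request. How many entries the list
        # holds at any moment depends on cache bookkeeping, which is what
        # makes it "magic" from a connector's point of view.
        for i in range(num_cached_blocks, num_full_blocks):
            block = request.all_token_ids[i * block_size:(i + 1) * block_size]
            block_hashes.append(hashlib.sha256(repr(block).encode()).digest())
```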

Today, block_hashes are computed when the scheduler calls _update_waiting_for_remote_kv. Shortly after this call, calls are made to kv_cache_manager.get_computed_blocks and connector.get_num_new_matched_tokens. The connector must get the block hashes at this point.

I'm trying to interpret your suggestion "update when the token ids of the request are updated". My best guess is that you mean request.append_output_token_ids. This is too late for the connector, so I suspect I am misunderstanding you. Can you please be more concrete and reference actual code locations?

orozery avatar Jul 06 '25 07:07 orozery

Yes, I mean request.append_output_token_ids. Based on the code here, the new tokens are appended in update_from_output in step k and then used by the connector in scheduler.schedule() of step k+1. Why is that too late? https://github.com/vllm-project/vllm/blob/72d14d0eed4b29e5827519283c085a7a674f3256/vllm/v1/engine/core.py#L223-L240
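
For reference, the loop in the linked core.py has roughly this shape (simplified and illustrative, not verbatim):

```python
# Simplified shape of the loop in the linked core.py (illustrative, not verbatim).
class EngineCore:
    def __init__(self, scheduler, model_executor):
        self.scheduler = scheduler
        self.model_executor = model_executor

    def step(self):
        scheduler_output = self.scheduler.schedule()   # connector consulted here
        model_output = self.model_executor.execute_model(scheduler_output)
        # New output tokens are appended to each request inside
        # update_from_output (step k), so they are visible to the connector
        # in schedule() of step k+1.
        return self.scheduler.update_from_output(scheduler_output, model_output)
```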

heheda12345 avatar Jul 08 '25 06:07 heheda12345

Yes, I mean request.append_output_token_ids. Based on the code here, the new tokens are appended in update_from_output in step k and then used by the connector in scheduler.schedule() of step k+1. Why is that too late?

https://github.com/vllm-project/vllm/blob/72d14d0eed4b29e5827519283c085a7a674f3256/vllm/v1/engine/core.py#L223-L240

@heheda12345 Consider the following scenario:

  1. The engine gets a fresh prompt with 1000 tokens (the exact number does not matter).
  2. The scheduler pops the request from its self.waiting list.
  3. The scheduler checks whether this prompt already has computed tokens in the GPU prefix cache by calling self.kv_cache_manager.get_computed_blocks. Assume no hits are found, so num_computed_tokens = 0.
  4. Next, it queries the connector for any externally cached tokens by calling self.connector.get_num_new_matched_tokens(request, ...). At this point the connector needs the block hashes in order to check for hits.
  5. With your approach of delaying the setting of request.block_hashes, the connector will yield num_external_computed_tokens = 0, even though it may have all the tokens available!
  6. The request tokens will be sent to the workers, and the workers will recompute the KV values for these 1000 tokens.
  7. Only after the workers recompute the tokens does the scheduler call update_from_output, setting the block_hashes.

To sum up: because of the delay in setting request.block_hashes, we lose the ability to utilize offloaded KV values from an external source (sketched below).
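
To make step 5 concrete, here is a hedged continuation of the earlier Request/ExampleConnector sketch; AnyBackend is a made-up stand-in that claims every block is externally available:

```python
# Steps 3-5 with lazily populated hashes, reusing Request/ExampleConnector
# from the earlier sketch (all names illustrative).
class AnyBackend:
    def contains(self, h):
        return True           # pretend *every* block is available externally


request = Request(request_id="r1", prompt_token_ids=list(range(1000)))
assert request.block_hashes == []   # lazy hashing: nothing hashed yet

connector = ExampleConnector(backend=AnyBackend(), block_size=16)
assert connector.get_num_new_matched_tokens(request, 0) == 0
# Forced miss: even with every block externally available, there are no
# hashes to look up. They only appear after update_from_output (step 7),
# by which time the workers have already recomputed all 1000 tokens.
```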

orozery avatar Jul 08 '25 07:07 orozery

But I think the hashes of the prompt can be initialized when the request is created.

heheda12345 avatar Jul 08 '25 08:07 heheda12345

But I think the hashes of the prompt can be initialized when the request is created.

@heheda12345 The block hashes are computed inside BlockPool.cache_full_blocks, while the Request object is created in EngineCore.add_request. Are you suggesting that we compute the hashes inside EngineCore.add_request?

orozery avatar Jul 08 '25 08:07 orozery

Yes, you can compute the hashes in both add_request and append_output_token_ids. Though I prefer the solution in https://github.com/vllm-project/vllm/pull/15652 for hashing prompt tokens, I'm OK with computing the hashes in add_request in this PR and moving the logic to the frontend later.
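
A minimal sketch of that placement (update_block_hashes is a hypothetical helper; only add_request and append_output_token_ids come from this discussion):

```python
import hashlib


def update_block_hashes(request, block_size):
    """Hypothetical helper: hash each newly completed full block, chaining
    each hash to its parent so equal blocks behind different prefixes differ."""
    tokens = request.all_token_ids
    start = len(request.block_hashes) * block_size
    prev = request.block_hashes[-1] if request.block_hashes else b""
    while start + block_size <= len(tokens):
        block = tokens[start:start + block_size]
        prev = hashlib.sha256(prev + repr(block).encode()).digest()
        request.block_hashes.append(prev)
        start += block_size

# Hook 1: call once from EngineCore.add_request to cover the prompt tokens.
# Hook 2: call from Request.append_output_token_ids to cover decoded tokens.
```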

heheda12345 avatar Jul 10 '25 12:07 heheda12345

Yes, you can compute the hashes in both add_request and append_output_token_ids.

@heheda12345 I made the changes as you suggested.

orozery avatar Jul 21 '25 07:07 orozery

@heheda12345 @njhill Moved all hashing to be triggered by Request directly.
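
One plausible shape for that wiring is sketched below. Only the get_request_block_hasher name comes from the thread (see the next comments); the closure and Request details are illustrative.

```python
import hashlib


# Hypothetical closure-based wiring; only the factory name comes from the thread.
def get_request_block_hasher(block_size, hash_fn):
    """Return a callable yielding hashes of newly completed full blocks."""
    def hasher(request):
        new_hashes = []
        tokens = request.all_token_ids
        start = len(request.block_hashes) * block_size
        prev = request.block_hashes[-1] if request.block_hashes else b""
        while start + block_size <= len(tokens):
            # Chain the parent hash so identical blocks behind different
            # prefixes do not collide.
            prev = hash_fn((prev, tuple(tokens[start:start + block_size])))
            new_hashes.append(prev)
            start += block_size
        return new_hashes
    return hasher


class Request:
    def __init__(self, prompt_token_ids, block_hasher=None):
        self.prompt_token_ids = prompt_token_ids
        self.output_token_ids = []
        self.block_hashes = []
        self._block_hasher = block_hasher    # None disables hashing entirely
        if block_hasher is not None:
            self.block_hashes.extend(block_hasher(self))   # prompt blocks

    @property
    def all_token_ids(self):
        return self.prompt_token_ids + self.output_token_ids

    def append_output_token_ids(self, token_ids):
        self.output_token_ids.extend(token_ids)
        if self._block_hasher is not None:
            self.block_hashes.extend(self._block_hasher(self))


hasher = get_request_block_hasher(
    block_size=16,
    hash_fn=lambda x: hashlib.sha256(repr(x).encode()).digest())
req = Request(list(range(40)), block_hasher=hasher)   # 40 tokens -> 2 full blocks
assert len(req.block_hashes) == 2
```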

orozery avatar Aug 03 '25 18:08 orozery

Can you add more comments in hash_request_tokens?

You mean in get_request_block_hasher?

orozery avatar Aug 10 '25 06:08 orozery

I meant: move some of the comments in hash_request_tokens and cache_full_blocks into your new code.

heheda12345 avatar Aug 10 '25 06:08 heheda12345

@heheda12345 I made another set of changes. Thanks for taking the time to give such helpful comments!

orozery avatar Aug 10 '25 07:08 orozery

@orozery could you rebase on latest main? Should hopefully help with at least some of the CI failures.

njhill avatar Aug 11 '25 18:08 njhill

@orozery could you rebase on latest main? Should hopefully help with at least some of the CI failures.

done

orozery avatar Aug 11 '25 19:08 orozery

Can you rebase on main again? It should fix the test_eagle_correctness failure.

heheda12345 avatar Aug 14 '25 22:08 heheda12345