[V1] LogitsProcessor programming model

Open · afeldman-nm opened this pull request 7 months ago • 8 comments

This PR is a continuation of the draft PR here

https://github.com/vllm-project/vllm/pull/13360

In the context of the vLLM v1 engine, this PR (1) defines the programming model for creating logits processors (via subclassing a base LogitsProcessor class), (2) converts the hard-coded built-in logits processors (min-p, min-tokens penalty, and logit bias) into subclasses of LogitsProcessor, and (3) introduces the logic for applying a list of logits processors sequentially, in place, to the logits tensor.
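
For illustration, a minimal sketch of the sequential, in-place application step (illustrative names, not the PR's exact code):

import torch

def apply_logits_processors(
    logits: torch.Tensor,
    processors: list["LogitsProcessor"],
) -> torch.Tensor:
    # Each processor may mutate the logits tensor in place; whatever it
    # returns is fed to the next processor in the list.
    for processor in processors:
        logits = processor.apply(logits)
    return logits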

This PR does not:

  • Allow the user to extend the vLLM V1 engine with custom logits processors (#17799)
  • Allow the user to pass custom logits processor arguments via the OpenAI API (#16862)

Additional description (from #13360):

"Proposed abstraction for how to handle sampling parameters in relation to the persistent batch. This interface could then be used as an extension point for custom logits processors (note: see #16862 ).

Key goals/ideas:

  • Logits processor implementations are configured globally; per-request logits processors are not supported.
  • They apply at the batch level rather than per-request, to allow for / encourage vectorized application.
  • Each logits processor encapsulates its own state and is responsible for updating it as needed, based on notifications of persistent-batch updates and new output tokens each step. This minimizes the number of times tensors need to be reconstructed and updated on the GPU.

...I've implemented LPs for min_tokens, logit_bias and min_p, but if we decide to go this route it should be straightforward to refactor the others similarly...

from __future__ import annotations

import dataclasses
from abc import ABC, abstractmethod
from typing import List, Optional, Tuple

import torch

from vllm import SamplingParams


class LogitsProcessor(ABC):
    @abstractmethod
    def apply(self, logits: torch.Tensor) -> torch.Tensor:
        raise NotImplementedError

    @abstractmethod
    def update_states(
        self,
        batch_update: Optional[BatchUpdate] = None,
    ) -> None:
        """Called when there are new output tokens, prior
        to each forward pass.

        Args:
            batch_update: non-None iff there have been
                changes to the batch makeup.
        """
        raise NotImplementedError


@dataclasses.dataclass
class BatchUpdate:
    # Batch indices of any removed requests.
    removed: List[int]
    # (from, to) batch indices of any requests
    # moved within the batch.
    moved: List[Tuple[int, int]]
    # (index, params, output_tok_ids) for new
    # requests added to the batch.
    # TODO: may need to include one or two other things here,
    # like prompt token ids.
    added: List[Tuple[int, SamplingParams, List[int]]]
    # The current number of requests in the batch.
    batch_size: int
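
As an illustration of how the proposed interface would be used (a rough sketch, not this PR's actual built-in implementations; it assumes SamplingParams exposes a logit_bias dict mapping token id to bias), a logit-bias processor could subclass LogitsProcessor like this:

class LogitBiasLogitsProcessor(LogitsProcessor):
    """Sketch: per-request logit bias via the interface above."""

    def __init__(self):
        # Maps batch index -> {token_id: bias} for requests with logit_bias set.
        self.biases: dict[int, dict[int, float]] = {}

    def update_states(self, batch_update: Optional[BatchUpdate] = None) -> None:
        if batch_update is None:
            return  # batch makeup unchanged this step
        for index, params, _output_tok_ids in batch_update.added:
            if params.logit_bias:  # assumed SamplingParams field
                self.biases[index] = params.logit_bias
        for index in batch_update.removed:
            self.biases.pop(index, None)
        # Assumes destination slots were vacated first (batch-condense style).
        for from_index, to_index in batch_update.moved:
            if from_index in self.biases:
                self.biases[to_index] = self.biases.pop(from_index)

    def apply(self, logits: torch.Tensor) -> torch.Tensor:
        # Per-request loop kept for clarity; a real implementation would keep
        # index/value tensors on the GPU and apply the bias vectorized.
        for row, biases in self.biases.items():
            for token_id, bias in biases.items():
                logits[row, token_id] += bias
        return logits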

@WoosukKwon @AlpinDale @houseroad

afeldman-nm avatar Apr 16 '25 15:04 afeldman-nm

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of tests to quickly catch errors. You can run other CI tests on top of these by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to be added to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

github-actions[bot] avatar Apr 16 '25 15:04 github-actions[bot]

⚠️ The sha of the head commit of this PR conflicts with #13360. Mergify cannot evaluate rules on this PR. ⚠️

mergify[bot] avatar Apr 16 '25 15:04 mergify[bot]

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @afeldman-nm.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify[bot] avatar May 01 '25 14:05 mergify[bot]

Thanks for your PR! I think the interface looks good from the vLLM TPU side.

For the sampler implementation, TPU might have some differences due to the specifics of compilation and in-place updates.

cc @NickLucche

yaochengji avatar May 16 '25 21:05 yaochengji

Thanks @yaochengji @NickLucche for your feedback.

After discussion with @NickLucche, I decided not to make the new logits processor programming model compatible with TPU, at least not as part of this PR; I will favor backward compatibility instead. When vLLM is running on a TPU platform, it will use the current hard-coded logits processor implementations in vllm/v1/sample/tpu/sampler.py; when running on a non-TPU platform, it will use the new logits processor programming model introduced in this PR.

To that end, I revised the PR to remove tpu_input_batch.py and instead made gpu_input_batch.py backward-compatible with the "old" TPU implementation. In practice this means that on GPU, each LogitsProcessor subclass implements its processor logic and manages its own state (the new approach), while on TPU, the input batch manages logits processor state and the sampler applies the processors (the current TPU implementation).

This PR uses vllm.platforms.current_platform.is_tpu() to detect when vLLM is running on TPU.
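
For example, the dispatch looks roughly like this (a sketch of the pattern, not the exact code in the PR):

from vllm.platforms import current_platform

if current_platform.is_tpu():
    # TPU path: keep the existing behavior, where the input batch tracks
    # logits processor state and the TPU sampler applies the processors.
    ...
else:
    # Non-TPU path: use the new programming model, where each
    # LogitsProcessor subclass owns its state and its apply() logic.
    ...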

CC @njhill

afeldman-nm avatar May 27 '25 10:05 afeldman-nm

@afeldman-nm I think we should rework how the is_spec_decode_supported check is handled.

Specifically, we can add a set of request id strings to the input batch; when new requests are added to the batch and spec decoding is enabled, we pass the request object to a function that returns whether or not spec decoding supports it, and if not we add the req id to the set.

Then, in the place where is_spec_decode_supported is currently called, we just check whether the req id is in this set.

The LPs / LP state doesn't need to be involved at all.
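
A rough sketch of that suggestion (hypothetical names and attributes, not actual vLLM code):

from typing import Callable

class InputBatch:
    def __init__(self, is_spec_decode_supported: Callable[[object], bool]):
        # Predicate corresponding to the existing is_spec_decode_supported check.
        self._is_spec_decode_supported = is_spec_decode_supported
        # Request ids for which spec decoding is not supported.
        self.spec_decode_unsupported_reqs: set[str] = set()

    def add_request(self, request, spec_decode_enabled: bool) -> None:
        # Classify the request once, when it is added to the batch.
        if spec_decode_enabled and not self._is_spec_decode_supported(request):
            self.spec_decode_unsupported_reqs.add(request.req_id)

    def remove_request(self, req_id: str) -> None:
        self.spec_decode_unsupported_reqs.discard(req_id)

# At the point where is_spec_decode_supported is currently called, the check
# becomes a simple set-membership test, e.g.:
#   can_spec_decode = req_id not in batch.spec_decode_unsupported_reqs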

njhill avatar Jun 25 '25 20:06 njhill
