vllm Default to `generation_config` from model

Changes the default behaviour of the feature introduced in #11164. The new behaviour for --generation-config:

auto (default) - generation config is loaded from model path
vllm - vLLM defaults are used
{directory} - generation config is loaded from directory

As discussed in https://vllm-dev.slack.com/archives/C07R5Q1Q2BB/p1737033846038699

Merge after:

https://github.com/vllm-project/vllm/pull/14223
https://github.com/vllm-project/vllm/pull/14225

Notes:

lm_eval test threshold was lowered because the generation_config.json for Qwen/Qwen2-1.5B-Instruct includes the non-default repetition_penalty: 1.1. lm_eval does not remove this because it is not part of the OpenAI API. The decision to lower the threshold was discussed in https://vllm-dev.slack.com/archives/C07R5Q1Q2BB/p1741356496064409

Fixes #12096

Jan 31 '25 17:01 hmellor

👋 Hi! Thank you for contributing to the vLLM project. Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

Add ready label to the PR
Enable auto-merge.

🚀

Jan 31 '25 17:01 github-actions[bot]

Good point, the latest change updates the default in:

EngineArgs (including the CLI arg)
ModelConfig

Feb 28 '25 12:02 hmellor

Failing tests are likely due to changes in sampling behaviour

Feb 28 '25 14:02 hmellor

Interestingly the "V1 Test" will timeout because ModelConfig.get_diff_sampling_param() is called for every request.

The slow part of ModelConfig.get_diff_sampling_param() is ModelConfig.try_get_generation_config(), which reads the default config from disk using GenerationConfig.from_pretrained. This is, obviously, very slow.

Unfortunately, this means that this is likely not as simple as changing the default, because the feature itself is broken.

Mar 03 '25 18:03 hmellor

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @hmellor.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

Mar 05 '25 04:03 mergify[bot]

vllm vllm copied to clipboard

Default to `generation_config` from model

vllm
vllm copied to clipboard