[V1][Bugfix]: vllm v1 version metric num_gpu_blocks is None
FIX https://github.com/vllm-project/vllm/issues/15719
Test result:
👋 Hi! Thank you for contributing to the vLLM project.
💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.
Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.
🚀
This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @lengrongfu.
https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork
@WoosukKwon can you take a look? thanks ~
Great catch! Yes indeed, num_gpu_blocks isn't currently available in the frontend, so we need some way of getting it from the engine.
We currently return gpu_cache_usage in SchedulerStats:
@dataclass
class SchedulerStats:
    gpu_cache_usage: float = 0.0
It's tempting to include num_gpu_blocks there, even though it never changes. The overhead is small and the integration is easy!
Also, looking at the way BlockPool.get_usage() works, we could easily do:
from dataclasses import dataclass, field

@dataclass
class GPUCacheStats:
    num_blocks: int = 0
    num_free_blocks: int = 0

@dataclass
class SchedulerStats:
    gpu_cache_stats: GPUCacheStats = field(default_factory=GPUCacheStats)
It's easy to imagine we will want to add more to this over time.
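For illustration, a rough sketch of how the scheduler side could fill this in, reusing the dataclasses above; the BlockPool accessors used here (num_gpu_blocks, get_num_free_blocks()) are assumed names for this sketch, not a confirmed API:
def make_scheduler_stats(block_pool) -> SchedulerStats:
    # Both quantities come from the block pool; today get_usage() derives a
    # single usage ratio from the same information.
    return SchedulerStats(
        gpu_cache_stats=GPUCacheStats(
            num_blocks=block_pool.num_gpu_blocks,
            num_free_blocks=block_pool.get_num_free_blocks(),
        )
    )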
@WoosukKwon What do you think about this suggestion?
This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @lengrongfu.
https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork
@markmc PTAL
I haven't reviewed closely, just added a few comments of things that I noticed.
@njhill are there any implications from #15977 we should consider? This is basically information about the engine that is only available to the frontend once the engine has finished initializing.
@markmc it would probably be better to return this in the "ready" message that the engine sends to the front-end here once startup is complete, if possible.
Yeah, thanks. Just need to expand that I guess ...
These are the two sides of that:
https://github.com/vllm-project/vllm/blob/f6b32efb7f08a58e06cfee87c3dd241962bce3e1/vllm/v1/engine/core.py#L488-L489
https://github.com/vllm-project/vllm/blob/f6b32efb7f08a58e06cfee87c3dd241962bce3e1/vllm/v1/engine/core_client.py#L433-L434
Either extend this binary protocol with an int:
# engine core side:
message = b'READY' + struct.pack('!I', vllm_config.cache_config.num_gpu_blocks)
...
# front-end side:
header = data[:5]
num_gpu_blocks = struct.unpack('!I', data[5:])[0]
if header != b'READY':
    ...
vllm_config.cache_config.num_gpu_blocks = num_gpu_blocks
or do it with JSON:
# engine core side:
message_dict = {
    'type': 'READY',
    'num_gpu_blocks': vllm_config.cache_config.num_gpu_blocks,
}
message = json.dumps(message_dict).encode('utf-8')
...
# front-end side:
data = socket.recv(1024)
message_dict = json.loads(data.decode('utf-8'))
if message_dict['type'] == 'READY':
    vllm_config.cache_config.num_gpu_blocks = message_dict['num_gpu_blocks']
Thanks @markmc, yes I was thinking something like the json thing. I think json dumps/loads can be used without the string encode/decode though. Also when there's more than one engine we may want to record the prometheus metric with a label for the engine index.
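For illustration, a minimal self-contained sketch of that idea, assuming the handshake goes over a pyzmq socket (core.py and core_client.py use zmq); pyzmq's send_json()/recv_json() helpers handle the serialization, so no explicit encode/decode is needed. The socket setup, message keys, and the num_gpu_blocks value here are placeholders, not vLLM's actual wiring:
import zmq

ctx = zmq.Context.instance()
engine = ctx.socket(zmq.PAIR)    # stands in for the engine-core side
frontend = ctx.socket(zmq.PAIR)  # stands in for the front-end side
engine.bind('inproc://ready-handshake')
frontend.connect('inproc://ready-handshake')

# Engine side: announce readiness along with the block count.
engine.send_json({'status': 'READY', 'num_gpu_blocks': 17440})

# Front-end side: read the message and record the value.
msg = frontend.recv_json()
if msg['status'] == 'READY':
    num_gpu_blocks = msg['num_gpu_blocks']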
Is it better to send a generic msg structure separately after sending the 'READY' message? @njhill @markmc It seems like modifying the current structure wouldn't be too bad. 😄
@lengrongfu I don't think a separate message is needed.
Regarding the idea that "when there's more than one engine we may want to record the prometheus metric with a label for the engine index": the current prometheus metric vllm:cache_config does not support more than one engine. Does this structure also need to be modified?
@njhill @markmc please take a look, I re-implemented a version based on the discussed solution.
Latest results.
Can you update this PR with main? Then I can stamp it
Ok, I will update it later.
This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @lengrongfu.
https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork
Looks like some tests are failing consistently, PTAL
Hi, the CI pipeline tests are now passing. @DarkLight1337
Please take another look @DarkLight1337
I think with these changes, in the data parallel case (when there are multiple engines), the metrics will still only reflect the num_gpu_blocks from one of the engines. Typically the values would be similar across engines but we may want to consider whether they should be published individually with labels or aggregated in some other way.
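For illustration, one way to publish the value per engine would be a labelled gauge; this is just a sketch with hypothetical metric and label names, not vLLM's existing metrics code:
from prometheus_client import Gauge

# Hypothetical metric/label names, for illustration only.
num_gpu_blocks_gauge = Gauge(
    'vllm_cache_num_gpu_blocks',
    'Number of GPU KV-cache blocks available to an engine.',
    labelnames=['engine_index'],
)

def record_num_gpu_blocks(engine_index: int, num_gpu_blocks: int) -> None:
    # Each data-parallel engine reports under its own label value, so engines
    # do not overwrite each other's readings.
    num_gpu_blocks_gauge.labels(engine_index=str(engine_index)).set(num_gpu_blocks)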