
[V1][Bugfix]: vllm v1 version metric num_gpu_blocks is None

Open lengrongfu opened this issue 8 months ago • 12 comments

FIX https://github.com/vllm-project/vllm/issues/15719

Test result: (screenshot attached to the PR)

lengrongfu avatar Mar 29 '25 15:03 lengrongfu

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

github-actions[bot] avatar Mar 29 '25 15:03 github-actions[bot]

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @lengrongfu.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify[bot] avatar Mar 29 '25 15:03 mergify[bot]

@WoosukKwon can you take a look? thanks ~

lengrongfu avatar Mar 31 '25 02:03 lengrongfu

Great catch! Yes indeed, num_gpu_blocks isn't currently available in the frontend, so we need some way of getting it from the engine

We currently return gpu_cache_usage in SchedulerStats:

@dataclass
class SchedulerStats:
    gpu_cache_usage: float = 0.0

it's tempting to include num_gpu_blocks there, even though it never changes. The overhead is small and the integration is easy!

Also, looking at the way BlockPool.get_usage() works, we could easily do:

@dataclass
class GPUCacheStats:
    num_blocks: float = 0.0
    num_free_blocks: float = 0.0

@dataclass
class SchedulerStats:
    gpu_cache_stats: GPUCacheStats = field(default_factory=GPUCacheStats)

It's easy to imagine we will want to add more to this over time
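The nested-stats idea can be sketched end to end. This is a minimal runnable sketch, not vLLM code: the `usage` property and the int-typed block counts are assumptions, added here to mirror how a used-fraction like `BlockPool.get_usage()` would be derived from the two counts.

```python
from dataclasses import dataclass, field

@dataclass
class GPUCacheStats:
    # Block counts modeled as ints here (an assumption; the sketch
    # above used floats).
    num_blocks: int = 0
    num_free_blocks: int = 0

    @property
    def usage(self) -> float:
        # Fraction of GPU KV-cache blocks currently in use,
        # in the spirit of BlockPool.get_usage().
        if self.num_blocks == 0:
            return 0.0
        return 1.0 - self.num_free_blocks / self.num_blocks

@dataclass
class SchedulerStats:
    gpu_cache_stats: GPUCacheStats = field(default_factory=GPUCacheStats)

stats = SchedulerStats(
    gpu_cache_stats=GPUCacheStats(num_blocks=1000, num_free_blocks=250))
print(stats.gpu_cache_stats.usage)  # 0.75
```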

markmc avatar Mar 31 '25 17:03 markmc

@WoosukKwon What do you think about this suggestion?

lengrongfu avatar Apr 01 '25 06:04 lengrongfu

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @lengrongfu.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify[bot] avatar Apr 01 '25 17:04 mergify[bot]

@markmc PTAL

lengrongfu avatar Apr 07 '25 13:04 lengrongfu

I haven't reviewed closely, just added a few comments of things that I noticed.

@njhill is there any implications from #15977 we should consider? This is basically information about the engine that is only available to the frontend once the engine has finished initializing

@markmc it would probably be better to return this in the "ready" message that the engine sends to the front-end here once startup is complete, if possible.

Yeah, thanks. Just need to expand that I guess ...

These are the two sides of that:

https://github.com/vllm-project/vllm/blob/f6b32efb7f08a58e06cfee87c3dd241962bce3e1/vllm/v1/engine/core.py#L488-L489

https://github.com/vllm-project/vllm/blob/f6b32efb7f08a58e06cfee87c3dd241962bce3e1/vllm/v1/engine/core_client.py#L433-L434

Either extend this binary protocol with an int:

import struct

message = b'READY' + struct.pack('!I', vllm_config.cache_config.num_gpu_blocks)

...
header = data[:5]
if header != b'READY':
    ...
num_gpu_blocks = struct.unpack('!I', data[5:])[0]
vllm_config.cache_config.num_gpu_blocks = num_gpu_blocks
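For illustration, the binary variant can be exercised as a self-contained round trip; the block count here is an invented example value, not taken from a real config:

```python
import struct

num_gpu_blocks = 17408  # hypothetical example value

# Engine side: append a 4-byte big-endian unsigned int to the header.
data = b'READY' + struct.pack('!I', num_gpu_blocks)

# Frontend side: check the header, then unpack the payload.
header = data[:5]
assert header == b'READY'
decoded = struct.unpack('!I', data[5:])[0]
print(decoded)  # 17408
```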

or do it with JSON:

message_dict = {
    'type': 'READY',
    'num_gpu_blocks': vllm_config.cache_config.num_gpu_blocks,
}

message = json.dumps(message_dict).encode('utf-8')
...

data = socket.recv(1024)
message_dict = json.loads(data.decode('utf-8'))

if message_dict['type'] == 'READY':
    vllm_config.cache_config.num_gpu_blocks = message_dict['num_gpu_blocks']

markmc avatar Apr 08 '25 06:04 markmc

Thanks @markmc, yes I was thinking something like the json thing. I think json dumps/loads can be used without the string encode/decode though. Also when there's more than one engine we may want to record the prometheus metric with a label for the engine index.
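One detail supporting the point above: `json.loads` accepts `bytes` directly (since Python 3.6), so the receive side needs no explicit decode; `json.dumps` still returns `str`, though, so the send side either encodes it or uses something like pyzmq's `send_json`. A small sketch with an invented payload:

```python
import json

# json.loads handles raw bytes without a .decode('utf-8') step.
raw = b'{"type": "READY", "num_gpu_blocks": 17408}'
message_dict = json.loads(raw)
print(message_dict["num_gpu_blocks"])  # 17408
```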

njhill avatar Apr 08 '25 16:04 njhill

Is it better to send a generic message structure separately after sending the 'READY' message? @njhill @markmc It seems like modifying the current structure wouldn't be too bad. 😄

lengrongfu avatar Apr 08 '25 16:04 lengrongfu

@lengrongfu I don't think a separate message is needed.

njhill avatar Apr 08 '25 16:04 njhill

About the idea that "when there's more than one engine we may want to record the prometheus metric with a label for the engine index": the current prometheus metric vllm::cache_config does not support more than one engine. Does that structure also need to be modified?
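One way to make a metric multi-engine aware is an `engine` label per sample, in the style of prometheus_client's labeled gauges. A hypothetical stand-alone sketch (not the vLLM implementation; class name and values are invented):

```python
class LabeledGauge:
    """Minimal stand-in for a prometheus_client Gauge with an 'engine' label."""

    def __init__(self, name):
        self.name = name
        self.values = {}  # engine label -> gauge value

    def labels(self, engine):
        self._engine = engine
        return self

    def set(self, value):
        self.values[self._engine] = value

num_gpu_blocks_gauge = LabeledGauge("vllm:num_gpu_blocks")

# Hypothetical per-engine block counts in a data-parallel deployment.
for engine_index, blocks in [("0", 17408), ("1", 17280)]:
    num_gpu_blocks_gauge.labels(engine=engine_index).set(blocks)

print(num_gpu_blocks_gauge.values)  # {'0': 17408, '1': 17280}
```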

lengrongfu avatar Apr 08 '25 17:04 lengrongfu

@njhill @markmc please take a look, I re-implemented a version based on the discussed solution.

lengrongfu avatar Apr 14 '25 03:04 lengrongfu

Latest results: (screenshot attached to the PR)

lengrongfu avatar Apr 16 '25 09:04 lengrongfu

Can you update this PR with main? Then I can stamp it

DarkLight1337 avatar Apr 27 '25 13:04 DarkLight1337

Ok, I will update it later.

lengrongfu avatar Apr 27 '25 13:04 lengrongfu

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @lengrongfu.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify[bot] avatar Apr 27 '25 13:04 mergify[bot]

Looks like some tests are failing consistently, PTAL

DarkLight1337 avatar Apr 28 '25 08:04 DarkLight1337

Hi, the CI pipeline tests now pass. @DarkLight1337

lengrongfu avatar Apr 30 '25 01:04 lengrongfu

Please take another look @DarkLight1337

lengrongfu avatar Apr 30 '25 10:04 lengrongfu

I think with these changes, in the data parallel case (when there are multiple engines), the metrics will still only reflect the num_gpu_blocks from one of the engines. Typically the values would be similar across engines but we may want to consider whether they should be published individually with labels or aggregated in some other way.
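The two options above can be contrasted concretely. A hypothetical sketch with invented per-engine values:

```python
# Hypothetical per-engine capacities in a data-parallel deployment.
per_engine_blocks = {"0": 17408, "1": 17280}

# Option A: publish one labeled sample per engine, keeping each
# engine's capacity visible (at the cost of more time series).
samples = [(f'vllm:num_gpu_blocks{{engine="{idx}"}}', blocks)
           for idx, blocks in per_engine_blocks.items()]

# Option B: aggregate into a single total-capacity value (fewer
# series, but per-engine skew is hidden).
total_blocks = sum(per_engine_blocks.values())
print(total_blocks)  # 34688
```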

njhill avatar May 01 '25 15:05 njhill