[V1][Bugfix]: vllm v1 version metric num_gpu_blocks is None
FIX https://github.com/vllm-project/vllm/issues/15719
Test result:
👋 Hi! Thank you for contributing to the vLLM project.
💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.
Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.
🚀
This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @lengrongfu.
https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork
@WoosukKwon can you take a look? thanks ~
Great catch! Yes indeed, num_gpu_blocks isn't currently available in the frontend, so we need some way of getting it from the engine.
We currently return gpu_cache_usage in SchedulerStats:
@dataclass
class SchedulerStats:
    gpu_cache_usage: float = 0.0
It's tempting to include num_gpu_blocks there, even though it never changes. The overhead is small and the integration is easy!
Also, looking at the way BlockPool.get_usage() works, we could easily do:
from dataclasses import dataclass, field

@dataclass
class GPUCacheStats:
    num_blocks: int = 0
    num_free_blocks: int = 0

@dataclass
class SchedulerStats:
    gpu_cache_stats: GPUCacheStats = field(default_factory=GPUCacheStats)
It's easy to imagine we will want to add more to this over time.
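For illustration, a rough sketch of how the scheduler side could fill this in, reusing the dataclasses above; the BlockPool accessors used here (num_gpu_blocks, get_num_free_blocks()) are assumed names for this sketch, not a confirmed API:
def make_scheduler_stats(block_pool) -> SchedulerStats:
    # Both quantities come from the block pool; today get_usage() derives a
    # single usage ratio from the same information.
    return SchedulerStats(
        gpu_cache_stats=GPUCacheStats(
            num_blocks=block_pool.num_gpu_blocks,
            num_free_blocks=block_pool.get_num_free_blocks(),
        )
    )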
@WoosukKwon What do you think about this suggestion?
This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @lengrongfu.
https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork
@markmc PTAL
I haven't reviewed closely, just added a few comments of things that I noticed.
@njhill are there any implications from #15977 we should consider? This is basically information about the engine that is only available to the frontend once the engine has finished initializing.
@markmc it would probably be better to return this in the "ready" message that the engine sends to the front-end here once startup is complete, if possible.
Yeah, thanks. Just need to expand that I guess ...
These are the two sides of that:
https://github.com/vllm-project/vllm/blob/f6b32efb7f08a58e06cfee87c3dd241962bce3e1/vllm/v1/engine/core.py#L488-L489
https://github.com/vllm-project/vllm/blob/f6b32efb7f08a58e06cfee87c3dd241962bce3e1/vllm/v1/engine/core_client.py#L433-L434
Either extend this binary protocol with an int:
# engine core side:
message = b'READY' + struct.pack('!I', vllm_config.cache_config.num_gpu_blocks)
...
# front-end side:
header = data[:5]
num_gpu_blocks = struct.unpack('!I', data[5:])[0]
if header != b'READY':
    ...
vllm_config.cache_config.num_gpu_blocks = num_gpu_blocks
or do it with JSON:
# engine core side:
message_dict = {
    'type': 'READY',
    'num_gpu_blocks': vllm_config.cache_config.num_gpu_blocks,
}
message = json.dumps(message_dict).encode('utf-8')
...
# front-end side:
data = socket.recv(1024)
message_dict = json.loads(data.decode('utf-8'))
if message_dict['type'] == 'READY':
    vllm_config.cache_config.num_gpu_blocks = message_dict['num_gpu_blocks']
Thanks @markmc, yes I was thinking something like the json thing. I think json dumps/loads can be used without the string encode/decode though. Also when there's more than one engine we may want to record the prometheus metric with a label for the engine index.
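For illustration, a minimal self-contained sketch of that idea, assuming the handshake goes over a pyzmq socket (core.py and core_client.py use zmq); pyzmq's send_json()/recv_json() helpers handle the serialization, so no explicit encode/decode is needed. The socket setup, message keys, and the num_gpu_blocks value here are placeholders, not vLLM's actual wiring:
import zmq

ctx = zmq.Context.instance()
engine = ctx.socket(zmq.PAIR)    # stands in for the engine-core side
frontend = ctx.socket(zmq.PAIR)  # stands in for the front-end side
engine.bind('inproc://ready-handshake')
frontend.connect('inproc://ready-handshake')

# Engine side: announce readiness along with the block count.
engine.send_json({'status': 'READY', 'num_gpu_blocks': 17440})

# Front-end side: read the message and record the value.
msg = frontend.recv_json()
if msg['status'] == 'READY':
    num_gpu_blocks = msg['num_gpu_blocks']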
Is it better to send a generic msg structure separately after sending the 'READY' message? @njhill @markmc It seems like modifying the current structure wouldn't be too bad. 😄
@lengrongfu I don't think a separate message is needed.
Regarding the idea that "when there's more than one engine we may want to record the prometheus metric with a label for the engine index": the current prometheus metric vllm:cache_config does not support more than one engine. Does this structure also need to be modified?
@njhill @markmc please take a look, I re-implemented a version based on the discussed solution.
Latest results.
Can you update this PR with main? Then I can stamp it
Ok, I will update it later.
This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @lengrongfu.
https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork
Looks like some tests are failing consistently, PTAL
Hi, the CI pipeline tests are now passing. @DarkLight1337
Please take another look @DarkLight1337
I think with these changes, in the data parallel case (when there are multiple engines), the metrics will still only reflect the num_gpu_blocks from one of the engines. Typically the values would be similar across engines but we may want to consider whether they should be published individually with labels or aggregated in some other way.
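For illustration, one way to publish the value per engine would be a labelled gauge; this is just a sketch with hypothetical metric and label names, not vLLM's existing metrics code:
from prometheus_client import Gauge

# Hypothetical metric/label names, for illustration only.
num_gpu_blocks_gauge = Gauge(
    'vllm_cache_num_gpu_blocks',
    'Number of GPU KV-cache blocks available to an engine.',
    labelnames=['engine_index'],
)

def record_num_gpu_blocks(engine_index: int, num_gpu_blocks: int) -> None:
    # Each data-parallel engine reports under its own label value, so engines
    # do not overwrite each other's readings.
    num_gpu_blocks_gauge.labels(engine_index=str(engine_index)).set(num_gpu_blocks)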