[WIP][Metrics] Re-work approach to LoRA metrics
Part of #10582 and discussed in #12745
The current vllm:lora_requests_info Gauge is somewhat similar to an Info metric (like cache_config_info), except that the value is the current wall-clock time and it is updated every iteration.
The label names used are:
- running_lora_adapters: a list of adapters with running requests, formatted as a comma-separated string.
- waiting_lora_adapters: similar, except listing adapters with requests waiting to be scheduled.
- max_lora: the static "max number of LoRAs in a single batch" configuration.
It looks like this:
vllm:lora_requests_info{max_lora="1",running_lora_adapters="",waiting_lora_adapters=""} 1.7395575657589855e+09
vllm:lora_requests_info{max_lora="1",running_lora_adapters="test-lora",waiting_lora_adapters=""} 1.7395575723949368e+09
vllm:lora_requests_info{max_lora="1",running_lora_adapters="test-lora",waiting_lora_adapters="test-lora"} 1.7395575717647147e+09
I can't really make much sense of this. Encoding a running/waiting status for multiple adapters in a comma-separated string seems quite misguided - we should use labels to distinguish between per-adapter counts instead:
vllm:num_lora_requests_running{lora_name="test-lora",model_name="meta-llama/Llama-3.1-8B-Instruct"} 8.0
vllm:num_lora_requests_waiting{lora_name="test-lora",model_name="meta-llama/Llama-3.1-8B-Instruct"} 7.0
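For illustration, a minimal sketch of how such per-adapter gauges could be emitted with prometheus_client. The wiring (and the assumption that the scheduler can provide per-adapter running/waiting counts) is illustrative only, not the actual implementation in this PR:

```python
# Sketch only: per-adapter request-count gauges, assuming the scheduler
# can report running/waiting counts keyed by adapter name.
from prometheus_client import Gauge

gauge_lora_requests_running = Gauge(
    "vllm:num_lora_requests_running",
    "Number of running requests per LoRA adapter.",
    labelnames=["model_name", "lora_name"])

gauge_lora_requests_waiting = Gauge(
    "vllm:num_lora_requests_waiting",
    "Number of waiting requests per LoRA adapter.",
    labelnames=["model_name", "lora_name"])

def record_lora_counts(model_name: str,
                       running: dict[str, int],
                       waiting: dict[str, int]) -> None:
    """Update the per-adapter gauges once per scheduler iteration."""
    for lora_name, count in running.items():
        gauge_lora_requests_running.labels(
            model_name=model_name, lora_name=lora_name).set(count)
    for lora_name, count in waiting.items():
        gauge_lora_requests_waiting.labels(
            model_name=model_name, lora_name=lora_name).set(count)
```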
This was added in #9477 and there is at least one known user. If we revisit this design and deprecate the old metric, we should reduce the need for a significant deprecation period by making the change in V0 also and asking that project to move to the new metric.
TODO:
- [ ] Add a lora config info gauge - max_loras, max_lora_rank (see the sketch after this list)
- [ ] Add more unit test coverage of the new metrics
- [ ] Add the new metrics to V1
- [ ] Add the old metric to V0 (to ease the transition to V1)
- [ ] Deprecate the old metrics
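As a rough sketch of the config info gauge item above, following the same pattern as cache_config_info (a Gauge whose labels carry static config and whose value is always 1). The metric name and helper here are assumptions:

```python
# Sketch only: a LoRA config info gauge in the cache_config_info style.
from prometheus_client import Gauge

def make_lora_config_info(max_loras: int, max_lora_rank: int) -> Gauge:
    gauge = Gauge(
        "vllm:lora_config_info",
        "Information about the LoRA configuration.",
        labelnames=["max_loras", "max_lora_rank"])
    # Static config goes in the labels; the value is always 1.
    gauge.labels(max_loras=str(max_loras),
                 max_lora_rank=str(max_lora_rank)).set(1)
    return gauge

# Example scrape output:
#   vllm:lora_config_info{max_lora_rank="16",max_loras="1"} 1.0
```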
👋 Hi! Thank you for contributing to the vLLM project.
💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, covering a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.
Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.
🚀
You could argue either:
- We don't need per-adapter counts at all, just an info metric (like cache_config_info) that lists the configured adapters, or
- Most of our metrics should be per-adapter ... just break them down per-adapter and label with lora_name=
See also #6275
ok, I took a closer look at what the Gateway API Inference Extension is doing with this metric. I've filed kubernetes-sigs/gateway-api-inference-extension#354 to invite feedback from that project.
The premise is this:
route to a model server that has the adapter already loaded, so long as there is batch capacity
and the way the metric is described:
Metric name implemented in vLLM: vllm:lora_requests_info
Metric type: Gauge
Metric value: The last updated timestamp (so the Endpoint Picker (EPP) can find the latest)
Metric labels:
- max_lora: The maximum number of adapters that can be loaded to GPU memory to serve a batch. Requests will be queued if the model server has reached MaxActiveAdapter and cannot load the requested adapter. Example: "max_lora": "8"
- running_lora_adapters: A comma separated list of adapters that are currently loaded in GPU memory and ready to serve requests. Example: "running_lora_adapters": "adapter1, adapter2"
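To make the awkwardness concrete, here is a minimal sketch (with made-up sample data) of what a consumer like the EPP has to do with the current string-encoded metric: keep the sample whose value (a wall-clock timestamp) is newest, then split the comma-separated label. The `samples` structure is a stand-in for whatever the scraper returns:

```python
# Sketch only: consuming the current vllm:lora_requests_info metric.
def running_adapters(samples: list[dict]) -> list[str]:
    """Each sample is {'labels': {...}, 'value': <last-updated timestamp>}."""
    if not samples:
        return []
    # The newest timestamp identifies the most recent update.
    latest = max(samples, key=lambda s: s["value"])
    label = latest["labels"].get("running_lora_adapters", "")
    return [name.strip() for name in label.split(",") if name.strip()]

samples = [{
    "labels": {"max_lora": "1",
               "running_lora_adapters": "adapter1, adapter2",
               "waiting_lora_adapters": ""},
    "value": 1.7395575723949368e+09,
}]
assert running_adapters(samples) == ["adapter1", "adapter2"]
```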
Given the way LRUCacheWorkerLoRAManager works, the current V0 metric implementation and what's proposed here in V1 both miss an important point - even if there were no requests for a given LoRA in the most recent batch, that LoRA's weights could still be loaded on the GPUs, and it would still be efficient to route requests for it to this vLLM instance.
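For illustration, one way this could be exposed as its own signal, separate from running/waiting counts. The metric name and the assumption that the worker's LoRA manager can report its cached adapter set are hypothetical, not something this PR implements:

```python
# Sketch only: a per-adapter "resident in GPU memory" gauge. The
# cached_adapters set is assumed to come from the worker's LRU cache.
from prometheus_client import Gauge

gauge_lora_cached = Gauge(
    "vllm:lora_adapter_cached",
    "1 if the LoRA adapter is currently loaded in GPU memory, else 0.",
    labelnames=["model_name", "lora_name"])

def record_cached_adapters(model_name: str,
                           all_adapters: set[str],
                           cached_adapters: set[str]) -> None:
    for lora_name in all_adapters:
        gauge_lora_cached.labels(
            model_name=model_name,
            lora_name=lora_name).set(1 if lora_name in cached_adapters else 0)
```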
Hi @markmc! Thanks for doing this! I looked through https://github.com/vllm-project/vllm/issues/6275. On top of what you propose (adapters + counts), the metrics proposed there look very informative.
from the RFC:
- Loading and unloading times for LoRA adapters.
- Memory and compute resource usage by LoRA adapters.
- Performance impact on base models when using LoRA adapters.
About,
Loading and unloading times for LoRA adapters.
I think this is good information. The max_loras value, combined with the number of running/waiting LoRAs, will inform users about the dynamics of LoRA loads. I believe load times could serve as good secondary information.
Side note: in V1, LoRA adapter loads are triggered here: https://github.com/vllm-project/vllm/blob/38094584566b89210a6f72a408eba1fae43c3d81/vllm/v1/worker/gpu_model_runner.py#L581. In this vein, i.e. informing users about input preparation time, I think we should consider exposing the run-time of the _prepare_inputs function in general.
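As a rough sketch of what recording load times might look like - the metric name, buckets, and where the timer would be wrapped (e.g. around the adapter activation triggered from _prepare_inputs) are all assumptions:

```python
# Sketch only: recording LoRA adapter load time with a Histogram.
import time
from prometheus_client import Histogram

histogram_lora_load_time = Histogram(
    "vllm:lora_adapter_load_seconds",
    "Time taken to load a LoRA adapter onto the GPU.",
    labelnames=["model_name", "lora_name"],
    buckets=[0.01, 0.05, 0.1, 0.5, 1.0, 5.0])

def timed_load(model_name: str, lora_name: str, load_fn) -> None:
    """Wrap a (hypothetical) adapter load call and record its duration."""
    start = time.perf_counter()
    try:
        load_fn()
    finally:
        histogram_lora_load_time.labels(
            model_name=model_name,
            lora_name=lora_name).observe(time.perf_counter() - start)
```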
About,
Memory and compute resource usage by LoRA adapters.
For memory usage, we already profile a forward pass when determining the available memory for the KV cache. However, that profiling is not granular enough to report the memory usage of particular LoRA adapters. Also, since the max_loras engine argument effectively limits the number of LoRA adapters in use, I believe the memory usage by LoRA adapters would be constant and wouldn't change at runtime.
For compute usage by LoRA adapters, I am not sure how we would do this at this granularity. Perhaps we could add a "gpu-utilization" metric to Stats for users to infer it from.
I believe this set of metrics is less important; maybe we can tackle the memory and compute usage metrics after the first cut.
About,
even if there were no requests for a given LoRA in the most recent batch, that LoRA's weights could still be loaded on the GPUs, and it would still be efficient to route requests for it to this vLLM instance.
and,
Performance impact on base models when using LoRA adapters.
I believe the existing iteration-level metrics (https://github.com/vllm-project/vllm/blob/38094584566b89210a6f72a408eba1fae43c3d81/vllm/engine/llm_engine.py#L1600) combined with what you proposed (adapters + counts) should inform the user of this.
Implemented the V0 metric in V1 in #13504
This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @markmc.
https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork
I've filed kubernetes-sigs/gateway-api-inference-extension#354 to invite feedback from that project.
Deferring for now, based on the feedback above.