Add model_load_time metric
## What does this PR do?
Adds a metric that measures the time spent downloading the model, loading it into GPU memory, and waiting for the server to be ready to receive a request.
Because the router is the component that emits metrics, while the launcher is the one that downloads the model (and thus where the download time is tracked), the duration has to be passed from the launcher to the router. I considered passing it as a CLI argument but, to minimize the number of changes required, opted to use an environment variable instead. Open to suggestions on how to better pass the value to the router.
To make it easier to work with Rust's Instant type, I opted to measure two separate values and add them together: the time from the start of the model download to launching the router, and the time from launching the router to the router being ready to receive requests.
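The launcher-to-router handoff described above could look roughly like this. This is a minimal sketch, not the PR's actual code: the environment variable name, function names, and the router binary name are all hypothetical placeholders.

```rust
use std::env;
use std::process::Command;
use std::time::{Duration, Instant};

// Hypothetical env var name; the PR may use a different one.
const DOWNLOAD_SECS_ENV: &str = "MODEL_DOWNLOAD_DURATION_SECS";

// Launcher side: after timing the download, hand the duration to the
// router by setting an environment variable on the spawned process.
fn router_command(download_elapsed: Duration) -> Command {
    let mut cmd = Command::new("text-generation-router");
    cmd.env(
        DOWNLOAD_SECS_ENV,
        download_elapsed.as_secs_f64().to_string(),
    );
    cmd
}

// Router side: combine the launcher's duration with the router's own
// time-to-ready, yielding the total model load time.
fn model_load_time(router_start: Instant) -> Duration {
    let download_secs: f64 = env::var(DOWNLOAD_SECS_ENV)
        .ok()
        .and_then(|v| v.parse().ok())
        .unwrap_or(0.0); // fall back to 0 if the launcher did not set it
    Duration::from_secs_f64(download_secs) + router_start.elapsed()
}

fn main() {
    // Simulate: the launcher measured a 12.5 s download and is about to
    // spawn the router with that value in its environment.
    let cmd = router_command(Duration::from_secs_f64(12.5));
    println!("spawning: {:?}", cmd);

    // On the router side, add its own startup time to the download time.
    let total = model_load_time(Instant::now());
    println!("model_load_time = {:.3}s", total.as_secs_f64());
}
```

Summing the two spans this way avoids having to serialize an `Instant` (which has no stable representation across processes); only a plain `Duration`, encoded as seconds, crosses the process boundary.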
This PR is part of the metrics standardization effort.
Fixes # (issue)
## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- [X] Did you read the contributor guideline, Pull Request section?
- [X] Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case. #1977
- [ ] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
- [ ] Did you write any new necessary tests?
## Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.
I'm still not sure what the interest is in measuring one huge metric that encompasses so many different things:
- Model download time (and from where the model is downloaded)
- Potential model conversion
- The actual model load time, which can come from disk or from CPU RAM
As a model runner, time-to-ready can be interesting, but without more context it's quite useless, no? What are users supposed to do with that metric? It's also never modified during the server's lifetime, so it's not really probing the system for monitoring, no? In our monitoring systems, we only care about the logs, which give insight into what's happening: they record every step, why it occurs, and how long it takes.
I have nothing against the PR itself, but adding code without clear reasons for clear benefits is always a bit strange.
> What are users supposed to do with that metric?
Because it is essentially the startup latency of the model, it is useful for determining the pod autoscaling threshold and frequency. If model_load_time is above, say, 40 seconds, the user might not want to scale the number of pods down too often, since it takes that long to create a new pod once demand rises again.
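As a rough illustration of that reasoning, an autoscaler could derive its scale-down cooldown from the metric. This is a hypothetical policy sketch, not anything in the PR; the function name and the 10x/60 s constants are invented for the example.

```rust
// Hypothetical policy: scale down less aggressively when pods are slow
// to start, so capacity is not torn down that is expensive to rebuild.
fn scale_down_cooldown_secs(model_load_time_secs: f64) -> u64 {
    // Keep pods around for at least ten times the startup latency,
    // with a floor of 60 seconds for fast-loading models.
    (model_load_time_secs * 10.0).max(60.0) as u64
}

fn main() {
    // A 40 s model load implies a long cooldown before scaling down...
    println!("cooldown: {}s", scale_down_cooldown_secs(40.0)); // 400s
    // ...while a 2 s load falls back to the 60 s floor.
    println!("cooldown: {}s", scale_down_cooldown_secs(2.0)); // 60s
}
```

The exact policy is up to the operator; the point is that the decision needs the startup latency as an input, which is what this metric exposes.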
cc @achandrasekar