kserve
Add Prometheus metrics to the Python SDK
What this PR does / why we need it:
Our users have come to us with many questions about latencies in the kserve-container. Currently, we have no way to measure the latency of each step in the transformer/predictor. Observability into per-step latency will let us quickly identify which step is the bottleneck when request latencies are high.
This PR adds Prometheus histogram latency metrics for each step (pre/post processing, explain, predict) in the Python SDK. The metrics are exposed on /metrics. Additionally, the latency of each step is logged per request id, so if you choose not to use Prometheus metrics, inspecting the logs is another option.
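As a rough illustration of the per-step timing and per-request logging described above, here is a minimal sketch using only the Python standard library. It is not the code in this PR: the names `timed_step`, `handle_request`, and `LATENCIES` are hypothetical, the in-memory dictionary stands in for the Prometheus histograms, and the log format is an assumption.

```python
import logging
import time
from collections import defaultdict
from contextlib import contextmanager

logger = logging.getLogger("latency_sketch")

# Hypothetical in-memory store (stand-in for Prometheus histograms):
# step name -> list of observed latencies in seconds.
LATENCIES = defaultdict(list)


@contextmanager
def timed_step(step, request_id):
    """Time one step, record the latency, and log it with the request id."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        LATENCIES[step].append(elapsed)
        # Assumed log format; the real SDK's format may differ.
        logger.info("requestId: %s, %s_ms: %.3f", request_id, step, elapsed * 1000)


def handle_request(request_id, payload):
    """Toy request handler with three timed steps (placeholder logic)."""
    with timed_step("preprocess", request_id):
        features = [x * 2 for x in payload]
    with timed_step("predict", request_id):
        prediction = sum(features)
    with timed_step("postprocess", request_id):
        response = {"predictions": [prediction]}
    return response
```

Calling `handle_request("req-1", [1, 2, 3])` records one latency sample per step in `LATENCIES` and emits one log line per step. A histogram-based version would observe `elapsed` on a per-step histogram instead of appending to a list.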
Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #
Type of changes
Please delete options that are not relevant.
- [ ] Bug fix (non-breaking change which fixes an issue)
- [x] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
- [ ] This change requires a documentation update
Feature/Issue validation/testing:
Please describe the tests that you ran to verify your changes and give a summary of the relevant results. Provide instructions so they can be reproduced. Please also list any relevant details of your test configuration.
- [x] Test documented in transformer sample https://github.com/kserve/kserve/pull/2425/files#diff-a8d78136ac31d86dd82d5b2d24410ca6eb48265aae92c5ea63e360d56dfdf58aR148 I ran the transformer and curled localhost:8080/metrics to see the Prometheus metrics output.
- Logs
Special notes for your reviewer:
- Please confirm that if this PR changes any image versions, then that's the sole change this PR makes.
Checklist:
- [x] Have you added unit/e2e tests that prove your fix is effective or that this feature works?
- [ ] Has code been commented, particularly in hard-to-understand areas?
- [x] Have you made corresponding changes to the documentation?
Release note:
Awesome work! @alexagriffith
/lgtm /approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: alexagriffith, yuzisun
The full list of commands accepted by this bot can be found here.
The pull request process is described here
- ~~OWNERS~~ [yuzisun]
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
@alexagriffith Can you help update the website doc for the new argument of the model server?
https://kserve.github.io/website/master/modelserving/v1beta1/custom/custom_model/#arguments