Queue-proxy scrape prometheus metrics from containers in pod

Open alexagriffith opened this issue 2 years ago • 4 comments

Describe the feature

There is an issue with configuring multiple Prometheus ports for a single pod. Currently, Prometheus does not support this use case, and we would like a workaround so that we can scrape metrics from queue-proxy plus another container in the same pod.

Background: We are using Knative with KServe. Right now we have two containers in one pod: queue-proxy and kserve-container. Queue-proxy emits Prometheus metrics, and we want kserve-container to emit its own, distinct metrics: latency histograms for each step/method called in kserve-container.

There is a GitHub issue describing the problem with Prometheus and configuring scraping from multiple ports in a single pod. The hacky solution there is to use relabel config settings.

Another option we were thinking about is to implement a pattern similar to istio-proxy: have queue-proxy scrape kserve-container and then expose those Prometheus metrics from queue-proxy alongside its own (see Istio's Prometheus Scraping Standardization doc).
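To make the merging idea concrete, here is a minimal Python sketch of what the proxy side might do once it has fetched both payloads. It is not how queue-proxy or istio-proxy is implemented; `merge_metrics` is a hypothetical helper, and it assumes the two containers expose disjoint metric names in the Prometheus text exposition format.

```python
def merge_metrics(*payloads: str) -> str:
    """Concatenate Prometheus text-format payloads into one response body.

    Assumption: metric names do not collide across payloads. Repeated
    "# HELP" / "# TYPE" lines for the same metric are dropped so the
    metadata for each metric appears only once.
    """
    seen_meta = set()
    merged = []
    for payload in payloads:
        for line in payload.splitlines():
            if not line.strip():
                continue  # skip blank lines between payloads
            if line.startswith("# HELP ") or line.startswith("# TYPE "):
                # key is ('#', 'HELP' or 'TYPE', metric_name)
                key = tuple(line.split(" ", 3)[:3])
                if key in seen_meta:
                    continue
                seen_meta.add(key)
            merged.append(line)
    return "\n".join(merged) + "\n"
```

With something like this, Prometheus would only need to scrape a single port per pod, at the cost of the proxy having to handle name collisions and scrape failures from the other container.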

Curious if this is an issue others are experiencing and if there are any other ideas? Thanks!

alexagriffith, Sep 12 '22 17:09

What's the goal of scraping the metrics from kserve-container?

jwcesign, Sep 14 '22 02:09

We need to get latency metrics for each method in the kserve-container. This is really important for understanding where the bottleneck is, if there is one. For example, if a request in the kserve-container hits the pre_process, predict, and then post_process methods and the overall latency is very high, we currently have no visibility into which step is the bottleneck. Adding histogram metrics around each method would give us that visibility. Even if we only had one method running in another container alongside queue-proxy, it would still be useful to have some metrics on its performance.
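As a rough illustration of the kind of per-method histogram meant here, the sketch below uses only the Python standard library. In practice a Prometheus client library would normally handle this bookkeeping; `LatencyHistogram`, the bucket bounds, and the metric name `request_step_latency_seconds` are all hypothetical choices for the example.

```python
import time
from bisect import bisect_left
from collections import defaultdict
from functools import wraps

BUCKETS = (0.005, 0.01, 0.05, 0.1, 0.5, 1.0, 5.0)  # upper bounds in seconds (illustrative)


class LatencyHistogram:
    def __init__(self, buckets=BUCKETS):
        self.buckets = tuple(buckets)
        # Per method: one counter per bucket plus a final +Inf slot.
        self.counts = defaultdict(lambda: [0] * (len(self.buckets) + 1))
        self.sums = defaultdict(float)

    def observe(self, method, seconds):
        # bisect_left gives the first bucket whose upper bound is >= seconds.
        self.counts[method][bisect_left(self.buckets, seconds)] += 1
        self.sums[method] += seconds

    def timed(self, func):
        """Decorator that records the wall-clock latency of each call."""
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                self.observe(func.__name__, time.perf_counter() - start)
        return wrapper

    def render(self, name="request_step_latency_seconds"):
        """Render cumulative buckets in the Prometheus text format."""
        lines = [f"# TYPE {name} histogram"]
        for method, counts in self.counts.items():
            cumulative = 0
            for bound, count in zip(self.buckets + (float("inf"),), counts):
                cumulative += count
                le = "+Inf" if bound == float("inf") else repr(bound)
                lines.append(f'{name}_bucket{{method="{method}",le="{le}"}} {cumulative}')
            lines.append(f'{name}_sum{{method="{method}"}} {self.sums[method]}')
            lines.append(f'{name}_count{{method="{method}"}} {cumulative}')
        return "\n".join(lines) + "\n"
```

Decorating pre_process, predict, and post_process with `timed` on a shared instance would then give one labelled series per step, which is what makes the slow step visible.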

This is an issue due to the current limitations of Prometheus, as noted in the GitHub issue linked above.

alexagriffith, Sep 14 '22 19:09

I think this is not a common use case. Maybe you can expose a service port for the user-container and scrape the metrics with HTTP GET requests?

jwcesign, Sep 19 '22 01:09

> I think this is not a common use case. Maybe u can give a service port for user-container, and u can use like http get requests to scrap the metrics?

Also, here is a thread in the Knative Slack channel that may be helpful for this issue: https://knative.slack.com/archives/C93E33SN8/p1662483290964459

alexagriffith, Sep 19 '22 22:09

This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.

github-actions[bot], Dec 19 '22 01:12