
Proposal for exposing generic prometheus metrics in common operator

Open ywskycn opened this issue 6 years ago • 11 comments

Proposal

Add generic metrics (jobs/pods/...) to the common operator, so that operators built on the common operator can enable and use them directly.

Motivation

To track job-level metrics, we currently need to add Prometheus metric code inside each job operator. For example, to know how many TFJobs were created in the last hour, we need to add a Counter inside tf-operator. This need is very common and applies to different operators. As we're moving common code to the common operator, we could also add the metric-related code there, so that it can be used by all operators built on the common one.

Details

For metric definition and registration, we will add a new metrics folder where all metrics will be defined. Some preliminary metrics include the number of jobs/pods/services created, durations of various operations, etc.

For metrics updating:

  • For pods/services, we can add the related metric code directly inside job_controller/pod.go and job_controller/service.go.
  • For jobs, to track the counts, we may need to watch the creation events, similar to controller_watches.
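
To make the idea above concrete, here is a minimal, dependency-free sketch of what a shared metrics package could look like. It is only an illustration under stated assumptions: every name in it (Registry, Counter, JobMetrics, NewJobMetrics, and the metric names) is hypothetical and not taken from the common operator, and a real implementation would most likely use the official Prometheus Go client rather than the hand-rolled counter shown here.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"sync"
	"sync/atomic"
)

// Counter is a monotonically increasing value that is safe for concurrent use.
// Hypothetical stand-in for a Prometheus client counter.
type Counter struct {
	value uint64 // accessed atomically; kept first for 64-bit alignment
	name  string
	help  string
}

// Inc adds one to the counter.
func (c *Counter) Inc() { atomic.AddUint64(&c.value, 1) }

// Value returns the current count.
func (c *Counter) Value() uint64 { return atomic.LoadUint64(&c.value) }

// Registry holds every counter defined by the shared metrics package.
type Registry struct {
	mu       sync.Mutex
	counters map[string]*Counter
}

// NewRegistry returns an empty registry.
func NewRegistry() *Registry {
	return &Registry{counters: make(map[string]*Counter)}
}

// Counter returns the counter registered under name, creating it on first use.
func (r *Registry) Counter(name, help string) *Counter {
	r.mu.Lock()
	defer r.mu.Unlock()
	if c, ok := r.counters[name]; ok {
		return c
	}
	c := &Counter{name: name, help: help}
	r.counters[name] = c
	return c
}

// Text renders all counters in the Prometheus text exposition format,
// sorted by metric name so the output is deterministic.
func (r *Registry) Text() string {
	r.mu.Lock()
	defer r.mu.Unlock()
	names := make([]string, 0, len(r.counters))
	for name := range r.counters {
		names = append(names, name)
	}
	sort.Strings(names)
	var b strings.Builder
	for _, name := range names {
		c := r.counters[name]
		fmt.Fprintf(&b, "# HELP %s %s\n# TYPE %s counter\n%s %d\n",
			c.name, c.help, c.name, c.name, c.Value())
	}
	return b.String()
}

// JobMetrics groups the generic counters an operator gets from the common code.
type JobMetrics struct {
	JobsCreated     *Counter
	PodsCreated     *Counter
	ServicesCreated *Counter
}

// NewJobMetrics registers the generic counters under an operator-specific
// prefix (for example "tfjob"). The metric names are assumptions.
func NewJobMetrics(r *Registry, prefix string) *JobMetrics {
	return &JobMetrics{
		JobsCreated:     r.Counter(prefix+"_jobs_created_total", "Number of jobs created."),
		PodsCreated:     r.Counter(prefix+"_pods_created_total", "Number of pods created."),
		ServicesCreated: r.Counter(prefix+"_services_created_total", "Number of services created."),
	}
}
```

Under this sketch, the pod and service creation paths would call m.PodsCreated.Inc() and m.ServicesCreated.Inc() after a successful create, the job creation watch would call m.JobsCreated.Inc(), and an HTTP handler would serve r.Text() for scraping. An operator such as tf-operator would only need to call NewJobMetrics once with its own prefix.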

As the common project is still under active development, some of the details discussed above may change later. Comments would be much appreciated, @jlewi @richardsliu @gaocegege @jian-he.

ywskycn avatar Apr 23 '19 00:04 ywskycn

Issue-Label Bot is automatically applying the label feature_request to this issue, with a confidence of 0.93. Please mark this comment with :thumbsup: or :thumbsdown: to give our bot feedback!

issue-label-bot[bot] avatar Apr 23 '19 00:04 issue-label-bot[bot]

/cc @terrytangyuan

The feature LGTM.

gaocegege avatar Apr 23 '19 01:04 gaocegege

lgtm, +1

jian-he avatar Apr 23 '19 01:04 jian-he

Sounds great to me. This would be a good way to standardize metrics collection. We could also expose some utility methods that operators can use to collect operator-specific custom metrics, which leads to shared best practices and standards across operators.

terrytangyuan avatar Apr 23 '19 01:04 terrytangyuan

Sounds great to me.

/cc @jlewi

richardsliu avatar Apr 23 '19 06:04 richardsliu

Great, LGTM. One problem that I see is the limited information available in the Job Controller. If we design the common interfaces well, this is possible.

johnugeorge avatar Apr 23 '19 06:04 johnugeorge

> One problem that I see is the limited information in Job Controller. If we design the common interfaces well, this is possible.

Sure. kubebuilder supports this feature, so I think we can also implement it in the common operator if we design it well.

gaocegege avatar Apr 23 '19 06:04 gaocegege

LGTM, this looks so good.

merlintang avatar May 14 '19 06:05 merlintang

Any progress on this issue?

yeya24 avatar Oct 18 '19 04:10 yeya24

@yeya24 AFAIK, there is no one working on it now.

gaocegege avatar Oct 18 '19 04:10 gaocegege

Hi all, I added a detailed outline of the Prometheus metrics we plan to cover in the common operator in https://github.com/kubeflow/common/pull/77. Please take a look; any feedback would be appreciated.

terrytangyuan avatar Apr 23 '20 20:04 terrytangyuan