control-service: Introduce job termination status counter

Open doks5 opened this issue 3 years ago • 1 comments

Currently, we expose "gauge" metrics for data job termination statuses, which we can use to monitor the operability of data jobs deployed in Kubernetes clusters. This works well for simple monitoring, when we are looking at the current execution status of a job or how it changes over time.

However, if we want to aggregate data across all deployed data jobs, for example to check what percentage of all job executions fail with a user or platform error, things get complicated.

This change introduces "counter" metrics that count how many times each termination status has been observed for each data job, to help in situations where a high-level picture of data job executions is needed.
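To illustrate the difference, here is a minimal sketch in plain Python. It is not the control-service implementation: the class, the method names, and the status labels (`success`, `user_error`, `platform_error`) are hypothetical, chosen only to show why a per-status counter makes the cross-job percentage easy to compute while a gauge holding only the latest status does not.

```python
from collections import Counter

# Hypothetical status labels; the real metric and label names are not given in this PR.
SUCCESS, USER_ERROR, PLATFORM_ERROR = "success", "user_error", "platform_error"


class TerminationMetrics:
    """Keeps a gauge-like last status and an ever-increasing counter per job."""

    def __init__(self):
        self.last_status = {}            # gauge-style: job -> most recent status
        self.status_counts = Counter()   # counter-style: (job, status) -> observations

    def observe(self, job, status):
        # The gauge overwrites the previous value; the counter only grows.
        self.last_status[job] = status
        self.status_counts[(job, status)] += 1

    def share_of_executions(self, status):
        """Fraction of all recorded executions, across all jobs, that ended in `status`."""
        total = sum(self.status_counts.values())
        if total == 0:
            return 0.0
        matching = sum(n for (_, s), n in self.status_counts.items() if s == status)
        return matching / total


metrics = TerminationMetrics()
for job, status in [
    ("job-a", SUCCESS), ("job-a", USER_ERROR), ("job-a", SUCCESS),
    ("job-b", PLATFORM_ERROR), ("job-b", SUCCESS),
]:
    metrics.observe(job, status)

user_error_share = metrics.share_of_executions(USER_ERROR)
```

In this sketch the gauge-style view ends with both jobs reporting `success`, so the earlier failures are no longer visible, while the counters still record that one of the five executions ended with a user error and one with a platform error.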

Testing Done: Unit tests and existing tests.

Signed-off-by: Andon Andonov [email protected]

doks5, Jul 19 '22 07:07

Do we keep documentation that lists or summarizes the metrics?

Yes, please let's update https://github.com/vmware/versatile-data-kit/tree/main/projects/control-service/projects/helm_charts/pipelines-control-service#metrics

antoniivanov, Aug 08 '22 18:08

The change got corrupted after a rebase. Closing the PR as won't fix.

doks5, Sep 13 '22 08:09