versatile-data-kit
                        control-service: Introduce job termination status counter
Currently, we expose "gauge" metrics for data job termination statuses, which we can then use to monitor the operability of data jobs deployed in Kubernetes clusters. This works fine for simple monitoring, when we are looking at the current execution status of a job or how it changes over time.
However, aggregating data across all deployed data jobs gets complicated. One example is checking what percentage of all job executions fail with a user or platform error.
This change introduces "counter" metrics that measure how many times each specific status has been observed for each data job, to help in situations where a high-level picture of data job executions is needed.
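To illustrate why counters make this kind of aggregation straightforward, here is a minimal standalone sketch in Python. It is not the control-service implementation: the class `TerminationStatusCounters`, the status labels, and the job names are all hypothetical and chosen only for the example. Each observed termination increments a count keyed by data job and status, so the share of executions with a given status across all jobs is a simple ratio of sums.

```python
from collections import Counter

# Hypothetical status labels; the names used by the control-service may differ.
SUCCESS = "success"
USER_ERROR = "user_error"
PLATFORM_ERROR = "platform_error"


class TerminationStatusCounters:
    """Monotonically increasing counts keyed by (data_job, status)."""

    def __init__(self):
        self._counts = Counter()

    def record(self, data_job, status):
        # A counter only ever goes up: one increment per observed termination.
        self._counts[(data_job, status)] += 1

    def total(self, status=None):
        # Sum across all data jobs, optionally restricted to one status.
        return sum(
            count
            for (_, observed), count in self._counts.items()
            if status is None or observed == status
        )

    def share(self, status):
        # Fraction of all recorded executions that ended with `status`.
        executions = self.total()
        return self.total(status) / executions if executions else 0.0


counters = TerminationStatusCounters()
for job, status in [
    ("job-a", SUCCESS),
    ("job-a", USER_ERROR),
    ("job-b", SUCCESS),
    ("job-b", PLATFORM_ERROR),
]:
    counters.record(job, status)

user_error_share = counters.share(USER_ERROR)
```

With a gauge, only the latest status per job is visible at any moment, so the same ratio would have to be reconstructed from a history of samples. With counters, the totals are already accumulated per job and status.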
Testing Done: Unit tests and existing tests.
Signed-off-by: Andon Andonov [email protected]
Do we keep documentation that lists or summarizes the metrics?
Yes, please let's update https://github.com/vmware/versatile-data-kit/tree/main/projects/control-service/projects/helm_charts/pipelines-control-service#metrics
Change got corrupted after rebase. Closing PR, as won't fix.