
[WORKFLOWS] - Investigate jobs-observability metrics endpoint

Open GeekSheikh opened this issue 2 years ago • 1 comments

The API endpoint below is available in ungated private preview. The goal is to analyze its output and determine whether the endpoint offers information that we cannot otherwise get, or whether it simplifies some of the work already being done.

$ curl -XGET \
-H 'Authorization: Bearer xxxxxxx' \
-H 'Accept: application/openmetrics-text' \
'https://<databricks-host>/api/2.0/jobs-observability/metrics'
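
For anyone who would rather script the call, here is a rough Python equivalent of the curl command above. It is only a sketch: `build_metrics_request` and `fetch_metrics` are hypothetical helper names, the host and token are placeholders, and the endpoint path and headers are copied from the curl command rather than from any documented client.

```python
import urllib.request


def build_metrics_request(host: str, token: str) -> urllib.request.Request:
    """Build (but do not send) the GET request shown in the curl command."""
    # Path and headers mirror the curl example; host and token are placeholders.
    url = f"https://{host}/api/2.0/jobs-observability/metrics"
    headers = {
        "Authorization": f"Bearer {token}",
        "Accept": "application/openmetrics-text",
    }
    return urllib.request.Request(url, headers=headers, method="GET")


def fetch_metrics(host: str, token: str, timeout: float = 30.0) -> str:
    """Send the request and return the raw OpenMetrics text body."""
    req = build_metrics_request(host, token)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

Splitting request construction from the network call keeps the header logic easy to check without a live workspace.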

GeekSheikh avatar Aug 16 '22 14:08 GeekSheikh

What is available from the endpoint, and what are the LOE and value of implementing it?

GeekSheikh avatar Oct 05 '22 18:10 GeekSheikh

As discussed before, the result from the above endpoint is not usable within our current Overwatch scope. Below is the output of this endpoint:

# TYPE run_failures counter
# HELP run_failures Number of run failures by failure reason
run_failures_total{workspace_id="2753962522174656",failure_reason="ClusterInvalidRequest"} 0.0
run_failures_total{workspace_id="2753962522174656",failure_reason="ClusterDoesNotExist"} 0.0
run_failures_total{workspace_id="2753962522174656",failure_reason="NoTaskDefined"} 0.0
run_failures_total{workspace_id="2753962522174656",failure_reason="RunException"} 0.0
run_failures_total{workspace_id="2753962522174656",failure_reason="NotebookPermissionError"} 0.0
run_failures_total{workspace_id="2753962522174656",failure_reason="Cancelled"} 0.0
run_failures_total{workspace_id="2753962522174656",failure_reason="JobUserPermissionError"} 0.0
run_failures_total{workspace_id="2753962522174656",failure_reason="MaxConcurrentRunsExceeded"} 0.0
run_failures_total{workspace_id="2753962522174656",failure_reason="DriverError"} 0.0
run_failures_total{workspace_id="2753962522174656",failure_reason="JobNotFound"} 0.0
run_failures_total{workspace_id="2753962522174656",failure_reason="ActiveRunLimitExceeded"} 0.0
run_failures_total{workspace_id="2753962522174656",failure_reason="InternalError"} 0.0
run_failures_total{workspace_id="2753962522174656",failure_reason="ClusterPermissionError"} 0.0
run_failures_total{workspace_id="2753962522174656",failure_reason="DbfsAccessError"} 0.0
run_failures_total{workspace_id="2753962522174656",failure_reason="LibraryInstallationError"} 0.0
run_failures_total{workspace_id="2753962522174656",failure_reason="ClusterError"} 0.0
run_failures_total{workspace_id="2753962522174656",failure_reason="WorkspaceNotFound"} 0.0
run_failures_total{workspace_id="2753962522174656",failure_reason="Unknown"} 0.0
run_failures_total{workspace_id="2753962522174656",failure_reason="InvalidRunConfiguration"} 0.0
run_failures_total{workspace_id="2753962522174656",failure_reason="JobTaskSkipped"} 0.0
run_failures_total{workspace_id="2753962522174656",failure_reason="UnauthorizedError"} 0.0
run_failures_total{workspace_id="2753962522174656",failure_reason="NotebookNotFound"} 0.0
run_failures_total{workspace_id="2753962522174656",failure_reason="ClusterFeatureDisabled"} 0.0
run_failures_total{workspace_id="2753962522174656",failure_reason="ClusterRequestLimitExceeded"} 0.0
run_failures_total{workspace_id="2753962522174656",failure_reason="JobExecutionTimedOutUserError"} 0.0
run_failures_total{workspace_id="2753962522174656",failure_reason="JobTaskFailed"} 0.0
run_failures_total{workspace_id="2753962522174656",failure_reason="RepositoryCheckoutFailed"} 0.0
# TYPE run_starts counter
# HELP run_starts Number of runs since midnight
run_starts_total{workspace_id="2753962522174656"} 0.0
# TYPE run_successes counter
# HELP run_successes Number of run successes
run_successes_total{workspace_id="2753962522174656"} 0.0
# EOF

As we can see, the above output is in raw text format and will not be useful for our Overwatch tables, so I don't think there is any value in adding this implementation.
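
For reference, the output above follows the line-oriented OpenMetrics text format, so if anyone wants to inspect it as rows, something like the following could work. This is a minimal sketch, not part of Overwatch: `Sample` and `parse_openmetrics` are hypothetical names, and it assumes only the simple `name{labels} value` shape seen in the output above (no timestamps or exemplars).

```python
import re
from typing import Dict, List, NamedTuple


class Sample(NamedTuple):
    name: str
    labels: Dict[str, str]
    value: float


# Matches `metric_name{optional labels} value`; lines with extra fields are skipped.
_SAMPLE_RE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
# Matches each `key="value"` pair inside the braces.
_LABEL_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_openmetrics(text: str) -> List[Sample]:
    """Turn OpenMetrics text into a list of (name, labels, value) samples."""
    samples: List[Sample] = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks and metadata lines (# TYPE, # HELP, # EOF).
        if not line or line.startswith("#"):
            continue
        match = _SAMPLE_RE.match(line)
        if match is None:
            continue  # Shape not covered by this sketch.
        name, raw_labels, value = match.groups()
        labels = dict(_LABEL_RE.findall(raw_labels or ""))
        samples.append(Sample(name, labels, float(value)))
    return samples
```

Each returned sample carries the metric name, a label dictionary (for example `workspace_id` and `failure_reason`), and the counter value.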

souravbaner-da avatar Apr 27 '23 13:04 souravbaner-da