overwatch
[WORKFLOWS] - Investigate jobs-observability metrics endpoint
The API endpoint below is available in ungated private preview. The goal is to analyze its output and determine whether the endpoint offers information that we cannot otherwise get, and/or whether it simplifies some of the work already being done.
$ curl -XGET \
-H 'Authorization: Bearer xxxxxxx' \
-H 'Accept: application/openmetrics-text' \
'https://<databricks-host>/api/2.0/jobs-observability/metrics'
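For anyone who would rather script the call than use curl, the same request can be sketched in Python with only the standard library. The function names `build_metrics_request` and `fetch_metrics` are hypothetical helpers, and the host and token are placeholders, as in the curl command above.

```python
import urllib.request


def build_metrics_request(host, token):
    """Build the GET request for the jobs-observability metrics endpoint.

    `host` and `token` are placeholders supplied by the caller.
    """
    url = f"https://{host}/api/2.0/jobs-observability/metrics"
    headers = {
        "Authorization": f"Bearer {token}",
        "Accept": "application/openmetrics-text",
    }
    return urllib.request.Request(url, headers=headers, method="GET")


def fetch_metrics(host, token, timeout=30):
    """Send the request and return the response body as text."""
    request = build_metrics_request(host, token)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Building the request is kept separate from sending it so that the URL and headers can be checked without network access.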
What is available and what is LOE / Value to implement
As discussed before, the result of the above endpoint is not usable in our current overwatch scope. Below is the output of this endpoint:
# TYPE run_failures counter
# HELP run_failures Number of run failures by failure reason
run_failures_total{workspace_id="2753962522174656",failure_reason="ClusterInvalidRequest"} 0.0
run_failures_total{workspace_id="2753962522174656",failure_reason="ClusterDoesNotExist"} 0.0
run_failures_total{workspace_id="2753962522174656",failure_reason="NoTaskDefined"} 0.0
run_failures_total{workspace_id="2753962522174656",failure_reason="RunException"} 0.0
run_failures_total{workspace_id="2753962522174656",failure_reason="NotebookPermissionError"} 0.0
run_failures_total{workspace_id="2753962522174656",failure_reason="Cancelled"} 0.0
run_failures_total{workspace_id="2753962522174656",failure_reason="JobUserPermissionError"} 0.0
run_failures_total{workspace_id="2753962522174656",failure_reason="MaxConcurrentRunsExceeded"} 0.0
run_failures_total{workspace_id="2753962522174656",failure_reason="DriverError"} 0.0
run_failures_total{workspace_id="2753962522174656",failure_reason="JobNotFound"} 0.0
run_failures_total{workspace_id="2753962522174656",failure_reason="ActiveRunLimitExceeded"} 0.0
run_failures_total{workspace_id="2753962522174656",failure_reason="InternalError"} 0.0
run_failures_total{workspace_id="2753962522174656",failure_reason="ClusterPermissionError"} 0.0
run_failures_total{workspace_id="2753962522174656",failure_reason="DbfsAccessError"} 0.0
run_failures_total{workspace_id="2753962522174656",failure_reason="LibraryInstallationError"} 0.0
run_failures_total{workspace_id="2753962522174656",failure_reason="ClusterError"} 0.0
run_failures_total{workspace_id="2753962522174656",failure_reason="WorkspaceNotFound"} 0.0
run_failures_total{workspace_id="2753962522174656",failure_reason="Unknown"} 0.0
run_failures_total{workspace_id="2753962522174656",failure_reason="InvalidRunConfiguration"} 0.0
run_failures_total{workspace_id="2753962522174656",failure_reason="JobTaskSkipped"} 0.0
run_failures_total{workspace_id="2753962522174656",failure_reason="UnauthorizedError"} 0.0
run_failures_total{workspace_id="2753962522174656",failure_reason="NotebookNotFound"} 0.0
run_failures_total{workspace_id="2753962522174656",failure_reason="ClusterFeatureDisabled"} 0.0
run_failures_total{workspace_id="2753962522174656",failure_reason="ClusterRequestLimitExceeded"} 0.0
run_failures_total{workspace_id="2753962522174656",failure_reason="JobExecutionTimedOutUserError"} 0.0
run_failures_total{workspace_id="2753962522174656",failure_reason="JobTaskFailed"} 0.0
run_failures_total{workspace_id="2753962522174656",failure_reason="RepositoryCheckoutFailed"} 0.0
# TYPE run_starts counter
# HELP run_starts Number of runs since midnight
run_starts_total{workspace_id="2753962522174656"} 0.0
# TYPE run_successes counter
# HELP run_successes Number of run successes
run_successes_total{workspace_id="2753962522174656"} 0.0
# EOF
As we can see, the above output is in raw text format and will not be useful for our overwatch tables, so I don't think there is any value in adding this implementation.
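For anyone who wants to inspect the output in tabular form while evaluating it, here is a minimal sketch of flattening the text into one row per sample. It handles only the simple `name{labels} value` lines shown above (no timestamps, exemplars, or escaped edge cases beyond basic quoting), and `parse_openmetrics` is a hypothetical helper, not part of overwatch or of any Databricks client.

```python
import re

# One sample line: metric name, optional {label="value",...} block, numeric value.
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
# One label pair inside the braces, e.g. workspace_id="123".
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_openmetrics(text):
    """Return a list of dicts, one per sample line; comment lines are skipped."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skips "# TYPE", "# HELP", and "# EOF"
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # ignore anything that is not a simple sample line
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        rows.append(
            {"metric": match.group("name"), "value": float(match.group("value")), **labels}
        )
    return rows


# A few lines copied from the endpoint output above.
SAMPLE = """# TYPE run_failures counter
# HELP run_failures Number of run failures by failure reason
run_failures_total{workspace_id="2753962522174656",failure_reason="DriverError"} 0.0
# TYPE run_starts counter
run_starts_total{workspace_id="2753962522174656"} 0.0
# EOF
"""

rows = parse_openmetrics(SAMPLE)
```

Each row carries the metric name, the value, and whatever labels the line had. The output contains only workspace-level counters, with no job or run identifiers.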