flyte
flyte_binary not exposing all propeller metrics
We have a Datadog agent scraping Prometheus metrics on k8s for annotated pods. After exposing the metrics port and adding the annotations, we can see the metrics in Datadog.
When trying to set up a dashboard based on the Grafana propeller dashboard, we found that the following metrics referenced in the Grafana JSON are not exposed by Flyte:
- flyte:propeller:all:free_workers_count
- flyte:propeller:all:round:abort_error[5m]
- flyte:propeller:all:round:system_error_unlabeled[5m]
- flyte:propeller:all:node:plugin:.*_failure_unlabeled
- flyte:propeller:all:node:plugin:.*_success_unlabeled
- flyte:propeller:all:round:raw_unlabeled_ms[5m]
- flyte:propeller:all:round:raw_ms[5m]
- flyte:propeller:all:round:panic_unlabeled[5m]
- flyte:propeller:all:collector:flyteworkflow
- flyte:propeller:all:metastore:cache_hit
- flyte:propeller:all:metastore:cache_miss
- flyte:propeller:all:metastore:head_failure_unlabeled
We can only see the following in datadog when we search for 'propeller':
- flyte_admin_admin_builder_flytepropeller_build_failures.count
- flyte_admin_admin_builder_flytepropeller_build_successes.count
- flyte_admin_admin_execution_manager_propeller_failures.count
These appear to be Flyte admin metrics, not propeller metrics.
Expected result: All flyte propeller metrics should be exposed via the metrics port.
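For anyone reproducing this, a rough way to confirm which of the dashboard metrics are missing is to save the raw output of the metrics endpoint and compare it against the expected names. The following is only a sketch: it assumes the endpoint serves the standard Prometheus text format, the helper names `exposed_metric_names` and `missing_metrics` are made up for illustration, and the sample text contains invented values rather than real Flyte output.

```python
import re

# Matches a Prometheus metric name at the start of a sample line.
NAME_RE = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")


def exposed_metric_names(exposition_text):
    """Collect metric names from Prometheus text-format output."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            parts = line.split()
            # "# TYPE <name> <type>" and "# HELP <name> <text>" carry the base name,
            # which matters for histograms that only emit _bucket/_sum/_count samples.
            if len(parts) >= 3 and parts[1] in ("TYPE", "HELP"):
                names.add(parts[2])
            continue
        match = NAME_RE.match(line)
        if match:
            names.add(match.group(0))
    return names


def missing_metrics(exposition_text, expected_patterns):
    """Return the expected patterns that match no exposed metric name."""
    names = exposed_metric_names(exposition_text)
    return [
        pattern
        for pattern in expected_patterns
        if not any(re.fullmatch(pattern, name) for name in names)
    ]


# Invented sample output, for illustration only.
SAMPLE = """\
# TYPE flyte:propeller:all:round:abort_error_unlabeled counter
flyte:propeller:all:round:abort_error_unlabeled 3
# TYPE flyte:propeller:all:free_workers_count gauge
flyte:propeller:all:free_workers_count 40
"""

# Names taken from the Grafana JSON, with the [5m] range selectors dropped.
EXPECTED = [
    "flyte:propeller:all:free_workers_count",
    "flyte:propeller:all:round:abort_error",
    "flyte:propeller:all:node:plugin:.*_failure_unlabeled",
]

print(missing_metrics(SAMPLE, EXPECTED))
```

Note that Datadog rewrites metric names (the colons do not survive, as the `flyte_admin_...` names above suggest), so this comparison is meant to run against the raw endpoint output rather than against names seen in Datadog.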
Thank you for opening your first issue here! 🛠
Hey there, is there any update on this? I recently updated to the latest Flyte-Binary chart and image, and still no metrics show up. @davidmirror-ops
@Sennuno no updates yet. I'll be working on this and will let you know once there's progress
Is the issue still valid? I just took a look at the metrics from just the sandbox and I am seeing some stuff
@wild-endeavor Some are still missing for me in flyte-binary:
- flyte:propeller:all:round:abort_error (there does exist flyte:propeller:all:round:abort_error_unlabeled)
- flyte:propeller:all:node:plugin:.*_failure_unlabeled (or anything with prefix flyte:propeller:all:node:plugin)
- flyte:propeller:all:node:user_error_duration_ms_count
- flyte:propeller:all:node:system_error_duration_ms_count
In addition, in the user dashboard, no workflows are listed, because the metric queried for the labels, "label_values(flyte:propeller:all:collector:flyteworkflow, wf)", does not always have a wf key (only domain, endpoint, instance, job=, namespace, pod, project, service). This breaks the dashboard during downtimes, since there are then no workflows to select.
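To check the missing wf key outside Grafana, one option is to list the series of that metric whose label set has no wf entry. This is a minimal sketch under the same assumption of Prometheus text-format output; `series_missing_label` is a hypothetical helper, and the sample lines and label values are invented. Labels such as instance and job are attached by Prometheus at scrape time, so they would not appear in the raw endpoint output.

```python
import re

# Metric name, optional {label set}, then whitespace before the value.
SAMPLE_RE = re.compile(r"([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s")
# One key="value" pair inside a label set (escaped quotes allowed in the value).
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def series_missing_label(exposition_text, metric, label):
    """Return the label dicts of `metric` series that lack `label`."""
    missing = []
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if not match or match.group(1) != metric:
            continue
        labels = dict(LABEL_RE.findall(match.group(2) or ""))
        if label not in labels:
            missing.append(labels)
    return missing


# Invented sample lines, for illustration only.
WF_SAMPLE = """\
flyte:propeller:all:collector:flyteworkflow{domain="development",project="flytesnacks"} 0
flyte:propeller:all:collector:flyteworkflow{domain="development",project="flytesnacks",wf="my_wf"} 2
"""

print(series_missing_label(WF_SAMPLE, "flyte:propeller:all:collector:flyteworkflow", "wf"))
```

If every series comes back without wf while no workflows are running, that would be consistent with the empty dropdown described above.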