
flyte_binary not exposing all of propeller metrics

Open rxnandakumar opened this issue 1 year ago • 5 comments

We have a Datadog agent scraping Prometheus metrics from annotated pods on k8s. After exposing the metrics port and adding the annotations, we can see the metrics in Datadog.

When trying to set up a dashboard based on the Grafana propeller dashboard, we found that the following metrics referenced in the Grafana JSON are not exposed by flyte:

  • flyte:propeller:all:free_workers_count
  • flyte:propeller:all:round:abort_error[5m]
  • flyte:propeller:all:round:system_error_unlabeled[5m]
  • flyte:propeller:all:node:plugin:.*_failure_unlabeled
  • flyte:propeller:all:node:plugin:.*_success_unlabeled
  • flyte:propeller:all:round:raw_unlabeled_ms[5m]
  • flyte:propeller:all:round:raw_ms[5m]
  • flyte:propeller:all:round:panic_unlabeled[5m]
  • flyte:propeller:all:collector:flyteworkflow
  • flyte:propeller:all:metastore:cache_hit
  • flyte:propeller:all:metastore:cache_miss
  • flyte:propeller:all:metastore:head_failure_unlabeled

We can only see the following in Datadog when we search for 'propeller':

  • flyte_admin_admin_builder_flytepropeller_build_failures.count
  • flyte_admin_admin_builder_flytepropeller_build_successes.count
  • flyte_admin_admin_execution_manager_propeller_failures.count

These seem to be flyte admin metrics, not propeller metrics.

Expected result: All flyte propeller metrics should be exposed via the metrics port.
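To confirm which of the dashboard metrics are actually missing at the source (rather than dropped somewhere in the Datadog pipeline), one option is to diff the names in the raw metrics output against the names the dashboard expects. The sketch below is only an illustration: it parses Prometheus text exposition output that you have already fetched from the metrics port, the helper names are hypothetical, and SAMPLE is made-up data rather than real flyte-binary output.

```python
import re

def exposed_metric_names(exposition_text):
    """Collect metric names from Prometheus text exposition output."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # A sample line starts with the metric name, e.g. name{label="v"} 1.0
        match = re.match(r"[a-zA-Z_:][a-zA-Z0-9_:]*", line)
        if match:
            names.add(match.group(0))
    return names

def missing_patterns(expected_patterns, exposition_text):
    """Return the expected name patterns that match no exposed metric."""
    names = exposed_metric_names(exposition_text)
    return [
        pattern
        for pattern in expected_patterns
        if not any(re.fullmatch(pattern, name) for name in names)
    ]

# A few of the names from the list above, written as regular expressions
# (range selectors such as [5m] are PromQL syntax, not part of the name).
EXPECTED = [
    r"flyte:propeller:all:free_workers_count",
    r"flyte:propeller:all:round:abort_error",
    r"flyte:propeller:all:node:plugin:.*_failure_unlabeled",
    r"flyte:propeller:all:collector:flyteworkflow",
]

# Made-up sample output, for illustration only.
SAMPLE = """\
# TYPE flyte:propeller:all:free_workers_count gauge
flyte:propeller:all:free_workers_count 4
flyte:propeller:all:round:abort_error_unlabeled 0
flyte:propeller:all:collector:flyteworkflow{domain="development",project="demo"} 2
"""

missing = missing_patterns(EXPECTED, SAMPLE)
for pattern in missing:
    print("not exposed:", pattern)
```

In practice SAMPLE would be replaced with the body returned by the metrics endpoint of the flyte-binary pod; the port and path depend on the deployment, so they are left out here.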

rxnandakumar avatar Jun 08 '23 14:06 rxnandakumar

Thank you for opening your first issue here! 🛠

welcome[bot] avatar Jun 08 '23 14:06 welcome[bot]

Hey there, is there any update on this? I recently updated to the latest flyte-binary chart and image, and the metrics still do not show up. @davidmirror-ops

Sennuno avatar Dec 13 '23 16:12 Sennuno

@Sennuno no updates yet. I'll be working on this and will let you know once there's progress.

davidmirror-ops avatar Dec 18 '23 13:12 davidmirror-ops

Is the issue still valid? I just took a look at the metrics from the sandbox alone and I am seeing some of them. [image]

wild-endeavor avatar Apr 16 '24 01:04 wild-endeavor

@wild-endeavor Some are still missing for me in flyte-binary:

  • flyte:propeller:all:round:abort_error (there does exist flyte:propeller:all:round:abort_error_unlabeled)
  • flyte:propeller:all:node:plugin:.*_failure_unlabeled (or anything with prefix flyte:propeller:all:node:plugin)
  • flyte:propeller:all:node:user_error_duration_ms_count
  • flyte:propeller:all:node:system_error_duration_ms_count

In addition, the user dashboard does not list any workflows. The query it uses to populate the selector, "label_values(flyte:propeller:all:collector:flyteworkflow, wf)", relies on a wf label that the metric does not always carry (sometimes only domain, endpoint, instance, job, namespace, pod, project, and service are present). This breaks the dashboard during downtime, since there are then no workflows to select.
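The label problem can be checked directly against the raw metrics output. The following is a rough sketch, not anything from the Flyte codebase: it approximates what a label_values(metric, label) lookup would see by collecting the values of one label across all series of a metric, and it counts the series that lack that label. The function name is hypothetical, and the sample series and the wf value format are invented for illustration.

```python
import re

# name{labels} value   or   name value
SAMPLE_LINE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+\S+')
# One label pair inside the braces, e.g. wf="some value"
LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')

def label_values(exposition_text, metric, label):
    """Return (values of `label` seen on `metric`, count of series lacking it)."""
    values = set()
    series_without_label = 0
    for line in exposition_text.splitlines():
        match = SAMPLE_LINE.match(line.strip())
        if not match or match.group(1) != metric:
            continue
        labels = dict(LABEL.findall(match.group(2) or ""))
        if label in labels:
            values.add(labels[label])
        else:
            series_without_label += 1
    return values, series_without_label

METRIC = "flyte:propeller:all:collector:flyteworkflow"

# Invented sample: one series with a wf label, one without.
BUSY = (
    METRIC + '{domain="development",project="demo",wf="demo:development:my_wf"} 1\n'
    + METRIC + '{domain="development",project="demo"} 0\n'
)
# Invented sample for an idle period: no series carries wf.
IDLE = METRIC + '{domain="development",project="demo"} 0\n'

print(label_values(BUSY, METRIC, "wf"))
print(label_values(IDLE, METRIC, "wf"))
```

If the second call returns an empty set for real output taken during an idle period, that would match the behaviour described above: the dashboard variable has nothing to offer in the dropdown.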

cjidboon94 avatar Apr 19 '24 18:04 cjidboon94