sdk-java icon indicating copy to clipboard operation
sdk-java copied to clipboard

Aling information provided by metrics temporal_activity_schedule_to_start_latency_seconds_bucket and temporal_workflow_task_schedule_to_start_latency_seconds_bucket

Open antmendoza opened this issue 3 years ago • 8 comments

Expected Behavior

reported here https://community.temporal.io/t/workflow-task-metrics-only-have-labels-with-the-workflow-type-none-for-the-java-sdk/5942

There is no alignment in the tags pupulated by both metrics (temporal_activity_schedule_to_start_latency_seconds_bucket and temporal_workflow_task_schedule_to_start_latency_seconds_bucket). Either both should populate their respective tags, or none of them should. That is:

  • activity_type tag for temporal_activity_schedule_to_start_latency_seconds_bucket . Right now this metric has both tags activity_type and workflow_type populated.

i.e.: temporal_activity_schedule_to_start_latency_seconds_bucket{activity_type="PerformB",exception="none",namespace="default",operation="none",query_type="none",signal_name="none",status_code="none",task_queue="metricsqueue",workerCustomTag1="workerCustomTag1Value",workerCustomTag2="workerCustomTag2Value",worker_type="ActivityWorker",workflow_type="MetricsWorkflow",le="0.001",} 0.0

  • workflow_type tag for temporal_workflow_task_schedule_to_start_latency_seconds_bucket . Right now this metric doesn't have the tag workflow_type populated.

i.e.: temporal_workflow_task_schedule_to_start_latency_seconds_bucket{activity_type="none",exception="none",namespace="default",operation="none",query_type="none",signal_name="none",status_code="none",task_queue="metricsqueue",workerCustomTag1="workerCustomTag1Value",workerCustomTag2="workerCustomTag2Value",worker_type="WorkflowWorker",workflow_type="none",le="0.001",} 0.0

The documentation should be fixed:

  • https://docs.temporal.io/references/sdk-metrics/#workflow_task_schedule_to_start_latency
  • https://docs.temporal.io/references/sdk-metrics/#activity_schedule_to_start_latency

Actual Behavior

Steps to Reproduce the Problem

  1. run https://github.com/temporalio/samples-java/tree/main/src/main/java/io/temporal/samples/metrics
  2. navigate to worker metrics
  3. search for temporal_workflow_task_schedule_to_start_latency_seconds_bucket and temporal_activity_schedule_to_start_latency_seconds_bucket

Specifications

  • Version: sdk-java 1.16.0
  • Platform:

antmendoza avatar Sep 12 '22 16:09 antmendoza

Hi, considering one worker may service multiple workflow and activities types I would rather have everything populated. This seem reproduces self in Java SDK where for Go it worked fine with all the tags present (last time I checked)

Thanks, A

andrey-dubnik avatar Sep 12 '22 17:09 andrey-dubnik

This metric doesn't apply to workflow type (and activity type). "Schedule to start" is a metric of a poller. Poller works on a task queue level. Poller doesn't know about workflow or activity types and doesn't discriminate by them. So, this metric makes sense only for the whole task queue, not per workflow/activity types. Populating this field only creates more metrics with fewer samples for each of them, so damaging the sample size and blurring the meaning of the metric. "Type" tag on this metric is now just artificial for an underlying abstraction, but also detrimental to the usability of the metric.

We will discuss internally if we should just easy-fix this and populate workflow_type in Java. Or, actually, workflow_type and activity_type should be removed from this metric on all SDKs.

Spikhalskiy avatar Sep 12 '22 17:09 Spikhalskiy

Please also review https://docs.temporal.io/references/sdk-metrics/#workflow_task_queue_poll_succeed , the documentation mention that the tag workflow_type should be present but it isn't

temporal_workflow_task_queue_poll_succeed_total{activity_type="none",exception="none",namespace="default",operation="none",query_type="none",signal_name="none",status_code="none",task_queue="sticky",workerCustomTag1="workerCustomTag1Value",workerCustomTag2="workerCustomTag2Value",worker_type="none",workflow_type="none",} 3.0

antmendoza avatar Sep 12 '22 17:09 antmendoza

adding a suggestion per the conversation in the community forum

It is not clear from the documentation when the metric is worker/poller specific or a workflow/activity aware as there is a group of metrics specific to the worker which does not apply to the workflow/activity. It would be great if documentation enabled the simplified grouping of such metrics (e.g. additional tag) or had a separate section related to the poller/worker.

andrey-dubnik avatar Sep 13 '22 11:09 andrey-dubnik

We have also found that in Java SDK workflow_task_queue_poll_succeed counter doesn't have a workflow_type label. It is present in Go SDK and can be easily added in Java, unless there is some performance consideration.

https://github.com/temporalio/sdk-java/blob/15337d9a7cfd12ca1d3096034e017a82f167ea45/temporal-sdk/src/main/java/io/temporal/internal/worker/WorkflowPollTask.java#L172

See also response.getWorkflowType() a few lines above.

fitialovks avatar Jul 07 '23 14:07 fitialovks

There is no performance consideration. The reason it is not included is because this metric doesn't apply to workflow type because pollers works on a task queue level

Quinn-With-Two-Ns avatar Jul 07 '23 15:07 Quinn-With-Two-Ns

Would this be a correct summary of the workflow_task_queue_poll_succeed counter state?

  1. Java SDK does not have a workflow_type label because it's irrelevant for the poller.
  2. Go SDK and the dashboard example keep the label for backwards compatibility.

Here is the suggested dashboard https://github.com/temporalio/dashboards/blob/fe62b4e4567951497c511696b527b9037ef0a04f/sdk/sdk-general.json#L834

While irrelevant for the poller, the workflow type may be used for diagnostics. This seems to be the only metric marking a workflow task execution start. In case they are starting and not finishing (no success or failure events) I'd rather know which type it is. Current behavior may force us to use a queue per workflow just to get better metrics.

fitialovks avatar Jul 07 '23 15:07 fitialovks

Spoke with the team the conclusion was workflow_type label is only supported on the Go SDK and should be removed from the documentation since it is not relevant to pollers. I will open an issue with docs to remove.

@fitialovks for your use case you I think the best solution is a server side metric for workflow task timeout https://github.com/temporalio/temporal/issues/2526

Quinn-With-Two-Ns avatar Jul 07 '23 21:07 Quinn-With-Two-Ns