
Visibility data can become inconsistent

Open aromanovich opened this issue 1 year ago

Expected Behavior

Visibility data is eventually consistent.

Actual Behavior

In some of our self-hosted clusters we observed occasional inconsistency between the primary and the advanced visibility storages, e.g. completed workflows being listed as running in the ListWorkflowExecutions response until they're deleted by the retention policy. I believe the same issue has been reported by other Temporal users on the community forum and in Slack.
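
The stale rows can be spotted from the outside by comparing what ListWorkflowExecutions returns with what DescribeWorkflowExecution reports for the same run. A rough sketch against the Go SDK (the host, namespace, and single-page listing are simplifications):

package main

import (
	"context"
	"log"

	enumspb "go.temporal.io/api/enums/v1"
	"go.temporal.io/api/workflowservice/v1"
	"go.temporal.io/sdk/client"
)

func main() {
	// Connection details are placeholders.
	c, err := client.Dial(client.Options{HostPort: "localhost:7233", Namespace: "default"})
	if err != nil {
		log.Fatal(err)
	}
	defer c.Close()

	ctx := context.Background()
	// Ask visibility for everything it still believes is running (first page only).
	resp, err := c.ListWorkflow(ctx, &workflowservice.ListWorkflowExecutionsRequest{
		Query: `ExecutionStatus = "Running"`,
	})
	if err != nil {
		log.Fatal(err)
	}
	for _, info := range resp.Executions {
		// Describe reads the primary (mutable state) store, which has the true status.
		desc, err := c.DescribeWorkflowExecution(ctx, info.Execution.WorkflowId, info.Execution.RunId)
		if err != nil {
			log.Fatal(err)
		}
		if actual := desc.WorkflowExecutionInfo.Status; actual != enumspb.WORKFLOW_EXECUTION_STATUS_RUNNING {
			log.Printf("stale visibility row: %s/%s is actually %s",
				info.Execution.WorkflowId, info.Execution.RunId, actual)
		}
	}
}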

Steps to Reproduce the Problem

As far as I understand, the immediate queue implementation and its FIFO scheduler do not guarantee any particular order of task execution. A queue reader selects tasks in batches, and the scheduler dispatches them among multiple workers (512 by default for the visibility processor). If a single task's processing gets delayed (for any reason: the database or the workflow context being busy, for example), tasks that come after it in the queue can be executed and completed in the meantime.
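
To make this concrete, here is a toy model of FIFO dispatch onto a worker pool (not the server's actual scheduler code; the task names are borrowed just for illustration). The tasks enter the queue in order, but because the first one stalls inside a worker, the second one completes first:

package main

import (
	"fmt"
	"sync"
	"time"
)

// A toy task: the delay stands in for a slow persistence call or a busy workflow context.
type task struct {
	name  string
	delay time.Duration
}

func main() {
	tasks := []task{
		{name: "UpsertExecutionVisibilityTask", delay: time.Second}, // delayed for any reason
		{name: "CloseExecutionVisibilityTask", delay: 0},
	}

	queue := make(chan task, len(tasks))
	for _, t := range tasks {
		queue <- t // enqueued strictly in FIFO order
	}
	close(queue)

	var wg sync.WaitGroup
	for w := 0; w < 2; w++ { // a pool of workers, like the visibility processor's 512
		wg.Add(1)
		go func() {
			defer wg.Done()
			for t := range queue {
				time.Sleep(t.delay)
				fmt.Println("completed:", t.name) // completion order is not FIFO
			}
		}()
	}
	wg.Wait()
	// Prints CloseExecutionVisibilityTask before UpsertExecutionVisibilityTask.
}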

For a single execution, transfer tasks are serialized by mutable state conditions and database concurrency control mechanisms; but visibility tasks seem to allow arbitrary order of execution, and the visibility storage doesn't implement any concurrency control.
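
For contrast, here is a sketch of the kind of guard that is missing: if visibility writes carried a monotonically increasing task ID and the store dropped older writes, a late upsert could not clobber the row written by the close task. This is illustrative Go only, not Temporal's code, and the types are made up:

package main

import (
	"fmt"
	"sync"
)

// visibilityRecord and guardedStore are made-up types for illustration.
type visibilityRecord struct {
	status     string
	lastTaskID int64
}

type guardedStore struct {
	mu      sync.Mutex
	records map[string]visibilityRecord // keyed by run ID
}

// apply is a last-writer-wins update guarded by task ID: writes older than
// what is already stored are dropped instead of overwriting newer data.
func (s *guardedStore) apply(runID string, taskID int64, status string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	cur, ok := s.records[runID]
	if ok && taskID <= cur.lastTaskID {
		return false // stale task, ignore
	}
	s.records[runID] = visibilityRecord{status: status, lastTaskID: taskID}
	return true
}

func main() {
	s := &guardedStore{records: map[string]visibilityRecord{}}
	// The close task (task ID 11) lands before the delayed upsert (task ID 10).
	fmt.Println(s.apply("run-1", 11, "Completed")) // true
	fmt.Println(s.apply("run-1", 10, "Running"))   // false: the late upsert is dropped
	fmt.Println(s.records["run-1"].status)         // Completed
}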


https://github.com/temporalio/temporal/compare/main...aromanovich:temporal:visibility-race shows how this becomes an issue: go test . -persistenceType=sql -persistenceDriver=postgres12 -run 'AdvancedVisibilitySuite/TestVisibilityRace' -v fails when the SLOWDOWN_VISIBILITY_UPSERT=1 env var is set and succeeds otherwise.

The test is very simple: it starts an execution, changes its search attributes, and terminates it without any delays; then it waits for some time and checks that the visibility data is consistent with the workflow state. SLOWDOWN_VISIBILITY_UPSERT=1 slows down UpsertExecutionVisibilityTask by one second, imitating an occasional persistence delay. This delay causes UpsertExecutionVisibilityTask to complete after CloseExecutionVisibilityTask, which leaves the visibility data in an inconsistent state:

2024-04-02T00:19:01.581+0200	info	Sleeping for 1 second	{"shard-id": 2, "address": "127.0.0.1:7231", "component": "visibility-queue-processor", "logging-call-at": "visibility_queue_task_executor.go:237"}
2024-04-02T00:19:01.602+0200	info	processCloseExecution	{"shard-id": 2, "address": "127.0.0.1:7231", "component": "visibility-queue-processor", "startedAt": "2024-04-01T22:19:01.600Z", "finishedAt": "2024-04-01T22:19:01.602Z", "logging-call-at": "visibility_queue_task_executor.go:256"}
2024-04-02T00:19:02.581+0200	info	Finished sleeping for 1 second	{"shard-id": 2, "address": "127.0.0.1:7231", "component": "visibility-queue-processor", "logging-call-at": "visibility_queue_task_executor.go:239"}
2024-04-02T00:19:02.585+0200	info	processUpsertExecution	{"shard-id": 2, "address": "127.0.0.1:7231", "component": "visibility-queue-processor", "startedAt": "2024-04-01T22:19:01.580Z", "finishedAt": "2024-04-01T22:19:02.585Z", "logging-call-at": "visibility_queue_task_executor.go:200"}
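
For reference, the scenario boils down to roughly the following, sketched here against the Go SDK rather than the server's functional-test harness (the task queue name, search attribute, wait time, and a worker registering RaceWorkflow are assumptions):

package main

import (
	"context"
	"fmt"
	"time"

	enumspb "go.temporal.io/api/enums/v1"
	"go.temporal.io/api/workflowservice/v1"
	"go.temporal.io/sdk/client"
	"go.temporal.io/sdk/workflow"
)

func RaceWorkflow(ctx workflow.Context) error {
	// Generates an UpsertExecutionVisibilityTask shortly before the execution is closed.
	if err := workflow.UpsertSearchAttributes(ctx, map[string]interface{}{
		"CustomKeywordField": "updated", // assumed to be a registered search attribute
	}); err != nil {
		return err
	}
	// Block until the execution is terminated from the outside.
	return workflow.Await(ctx, func() bool { return false })
}

func checkVisibilityConsistency(ctx context.Context, c client.Client) error {
	run, err := c.ExecuteWorkflow(ctx, client.StartWorkflowOptions{TaskQueue: "race-tq"}, RaceWorkflow)
	if err != nil {
		return err
	}
	// Terminate right away so the upsert and close visibility tasks land close together.
	if err := c.TerminateWorkflow(ctx, run.GetID(), run.GetRunID(), "race test"); err != nil {
		return err
	}
	time.Sleep(10 * time.Second) // give the visibility queue time to settle

	resp, err := c.ListWorkflow(ctx, &workflowservice.ListWorkflowExecutionsRequest{
		Query: fmt.Sprintf(`WorkflowId = "%s"`, run.GetID()),
	})
	if err != nil {
		return err
	}
	for _, e := range resp.Executions {
		if e.Status == enumspb.WORKFLOW_EXECUTION_STATUS_RUNNING {
			return fmt.Errorf("visibility still reports %s as Running", run.GetID())
		}
	}
	return nil
}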

Specifications

  • Version: 1.22.5
  • Platform: All

aromanovich avatar Apr 02 '24 10:04 aromanovich

Just wanted to say I have seen this exact behavior as well. It was very confusing, and it made me worry that database updates may cause further issues like this when they involve small amounts of downtime.

mattbrandman avatar Oct 18 '24 20:10 mattbrandman

We are experiencing this same issue with a MySQL 8 backend for both the primary and visibility DBs, and would love to see it addressed, as the Web UI dashboards become cluttered with incorrect information. When clicking on a workflow, the correct status is shown on the details page.

dcaputo-harmoni avatar Nov 30 '24 04:11 dcaputo-harmoni

For anyone who is using MySQL for visibility and would like a band-aid fix for this problem, simply run:

UPDATE temporal_visibility.executions_visibility v
JOIN temporal.current_executions t ON v.workflow_id = t.workflow_id AND v.run_id = BIN_TO_UUID(t.run_id)
SET v.status = t.status, v.close_time = NOW(), v.execution_duration = CAST(TIMESTAMPDIFF(SECOND, v.start_time, NOW()) * 1000000000 AS SIGNED)
WHERE v.status = 1 AND v.status <> t.status;

This will bring your visibility workflow statuses back into sync with actual workflow statuses.

Or if you'd like to simply see the out-of-sync visibility workflows, you can use this query:

SELECT v.workflow_id, v.status AS v_status, t.status AS t_status
FROM temporal_visibility.executions_visibility v
JOIN temporal.current_executions t ON v.workflow_id = t.workflow_id AND v.run_id = BIN_TO_UUID(t.run_id)
WHERE v.status = 1 AND v.status <> t.status;

@jmbarzee let me know if I can help on the permanent fix in any way, as this is not a real solution to the problem.

dcaputo-harmoni avatar Dec 03 '24 17:12 dcaputo-harmoni

Data point: we have the same issue with postgres12. The issue popped up after upgrading from 1.21 to 1.24.

aleskovets avatar Dec 06 '24 23:12 aleskovets

I have the same issue on Temporal Server 1.26 with the primary database on Cassandra and visibility on Postgres 16. It looks like the problem is still alive.

genek96 avatar Jul 03 '25 05:07 genek96

I've recently faced this on 1.28.0 with Postgres 16 and strongly synchronized replicas. It happens very rarely, but it still happens.

shilenkoalexander avatar Aug 11 '25 09:08 shilenkoalexander