
Panic when querying during Workflow failure + retry

Open leowmjw opened this issue 4 years ago • 9 comments

Expected Behavior

No panic should occur when querying a workflow with a registered QueryHandler, even when the workflow has failed and is retrying.

Actual Behavior

After the maximum attempts for the Activity have been exhausted, the whole Workflow fails and is restarted.

Observe the panic "Attempt to generate a command before processing WorkflowTaskStarted event", due to "WorkerType WorkflowWorker Error operation GetWorkflowExecution encounter not found".

Logs

$ go run main.go ... 2021/06/21 23:31:45 ERROR Activity error. Namespace default TaskQueue onboarding.queue.name WorkerID [email protected]@ WorkflowID mleow-1 RunID a7f324da-ccf1-4f1a-97d3-f43df8af9260 ActivityType FailActivities Attempt 2 Error FAIL mleow-1 (type: TEMP, retryable: true) WID: mleow-1 CTX_ATTEMPT: 2

2021/06/21 23:31:45 ERROR Workflow panic Namespace default TaskQueue onboarding.queue.name WorkerID [email protected]@ WorkflowType RetryWorkflow WorkflowID mleow-1 RunID 07917e48-4dcf-447f-8e0c-9c55674051c5 Attempt 1 Error Attempt to generate a command before processing WorkflowTaskStarted event StackTrace coroutine root [panic]:

go.temporal.io/sdk/internal.(*commandsHelper).getNextID(...)
        /Users/leow/go/pkg/mod/go.temporal.io/sdk@v1.7.0/internal/internal_decision_state_machine.go:824
go.temporal.io/sdk/internal.(*workflowEnvironmentImpl).GenerateSequence(...)
        /Users/leow/go/pkg/mod/go.temporal.io/sdk@v1.7.0/internal/internal_event_handlers.go:453
go.temporal.io/sdk/internal.(*workflowEnvironmentImpl).ExecuteActivity(0xc00018d560, 0x0, 0x0, 0x1a58fa3, 0x15, 0x0, 0x0, 0xbebc200, 0x0, 0x0, ...)
        /Users/leow/go/pkg/mod/go.temporal.io/sdk@v1.7.0/internal/internal_event_handlers.go:464 +0x696
go.temporal.io/sdk/internal.(*workflowEnvironmentInterceptor).ExecuteActivity(0xc000180c30, 0x1b5d9f8, 0xc0004b61e0, 0x1ca2c81, 0xe, 0xc000182950, 0x1, 0x1, 0x0, 0x0)
        /Users/leow/go/pkg/mod/go.temporal.io/sdk@v1.7.0/internal/workflow.go:491 +0x6bd
go.temporal.io/sdk/internal.ExecuteActivity(0x1b5d9f8, 0xc0004b61e0, 0x1929b60, 0x1a933b0, 0xc000182950, 0x1, 0x1, 0x0, 0x0)
        /Users/leow/go/pkg/mod/go.temporal.io/sdk@v1.7.0/internal/workflow.go:440 +0x128
go.temporal.io/sdk/workflow.ExecuteActivity(...)
        /Users/leow/go/pkg/mod/go.temporal.io/sdk@v1.7.0/workflow/workflow.go:113
app/bug-wf-retry.RetryWorkflow(0x1b5d870, 0xc0001b6080, 0xc000410a20, 0x7, 0x0, 0x0)
        /Users/leow/GOMOD/testcase-temporal/bug-wf-retry/workflow.go:42 +0x487
reflect.Value.call(0x19351e0, 0x1a933b8, 0x13, 0x1a4a44d, 0x4, 0xc0004b6150, 0x2, 0x2, 0x2, 0x18, ...)
        /usr/local/Cellar/go/1.16.5/libexec/src/reflect/value.go:476 +0x8e7
reflect.Value.Call(0x19351e0, 0x1a933b8, 0x13, 0xc0004b6150, 0x2, 0x2, 0x1, 0x2, 0xc000180c30)
        /usr/local/Cellar/go/1.16.5/libexec/src/reflect/value.go:337 +0xb9
go.temporal.io/sdk/internal.(*workflowEnvironmentInterceptor).ExecuteWorkflow(0xc000180c30, 0x1b5d870, 0xc0001b6080, 0xc0000c0c80, 0xd, 0xc0001828f0, 0x1, 0x1, 0x0, 0x0, ...)
        /Users/leow/go/pkg/mod/go.temporal.io/sdk@v1.7.0/internal/workflow.go:398 +0x2cb
go.temporal.io/sdk/internal.(*workflowExecutor).Execute(0xc0001b6000, 0x1b5d870, 0xc0001b6080, 0xc000367d10, 0xc0004e5738, 0x1889225, 0x0)
        /Users/leow/go/pkg/mod/go.temporal.io/sdk@v1.7.0/internal/internal_worker.go:740 +0x35a
go.temporal.io/sdk/internal.(*syncWorkflowDefinition).Execute.func1(0x1b5d9f8, 0xc0004b6120)
        /Users/leow/go/pkg/mod/go.temporal.io/sdk@v1.7.0/internal/internal_workflow.go:494 +0xf5

2021/06/21 23:31:45 WARN Failed to process workflow task. Namespace default TaskQueue onboarding.queue.name WorkerID [email protected]@ WorkflowType RetryWorkflow WorkflowID mleow-1 RunID 07917e48-4dcf-447f-8e0c-9c55674051c5 Attempt 1 Error Attempt to generate a command before processing WorkflowTaskStarted event
2021/06/21 23:31:45 INFO Task processing failed with error Namespace default TaskQueue onboarding.queue.name WorkerID [email protected]@ WorkerType WorkflowWorker Error operation GetWorkflowExecution encounter not found

==================

Steps to Reproduce the Problem

  1. A Workflow runs an Activity with MaximumAttempts = 2
  2. Once the maximum attempts are reached, the Activity fails, which causes the Workflow itself to fail and restart
  3. While this is happening, a query is running concurrently

Code Snippet

Query

// Loop and query continuously ..
time.Sleep(time.Second / 2)
res, qerr := c.QueryWorkflow(ctx,
	fmt.Sprintf("mleow-%d", i), "",
	"CORE/current_state")
if qerr != nil {
	...
	continue
}
err := res.Get(&status)
if err != nil {
	fmt.Println("GET_ERR:", err.Error())
}
fmt.Println("STATUS: ", status)

======

Worker

...
// Start workflow
wid := fmt.Sprintf("mleow-%d", id)
wfo := client.StartWorkflowOptions{
	ID:                       wid,
	TaskQueue:                onboarding_patient.WorkflowQueue,
	WorkflowExecutionTimeout: 2 * time.Minute,
	WorkflowRunTimeout:       time.Minute,
	WorkflowTaskTimeout:      time.Second,
	RetryPolicy: &temporal.RetryPolicy{
		InitialInterval: time.Second,
		MaximumAttempts: 2,
	},
}
wfr, err := c.ExecuteWorkflow(context.Background(), wfo,
	bug_wf_retry.RetryWorkflow,
	wid,
)
if err != nil {
	panic(err)
}
fmt.Println("WFIF: ", wfr.GetID(), "RID: ", wfr.GetRunID())
...

=======

Workflow + Activity

func RetryWorkflow(ctx workflow.Context, wid string) error {
	fmt.Println("WID: ", wid, " CTX_ATTEMPT: ", workflow.GetInfo(ctx).Attempt)
	// Status to keep track; we return as we go along
	status := "Starting RetryWorkflow ..."
	herr := workflow.SetQueryHandler(ctx, "CORE/current_state", func() (string, error) {
		// Once completed/failed goes into replay state
		if workflow.IsReplaying(ctx) {
			fmt.Println("REPLAYING ...")
		}
		return status, nil
	})
	if herr != nil {
		status = "Failed to set QueryHandler!"
		return herr
	}
	// Activities
	ao := workflow.ActivityOptions{
		TaskQueue:           onboarding_patient.WorkflowQueue,
		StartToCloseTimeout: time.Millisecond * 200,
		RetryPolicy: &temporal.RetryPolicy{
			InitialInterval: time.Second,
			MaximumAttempts: 2,
		},
	}
	ctx = workflow.WithActivityOptions(ctx, ao)
	f := workflow.ExecuteActivity(ctx, FailActivities, wid)
	// Wait for the activity; its error is intentionally ignored, since the
	// workflow returns an error below to force a workflow-level retry.
	_ = f.Get(ctx, nil)
	status = "After failed activity .."

	return fmt.Errorf("UNKNOWN!")
}

func FailActivities(wid string) error {
	fmt.Println("WID: " + wid + " FAIL :( :(")
	return temporal.NewApplicationError(
		fmt.Sprintf("FAIL %s", wid), "TEMP",
	)
}
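
For completeness, a minimal worker-registration sketch (not part of the original report; it assumes RetryWorkflow and FailActivities live in the same package, and takes the task queue name from the logs above, where it is referenced as onboarding_patient.WorkflowQueue):

package main

import (
	"log"

	"go.temporal.io/sdk/client"
	"go.temporal.io/sdk/worker"
)

func main() {
	// Connection options are assumed (local docker-compose defaults).
	c, err := client.NewClient(client.Options{})
	if err != nil {
		log.Fatalln("unable to create Temporal client:", err)
	}
	defer c.Close()

	// Task queue name taken from the logs in this report.
	w := worker.New(c, "onboarding.queue.name", worker.Options{})
	w.RegisterWorkflow(RetryWorkflow)
	w.RegisterActivity(FailActivities)

	if err := w.Run(worker.InterruptCh()); err != nil {
		log.Fatalln("unable to start worker:", err)
	}
}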

==============

Specifications

  • Version: SDK v1.7.0
  • Platform: Linux, docker-compose

leowmjw avatar Jun 21 '21 16:06 leowmjw

Are you seeing this error persistently (e.g. workflow is getting blocked) or is it cleared after the retry?

vitarb avatar Jun 23 '21 21:06 vitarb

@vitarb The workflow is not blocked and the retry happens as expected, but this is just a toy case to reproduce the issue.

I submitted it as requested by @samarabbas, just to make sure this is not an edge case that might affect production or more complex scenarios.

leowmjw avatar Jun 24 '21 10:06 leowmjw

I have replicated this. When a query is sent to a workflow that has not yet started (which can happen during a retry, a cron run, etc.), the polled task contains a history with only WorkflowExecutionStarted plus the query, but the SDK expects queries to be handled only after WorkflowTaskStarted. I am checking with @mfateev, @yiminc, @wxing1292, and others to determine the possible route to take here.
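
In the meantime, one possible client-side mitigation is to treat a query failure that happens while the workflow is between retry attempts as transient and retry it from the caller. This is only a sketch, not an official fix; the helper name, the 5-attempt limit, and the 500 ms backoff are illustrative assumptions, and the workflow ID and query name are taken from the repro above:

package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"go.temporal.io/sdk/client"
)

// queryWithRetry retries QueryWorkflow a few times, treating a failure that
// happens while the workflow is waiting to (re)start as transient.
func queryWithRetry(ctx context.Context, c client.Client, workflowID, queryType string) (string, error) {
	var lastErr error
	for attempt := 0; attempt < 5; attempt++ {
		val, err := c.QueryWorkflow(ctx, workflowID, "", queryType)
		if err == nil {
			var status string
			if err := val.Get(&status); err != nil {
				return "", err
			}
			return status, nil
		}
		// Assumption: the failure is transient because the workflow is between
		// retry attempts; back off and try again.
		lastErr = err
		time.Sleep(500 * time.Millisecond)
	}
	return "", fmt.Errorf("query %q on workflow %q failed: %w", queryType, workflowID, lastErr)
}

func main() {
	c, err := client.NewClient(client.Options{})
	if err != nil {
		log.Fatalln("unable to create Temporal client:", err)
	}
	defer c.Close()

	status, err := queryWithRetry(context.Background(), c, "mleow-1", "CORE/current_state")
	if err != nil {
		log.Fatalln("query failed:", err)
	}
	fmt.Println("STATUS:", status)
}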

cretz avatar Dec 14 '21 14:12 cretz

I have also experienced a similar panic. It was for a cron workflow that spawns parallel activity executions and then waits for all of them to finish. It looks like the workflow is able to retry successfully later. Here is a sample; I removed two service-specific frames from this backtrace for privacy:

external/io_temporal_go_sdk/internal/internal_workflow.go:499 +0xd2
go.temporal.io/sdk/internal.(*syncWorkflowDefinition).Execute.func1({0x2745f38, 0xc000439bc0})
	external/io_temporal_go_sdk/internal/internal_worker.go:741 +0x292
go.temporal.io/sdk/internal.(*workflowExecutor).Execute(0xc0034d6580, {0x2745de8, 0xc0034d6600}, 0x25)
	external/io_temporal_go_sdk/internal/workflow.go:415 +0x166
go.temporal.io/sdk/internal.(*workflowEnvironmentInterceptor).ExecuteWorkflow(0xc00b7ab9f0, {0x2745de8, 0xc0034d6600}, 0xc0045cb7e8)
	external/io_temporal_go_sdk/internal/internal_worker.go:1512 +0x136
go.temporal.io/sdk/internal.executeFunction({0x11802e0, 0xc0003f1ee0}, {0xc001471e08, 0x1, 0xc001420000})
	GOROOT/src/reflect/value.go:339 +0xc5
reflect.Value.Call({0x11802e0, 0xc0003f1ee0, 0x403c6c}, {0xc0045cb800, 0x1, 0x1})
	GOROOT/src/reflect/value.go:543 +0x814
... service backtrace for workflow ...
go.temporal.io/sdk/workflow.ExecuteActivity(...)
	external/io_temporal_go_sdk/internal/workflow.go:455 +0x185
go.temporal.io/sdk/internal.ExecuteActivity({0x2745f38, 0xc000439d70}, {0x11cb080, 0x254c158}, {0xc000474fc0, 0x1, 0x1})
	external/io_temporal_go_sdk/internal/workflow.go:511 +0x782
go.temporal.io/sdk/internal.(*workflowEnvironmentInterceptor).ExecuteActivity(0xc00b7ab9f0, {0x2745f38, 0xc000439dd0}, {0x29d3360, 0x6}, {0xc000474fc0, 0x1, 0x1})
	external/io_temporal_go_sdk/internal/internal_event_handlers.go:463 +0x545
go.temporal.io/sdk/internal.(*workflowEnvironmentImpl).ExecuteActivity(0xc000100e00, {{{0x0, 0x0}, {0xc00004bb00, 0xf}, 0x0, 0x0, 0x1a3185c5000, 0x0, 0x0, ...}, ...}, ...)
	external/io_temporal_go_sdk/internal/internal_event_handlers.go:452
go.temporal.io/sdk/internal.(*workflowEnvironmentImpl).GenerateSequence(...)
	external/io_temporal_go_sdk/internal/internal_decision_state_machine.go:833
go.temporal.io/sdk/internal.(*commandsHelper).getNextID(...)

tminusplus avatar Dec 16 '21 08:12 tminusplus

Are you regularly querying the workflow and getting "Attempt to generate a command before processing WorkflowTaskStarted event"?

This is currently a bug in the SDK: querying a workflow while it is waiting to start (via retry, cron, etc.) causes a panic. Once https://github.com/temporalio/temporal/issues/2300 is solved, the query should fail on the server side and never reach the worker, so it will no longer cause this panic.

cretz avatar Dec 16 '21 14:12 cretz

We are not querying the workflow at all. Essentially the workflow does the same as https://github.com/temporalio/samples-go/blob/main/branch/workflow.go#L34, where it spawns multiple concurrent activities.

One potential issue that might be causing it is that we loop through the futures twice. This could be improved, but the intention was that we would potentially spawn hundreds of activities from this workflow, so we wanted to batch them and ensure we only spawn a limited number of activities per batch. I reworked the example to replicate our workflow:

func SampleBranchWorkflow(ctx workflow.Context, totalBranches int) (result []string, err error) {
	ao := workflow.ActivityOptions{
		StartToCloseTimeout: 10 * time.Second,
	}
	ctx = workflow.WithActivityOptions(ctx, ao)

	var futures []workflow.Future
	for batch := 0; batch < 10; batch++ {
		for i := 1; i <= totalBranches; i++ {
			future := workflow.ExecuteActivity(ctx, SampleActivity, activityInput)
			futures = append(futures, future)
		}

		// Block until this batch finishes processing
		for _, future := range futures {
			err = future.Get(ctx, nil)
			if err != nil {
				return
			}
		}
	}

	// One last sweep to ensure all scheduled activities are complete
	for _, future := range futures {
		err = future.Get(ctx, nil)
		if err != nil {
			return
		}
	}

	return
}

tminusplus avatar Dec 16 '21 20:12 tminusplus

Ah, that may be different (this issue is about queries during retry in particular). Can you open a new issue? Also, you are looping through the futures more than twice: the second batch loops through the first batch's futures again, and so on. And based on that code, the last for loop provides no value, since it does exactly the same thing as the last step of the previous loop, which is guaranteed to have run 10 times before reaching it. Maybe you want to add futures = futures[:0] as the first step of your outer loop and remove the last loop? See the sketch below.
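
For illustration, a sketch of that suggestion (reusing SampleActivity and activityInput from the snippet above, and dropping the unused result value for brevity); each batch now waits only on its own futures before scheduling the next one:

func SampleBranchWorkflow(ctx workflow.Context, totalBranches int) error {
	ao := workflow.ActivityOptions{
		StartToCloseTimeout: 10 * time.Second,
	}
	ctx = workflow.WithActivityOptions(ctx, ao)

	var futures []workflow.Future
	for batch := 0; batch < 10; batch++ {
		// Reset the slice so each batch only tracks its own activities.
		futures = futures[:0]
		for i := 1; i <= totalBranches; i++ {
			futures = append(futures, workflow.ExecuteActivity(ctx, SampleActivity, activityInput))
		}

		// Block until this batch finishes before scheduling the next one.
		for _, future := range futures {
			if err := future.Get(ctx, nil); err != nil {
				return err
			}
		}
	}

	// No final sweep needed: every batch has already been waited on.
	return nil
}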

cretz avatar Dec 16 '21 20:12 cretz

Update for those reading this issue. The primary issue here (last few comments a different thing) is https://github.com/temporalio/temporal/issues/2300.

cretz avatar Mar 03 '22 14:03 cretz

This may be solved with https://github.com/temporalio/temporal/pull/2826. Once released, we will write a test confirming that queries can be issued during a workflow retry, a cron run, or a pending continue-as-new.

cretz avatar Jun 01 '22 13:06 cretz