sdk-go icon indicating copy to clipboard operation
sdk-go copied to clipboard

Unable to reproduce `workflow.ErrSessionFailed`, session gets recreated with same ID after process restarts

Open subhan-nadeem opened this issue 2 years ago • 5 comments

Expected Behavior

  • workflow.ErrSessionFailed is returned when attempting to reschedule an activity after the process on which a session exists is restarted.

Actual Behavior

  • After process restarts, session is recreated with the same ID and workflows are scheduled on the same task queue as the previous session.

Steps to Reproduce the Problem

  1. Start a local worker that executes below workflow, make note of the session ID logged to screen
  2. Kill the local worker, wait ~1 minute (default session heartbeat timeout is 20 seconds, at which point the session should be marked as 'failed')
  3. Restart the local worker. The session is recreated with the same ID and workflow.ErrSessionFailed is not returned.
 func MyWorkflow(
	ctx workflow.Context,
) (error) {
	ctx = workflow.WithActivityOptions(
		ctx,
		workflow.ActivityOptions{
			StartToCloseTimeout: 10 * time.Minute,
			HeartbeatTimeout:    1 * time.Minute,
			RetryPolicy: &temporal.RetryPolicy{
				MaximumAttempts: 1,
			},
		},
	)

	ctx, err := workflow.CreateSession(ctx, &workflow.SessionOptions{
		ExecutionTimeout: math.MaxInt64,
		CreationTimeout:  1 * time.Minute,
	})
	if err != nil {
		return errors.Wrap(err, "creating session context")
	}

	fmt.Println("SESSION ID: " + workflow.GetSessionInfo(ctx).SessionID)

	err = workflow.ExecuteActivity(
		ctx,
		MyActivity,
	).Get(ctx, nil)
	if err == nil {
		return errors.Wrap(err, "no error found")
	}
// kill the worker while first activity is in progress, then spin worker back up. this should cause second activity to execute
	err = workflow.ExecuteActivity(
		ctx,
		MyActivity,
	).Get(ctx, nil)
	if errors.Is(err, workflow.ErrSessionFailed) {
		return errors.Wrap(err, "found session failure error")
	} else {
		return errors.Wrap(err, "did not find session failure error")
	}
}

func MyActivity(ctx context.Context) error {
	time.Sleep(1 * time.Minute)
	return nil
}

Specifications

  • Version: v1.13.0
  • Platform: M1 MacOS

subhan-nadeem avatar Oct 01 '22 03:10 subhan-nadeem

Version: v1.13.0

@subhan-nadeem - We have made a lot of improvements since this version, most importantly #746. Can you upgrade and confirm that this issue persists?

cretz avatar Oct 03 '22 14:10 cretz

Version: v1.13.0

@subhan-nadeem - We have made a lot of improvements since this version, most importantly #746. Can you upgrade and confirm that this issue persists?

I ran the above test on v1.18.0.

After killing the worker, the workflow event history reported an ActivityTaskTimedOut failure for internalSessionCreationActivity with message: activity ScheduleToStart timeout Caused By: activity Heartbeat timeout (this seems like the intended behaviour)

After restarting the worker, the worker immediately reported a session failure, cancelled the hanging activity, and then the entire workflow (see logs below). This part I'm now confused about; is this intended behaviour? I would ideally like to catch the session failure in the workflow somehow and restart the workflow, rather than having it be cancelled.

12:11:34.508 Session failed                                                        log/with_logger.go:56                                                           
{
  "Attempt": 1,
  "Error": "activity error (type: internalSessionCreationActivity, scheduledEventID: 6, startedEventID: 0, identity: ): activity ScheduleToStart timeout (type: ScheduleToStart): activity Heartbeat timeout (type: Heartbeat)",
  "Namespace": "default",
  "RunID": "be99c277-f66e-4551-ad2e-829422505b09",
  "TaskQueue": "TASK_QUEUE",
  "WorkerID": "76929@LUSJCRXV6CV6K@",
  "WorkflowID": "0bb2c03d-c02a-4cec-bd95-9aeb3ef03109",
  "WorkflowType": "MyWorkflow",
  "sessionID": "cc31806d-4f10-42df-9da1-3e03078d41b6"
}

12:11:34.508 RequestCancelActivity                                                 log/with_logger.go:56                                                           
{
  "ActivityID": "11",
  "Attempt": 1,
  "Namespace": "default",
  "RunID": "be99c277-f66e-4551-ad2e-829422505b09",
  "TaskQueue": "TASK_QUEUE",
  "WorkerID": "76929@LUSJCRXV6CV6K@",
  "WorkflowID": "0bb2c03d-c02a-4cec-bd95-9aeb3ef03109",
  "WorkflowType": "Myworkflow"
}

subhan-nadeem avatar Oct 04 '22 19:10 subhan-nadeem

I did notice this however on v1.18.0:

  • If I kill the worker with the session and start it back up after its HeartbeatTimeout and before its CreationTimeout, the session gets recreated with the same ID without any session failures reported (this seems like a bug?)
  • The activity I run reports a heartbeat error during the time the worker is down, and reschedules itself for retry on the same task queue it was scheduled on initially. Because the original session worker died, I believe that task queue disappears and the activity hangs in the "Scheduled" state forever, even after I restart the session worker

subhan-nadeem avatar Oct 04 '22 19:10 subhan-nadeem

After restarting the worker, the worker immediately reported a session failure, cancelled the hanging activity, and then the entire workflow (see logs below). This part I'm now confused about; is this intended behaviour? I would ideally like to catch the session failure in the workflow somehow and restart the workflow, rather than having it be cancelled.

Yes, I believe this is the intended behavior and you would setup workflow retry policy to retry the workflow since your session failed. Or you can react to the session failure explicitly.

If I kill the worker with the session and start it back up after its HeartbeatTimeout and before its CreationTimeout, the session gets recreated with the same ID without any session failures reported (this seems like a bug?)

Is this after the session has started? Heartbeat timeout only applies after session start.

activity hangs in the "Scheduled" state forever, even after I restart the session worker

This may be a bug. The activity should be marked cancelled. I will check on this.

cretz avatar Oct 10 '22 15:10 cretz

The activity I run reports a heartbeat error during the time the worker is down, and reschedules itself for retry on the same task queue it was scheduled on initially. Because the original session worker died, I believe that task queue disappears and the activity hangs in the "Scheduled" state forever, even after I restart the session worker

What steps did you take to cause this behavior? I was not able to reproduce this, but maybe I need to kill the worker at a specific time.

For the original issue, along with the suggestion @cretz made we have an open feature to make sessions handle worker restart smoother https://github.com/temporalio/sdk-go/issues/937.

Quinn-With-Two-Ns avatar Oct 21 '22 22:10 Quinn-With-Two-Ns