sdk-go
Unable to reproduce `workflow.ErrSessionFailed`, session gets recreated with same ID after process restarts
Expected Behavior
- `workflow.ErrSessionFailed` is returned when attempting to reschedule an activity after the process on which a session exists is restarted.
Actual Behavior
- After process restarts, session is recreated with the same ID and workflows are scheduled on the same task queue as the previous session.
Steps to Reproduce the Problem
- Start a local worker that executes the workflow below; make note of the session ID logged to the screen
- Kill the local worker and wait ~1 minute (the default session heartbeat timeout is 20 seconds, at which point the session should be marked as 'failed')
- Restart the local worker. The session is recreated with the same ID and `workflow.ErrSessionFailed` is not returned.
```go
package main

import (
	"context"
	"fmt"
	"math"
	"time"

	"github.com/pkg/errors"
	"go.temporal.io/sdk/temporal"
	"go.temporal.io/sdk/workflow"
)

func MyWorkflow(ctx workflow.Context) error {
	ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
		StartToCloseTimeout: 10 * time.Minute,
		HeartbeatTimeout:    1 * time.Minute,
		RetryPolicy: &temporal.RetryPolicy{
			MaximumAttempts: 1,
		},
	})
	ctx, err := workflow.CreateSession(ctx, &workflow.SessionOptions{
		ExecutionTimeout: math.MaxInt64,
		CreationTimeout:  1 * time.Minute,
	})
	if err != nil {
		return errors.Wrap(err, "creating session context")
	}
	fmt.Println("SESSION ID: " + workflow.GetSessionInfo(ctx).SessionID)

	err = workflow.ExecuteActivity(ctx, MyActivity).Get(ctx, nil)
	if err == nil {
		// Note: errors.Wrap(nil, ...) returns nil, so construct a new error here.
		return errors.New("expected first activity to fail, got no error")
	}

	// Kill the worker while the first activity is in progress, then spin the
	// worker back up. This should cause the second activity to execute.
	err = workflow.ExecuteActivity(ctx, MyActivity).Get(ctx, nil)
	if errors.Is(err, workflow.ErrSessionFailed) {
		return errors.Wrap(err, "found session failure error")
	}
	return errors.Wrap(err, "did not find session failure error")
}

func MyActivity(ctx context.Context) error {
	time.Sleep(1 * time.Minute)
	return nil
}
```
Specifications
- Version: v1.13.0
- Platform: M1 macOS

> Version: v1.13.0
@subhan-nadeem - We have made a lot of improvements since this version, most importantly #746. Can you upgrade and confirm that this issue persists?
I ran the above test on v1.18.0.

After killing the worker, the workflow event history reported an `ActivityTaskTimedOut` failure for `internalSessionCreationActivity` with the message `activity ScheduleToStart timeout Caused By: activity Heartbeat timeout` (this seems like the intended behaviour).

After restarting the worker, the worker immediately reported a session failure, cancelled the hanging activity, and then cancelled the entire workflow (see logs below). This part I'm now confused about: is this intended behaviour? I would ideally like to catch the session failure in the workflow somehow and restart the workflow, rather than having it be cancelled.
```
12:11:34.508  Session failed  log/with_logger.go:56
{
  "Attempt": 1,
  "Error": "activity error (type: internalSessionCreationActivity, scheduledEventID: 6, startedEventID: 0, identity: ): activity ScheduleToStart timeout (type: ScheduleToStart): activity Heartbeat timeout (type: Heartbeat)",
  "Namespace": "default",
  "RunID": "be99c277-f66e-4551-ad2e-829422505b09",
  "TaskQueue": "TASK_QUEUE",
  "WorkerID": "76929@LUSJCRXV6CV6K@",
  "WorkflowID": "0bb2c03d-c02a-4cec-bd95-9aeb3ef03109",
  "WorkflowType": "MyWorkflow",
  "sessionID": "cc31806d-4f10-42df-9da1-3e03078d41b6"
}
12:11:34.508  RequestCancelActivity  log/with_logger.go:56
{
  "ActivityID": "11",
  "Attempt": 1,
  "Namespace": "default",
  "RunID": "be99c277-f66e-4551-ad2e-829422505b09",
  "TaskQueue": "TASK_QUEUE",
  "WorkerID": "76929@LUSJCRXV6CV6K@",
  "WorkflowID": "0bb2c03d-c02a-4cec-bd95-9aeb3ef03109",
  "WorkflowType": "MyWorkflow"
}
```
I did notice this, however, on v1.18.0:
- If I kill the worker with the session and start it back up after its `HeartbeatTimeout` and before its `CreationTimeout`, the session gets recreated with the same ID without any session failures reported (this seems like a bug?)
- The activity I run reports a heartbeat error during the time the worker is down, and reschedules itself for retry on the same task queue it was scheduled on initially. Because the original session worker died, I believe that task queue disappears and the activity hangs in the "Scheduled" state forever, even after I restart the session worker.
> After restarting the worker, the worker immediately reported a session failure, cancelled the hanging activity, and then the entire workflow (see logs below). This part I'm now confused about; is this intended behaviour? I would ideally like to catch the session failure in the workflow somehow and restart the workflow, rather than having it be cancelled.
Yes, I believe this is the intended behavior. You would set up a workflow retry policy to retry the workflow since your session failed, or you can react to the session failure explicitly.
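To illustrate the "react to the session failure explicitly" option, here is a hedged sketch (not from the issue) of how the tail of `MyWorkflow` could catch `workflow.ErrSessionFailed` and restart the run via continue-as-new instead of failing, so a fresh session can be created on a live worker:

```go
// Sketch only: react to a failed session inside the workflow by restarting
// the run with continue-as-new rather than propagating the failure.
err = workflow.ExecuteActivity(ctx, MyActivity).Get(ctx, nil)
if errors.Is(err, workflow.ErrSessionFailed) {
	// Release session resources before restarting.
	workflow.CompleteSession(ctx)
	// Start a fresh run of the same workflow; it will create a new session.
	return workflow.NewContinueAsNewError(ctx, MyWorkflow)
}
```

The alternative is to set `RetryPolicy` in the `StartWorkflowOptions` when starting the workflow, so the server retries the whole workflow after the session failure propagates out.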
> If I kill the worker with the session and start it back up after its `HeartbeatTimeout` and before its `CreationTimeout`, the session gets recreated with the same ID without any session failures reported (this seems like a bug?)
Is this after the session has started? Heartbeat timeout only applies after session start.
> activity hangs in the "Scheduled" state forever, even after I restart the session worker
This may be a bug. The activity should be marked cancelled. I will check on this.
> The activity I run reports a heartbeat error during the time the worker is down, and reschedules itself for retry on the same task queue it was scheduled on initially. Because the original session worker died, I believe that task queue disappears and the activity hangs in the "Scheduled" state forever, even after I restart the session worker
What steps did you take to cause this behavior? I was not able to reproduce this, but maybe I need to kill the worker at a specific time.
For the original issue, along with the suggestion @cretz made, we have an open feature request to make sessions handle worker restarts more smoothly: https://github.com/temporalio/sdk-go/issues/937.