Protect against wrong ordering on events - events before the workflow start event occurs

Open yaananth opened this issue 2 years ago • 0 comments

Situation

This high level code:

activeJobs := set.NewSet[string]()  // section 1
workflow.Go(ctx, func(ctx workflow.Context) {  // section 2
   ...
   workflow.CreateSubWorkflowInstance[result.Conclusion](ctx, workflow.SubWorkflowOptions{
      InstanceID: jobInstanceID,....Get(ctx)  // section 2, label: B
})
...
for activeJobs.Len() > 0 { // section 1
   ...
   if err := workflow.SignalWorkflow(ctx, jobInstanceID, signals.Canceled,  // section 1, label: A
   ...
}

Creates two go-routines (backed by co-routines) in go-workflows:

  • Handling section 1
  • Handling section 2

We run one go-routine until it blocks, and then give the other go-routines a chance to move forward.

So, we can end up with the following execution order: section 1 -> section 2

Meaning, for jobInstanceID, these are the events:

EventType_SignalReceived (from label: A) -> EventType_WorkflowExecutionStarted (from label: B)
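To make the interleaving concrete, here is a toy model of "run until blocked, then move on". This is only a sketch, not the actual go-workflows scheduler: `step`, `coroutine`, `schedule` and `simulate` are hypothetical names, and each section is reduced to the events it emits for jobInstanceID.

```go
package main

import "fmt"

// step is one unit of work inside a coroutine (hypothetical model).
type step struct {
	event  string // event emitted for jobInstanceID, "" if none
	blocks bool   // true if the coroutine yields after this step
}

type coroutine struct {
	name  string
	steps []step
	pc    int // index of the next step to run
}

func (c *coroutine) done() bool { return c.pc >= len(c.steps) }

// schedule runs each coroutine in turn until it blocks or finishes,
// repeating until nothing can make progress. It returns the emitted
// events in the order they were produced.
func schedule(cs []*coroutine) []string {
	var events []string
	for {
		progressed := false
		for _, c := range cs {
			for !c.done() {
				s := c.steps[c.pc]
				c.pc++
				progressed = true
				if s.event != "" {
					events = append(events, s.event)
				}
				if s.blocks {
					break // yield to the next coroutine
				}
			}
		}
		if !progressed {
			return events
		}
	}
}

// simulate models the snippet above: section 1 reaches label A before
// section 2 gets a chance to reach label B.
func simulate() []string {
	section1 := &coroutine{name: "section 1", steps: []step{
		{event: "", blocks: false},                        // activeJobs := set.NewSet[string]()
		{event: "EventType_SignalReceived", blocks: true}, // label A
	}}
	section2 := &coroutine{name: "section 2", steps: []step{
		{event: "EventType_WorkflowExecutionStarted", blocks: true}, // label B
	}}
	return schedule([]*coroutine{section1, section2})
}

func main() {
	// prints [EventType_SignalReceived EventType_WorkflowExecutionStarted]
	fmt.Println(simulate())
}
```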

What happens

When that happens, the signal received event is handled first. Its handler calls e.workflow.Continue(), which in turn calls w.s.Execute()

But wait! The workflow scheduler (that's the s) was never set! It gets set in NewWorkflow, which is called from handleWorkflowExecutionStarted.

So we get

panic: runtime error: invalid memory address or nil pointer dereference [recovered]
        panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0xd95e64]

goroutine 178 [running]:
go.opentelemetry.io/otel/sdk/trace.(*recordingSpan).End.func1()
        /go/pkg/mod/go.opentelemetry.io/otel/sdk@<version>/trace/span.go:359 +0x34
go.opentelemetry.io/otel/sdk/trace.(*recordingSpan).End(0xc000540000, {0x0, 0x0, 0x0})
        /go/pkg/mod/go.opentelemetry.io/otel/sdk@<version>/trace/span.go:398 +0xafb
panic({0x1236920, 0x1b27c30})
        /usr/local/go/src/runtime/panic.go:890 +0x267
github.com/cschleiden/go-workflows/internal/workflow.(*workflow).Continue(0x0)
        /go/pkg/mod/github.com/cschleiden/go-workflows@<version>/internal/workflow/workflow.go:88 +0x24
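The same failure can be reproduced outside go-workflows with a few stand-in types. This is a minimal sketch: `scheduler`, `workflow`, `executor` and `replay` below are hypothetical and only mirror the call chain described above (signal handler -> Continue() -> Execute()); they are not the library's real types.

```go
package main

import "fmt"

// Hypothetical stand-ins, not the real go-workflows types.
type scheduler struct{ runs int }

func (s *scheduler) Execute() { s.runs++ }

type workflow struct{ s *scheduler }

// Continue reads w.s, so calling it on a nil *workflow panics.
func (w *workflow) Continue() { w.s.Execute() }

// executor.workflow stays nil until the start event has been handled.
type executor struct{ workflow *workflow }

func (e *executor) handleWorkflowExecutionStarted() {
	e.workflow = &workflow{s: &scheduler{}}
}

func (e *executor) handleSignalReceived() { e.workflow.Continue() }

// replay applies the events in order and turns a panic into an error
// so both orderings can be compared side by side.
func replay(events []string) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("recovered: %v", r)
		}
	}()
	e := &executor{}
	for _, ev := range events {
		switch ev {
		case "WorkflowExecutionStarted":
			e.handleWorkflowExecutionStarted()
		case "SignalReceived":
			e.handleSignalReceived()
		}
	}
	return nil
}

func main() {
	// Expected order: no error.
	fmt.Println(replay([]string{"WorkflowExecutionStarted", "SignalReceived"}))
	// Signal first: nil pointer dereference, reported as an error here.
	fmt.Println(replay([]string{"SignalReceived", "WorkflowExecutionStarted"}))
}
```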

What should we do

We can either handle it and fail OR fix it.

The suggestion is to fix it, similar to what Azure Durable Task did (thanks @cschleiden for pointing this out). That approach is a band-aid applied after the situation occurs.

From @cschleiden: Maybe we could at least do that on the way in (preventing the situation, but fixing it rather than failing): https://github.com/cschleiden/go-workflows/blob/main/internal/history/grouping.go, if the events come in as part of the same execution. This might need more thought.
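A rough sketch of what "fixing it on the way in" could mean: before a batch of events for one instance is executed, move the start event to the front and keep the relative order of everything else. The `event` type and `startFirst` function are hypothetical and are not taken from grouping.go; this also only helps when both events arrive in the same batch.

```go
package main

import "fmt"

type eventType int

const (
	workflowExecutionStarted eventType = iota
	signalReceived
)

// event is a simplified, hypothetical stand-in for a history event.
type event struct {
	Type eventType
	ID   string
}

// startFirst returns a copy of events in which any
// workflowExecutionStarted event comes first. The relative order of
// all other events is preserved.
func startFirst(events []event) []event {
	out := make([]event, 0, len(events))
	var rest []event
	for _, ev := range events {
		if ev.Type == workflowExecutionStarted {
			out = append(out, ev)
		} else {
			rest = append(rest, ev)
		}
	}
	return append(out, rest...)
}

func main() {
	in := []event{
		{Type: signalReceived, ID: "signal-A"},
		{Type: workflowExecutionStarted, ID: "start-B"},
	}
	for _, ev := range startFirst(in) {
		fmt.Println(ev.ID) // prints start-B, then signal-A
	}
}
```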

yaananth, Nov 17 '22 16:11