go-workflows
Protect against wrong ordering of events - events arriving before the workflow start event
Situation
This high-level code:

```go
activeJobs := set.NewSet[string]() // section 1

workflow.Go(ctx, func(ctx workflow.Context) { // section 2
	...
	workflow.CreateSubWorkflowInstance[result.Conclusion](ctx, workflow.SubWorkflowOptions{
		InstanceID: jobInstanceID,
		...
	}).Get(ctx) // section 2, label: B
})

...

for activeJobs.Len() > 0 { // section 1
	...
	if err := workflow.SignalWorkflow(ctx, jobInstanceID, signals.Canceled, ...); err != nil { // section 1, label: A
		...
	}
}
```
Creates two goroutines (backed by coroutines) in go-workflows:
- one handling section 1
- one handling section 2

We proceed with a goroutine until it blocks and then give the other goroutines a chance to move forward.
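For intuition, here is a minimal, self-contained sketch of this kind of cooperative scheduling. It is illustrative only; the type and function names are made up and this is not go-workflows' actual scheduler. Each coroutine runs until it yields (its stand-in for "blocking"), and the order in which the scheduler steps the coroutines decides whether label A or label B executes first.

```go
package main

import "fmt"

// coroutine is a cooperatively scheduled unit: it runs until it yields and
// then hands control back to the scheduler.
type coroutine struct {
	resume chan struct{} // scheduler -> coroutine: keep running
	done   chan struct{} // coroutine -> scheduler: blocked, or closed when finished
}

func spawn(body func(yield func())) *coroutine {
	c := &coroutine{resume: make(chan struct{}), done: make(chan struct{})}
	yield := func() {
		c.done <- struct{}{} // report "blocked" to the scheduler
		<-c.resume           // wait to be scheduled again
	}
	go func() {
		<-c.resume // wait for the first scheduling
		body(yield)
		close(c.done) // report "finished"
	}()
	return c
}

// step runs the coroutine until it blocks or finishes.
func (c *coroutine) step() {
	c.resume <- struct{}{}
	<-c.done
}

func main() {
	// Two "sections" as in the workflow above. Which label runs first depends
	// purely on the order in which the scheduler steps the coroutines.
	section1 := spawn(func(yield func()) {
		fmt.Println("label A: SignalWorkflow(jobInstanceID, Canceled)")
		yield()
	})
	section2 := spawn(func(yield func()) {
		fmt.Println("label B: CreateSubWorkflowInstance(jobInstanceID)")
		yield()
	})

	// Stepping section 1 before section 2 reproduces the problematic order:
	// the signal is produced before the sub-workflow start.
	section1.step()
	section2.step()
}
```

With the opposite stepping order, label B would run first and the resulting history would be well formed.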
So, we can end up with the following execution order: section 1 -> section 2.
Meaning, for jobInstanceID, these are the events: EventType_SignalReceived (from label A) -> EventType_WorkflowExecutionStarted (from label B).
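Spelled out as data, the invariant that breaks for jobInstanceID looks roughly like this. The event struct and the check are illustrative sketches, not go-workflows types; only the event type names come from the description above.

```go
package main

import "fmt"

// event is an illustrative stand-in for a history event.
type event struct {
	Type string
	From string // which label in the workflow code produced it
}

// startedBeforeSignals reports whether no signal is recorded before the
// WorkflowExecutionStarted event - the invariant violated in this scenario.
func startedBeforeSignals(history []event) bool {
	started := false
	for _, e := range history {
		switch e.Type {
		case "EventType_WorkflowExecutionStarted":
			started = true
		case "EventType_SignalReceived":
			if !started {
				return false
			}
		}
	}
	return true
}

func main() {
	// The problematic history for jobInstanceID described above.
	history := []event{
		{Type: "EventType_SignalReceived", From: "label A"},
		{Type: "EventType_WorkflowExecutionStarted", From: "label B"},
	}
	for _, e := range history {
		fmt.Printf("%s (from %s)\n", e.Type, e.From)
	}
	fmt.Println("well formed:", startedBeforeSignals(history)) // false
}
```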
What happens
When that happens, the signal-received event is handled first, which calls e.workflow.Continue(), which calls w.s.Execute().
But wait! The workflow scheduler (that's the s) was never set! It only gets set in NewWorkflow, which is called from handleWorkflowExecutionStarted.
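This is plain Go nil-pointer behavior. Here is a minimal sketch of the same shape; the type and method names are chosen to mirror the stack trace below, not the actual internal code:

```go
package main

import "fmt"

type scheduler struct{}

func (s *scheduler) Execute() { fmt.Println("executing workflow") }

type workflow struct {
	s *scheduler // only set by a constructor like NewWorkflow
}

// Continue forwards to the scheduler, as (*workflow).Continue does in the trace.
func (w *workflow) Continue() {
	w.s.Execute() // reading w.s dereferences w, which is nil here -> panic
}

func main() {
	var w *workflow // never constructed, because no WorkflowExecutionStarted was handled yet
	w.Continue()    // panic: runtime error: invalid memory address or nil pointer dereference
}
```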
So we get:

```
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0xd95e64]
goroutine 178 [running]:
go.opentelemetry.io/otel/sdk/trace.(*recordingSpan).End.func1()
	/go/pkg/mod/go.opentelemetry.io/otel/[email protected]/trace/span.go:359 +0x34
go.opentelemetry.io/otel/sdk/trace.(*recordingSpan).End(0xc000540000, {0x0, 0x0, 0x0})
	/go/pkg/mod/go.opentelemetry.io/otel/[email protected]/trace/span.go:398 +0xafb
panic({0x1236920, 0x1b27c30})
	/usr/local/go/src/runtime/panic.go:890 +0x267
github.com/cschleiden/go-workflows/internal/workflow.(*workflow).Continue(0x0)
	/go/pkg/mod/github.com/cschleiden/[email protected]/internal/workflow/workflow.go:88 +0x24
```
What should we do
We can either handle it and fail, OR fix it.
The suggestion is to fix it, similar to what Azure Durable Task did (thanks @cschleiden for pointing this out); note that this is a fix/band-aid applied after the situation occurs.
From @cschleiden: maybe we could at least do that on the way in (preventing the situation rather than fixing or failing after the fact): https://github.com/cschleiden/go-workflows/blob/main/internal/history/grouping.go. If the events come in as part of the same execution, this might need more thought.
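For illustration, here is a minimal sketch of what such a guard "on the way in" could look like, assuming a hypothetical reorder step over the incoming events per instance. The types, names, and placement are made up; the real change would presumably live in or near grouping.go.

```go
package main

import (
	"fmt"
	"sort"
)

// event is an illustrative stand-in for a history event; only Type matters here.
type event struct {
	Type string
}

const eventTypeWorkflowExecutionStarted = "EventType_WorkflowExecutionStarted"

// reorderStartedFirst is a hypothetical guard that could run while grouping
// events per instance: if a WorkflowExecutionStarted event is present but not
// first, move it to the front so the instance is initialized before any signal
// is handled. A sketch of the idea discussed above, not the actual implementation.
func reorderStartedFirst(events []event) []event {
	sort.SliceStable(events, func(i, j int) bool {
		return events[i].Type == eventTypeWorkflowExecutionStarted &&
			events[j].Type != eventTypeWorkflowExecutionStarted
	})
	return events
}

func main() {
	events := []event{
		{Type: "EventType_SignalReceived"},
		{Type: eventTypeWorkflowExecutionStarted},
	}
	for _, e := range reorderStartedFirst(events) {
		fmt.Println(e.Type) // the started event now comes first
	}
}
```

Whether simply moving the started event to the front is safe when the events arrive as part of the same execution is exactly the open question raised above.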