sdk-go icon indicating copy to clipboard operation
sdk-go copied to clipboard

Canceling a context with multiple child contexts can be non deterministic

Open Quinn-With-Two-Ns opened this issue 1 year ago • 2 comments

Expected Behavior

Child context cancellation propagates in a deterministic order

Actual Behavior

Child context cancellation does not propagates in a deterministic order.

When a parent context cancels its' children it loops through a map. Map order in Go is not deterministic so the order child contexts are canceled is not deterministic.

Steps to Reproduce the Problem

func TestContextCancellationOrderDeterminism(t *testing.T) {
	/*
		Previously, child-contexts were stored in a map, preventing deterministic order when propagating cancellation.
		The order of branches being selected in this test was random, both for the first event and in following ones.

		In principle this should be fine, but it's possible for the effects of cancellation to trigger a selector's
		future-done callback, which currently records the *real-time*-first event as the branch to unblock, rather than
		doing something more safe by design (e.g. choosing based on state when the selector's goroutine is unblocked).

		Unfortunately, we cannot change the selector's behavior without introducing non-backwards-compatible changes to
		currently-working workflows.

		So the workaround for now is to maintain child-context order, so they are canceled in a consistent order.
		As this order was not controlled before, and Go does a pretty good job at randomizing map iteration order,
		converting non-determinism to determinism should be strictly no worse for backwards compatibility, and it
		fixes the issue for future executions.
	*/
	check := func(t *testing.T, separateStart, separateSelect bool) {
		env := newTestWorkflowEnv(t)
		act := func(ctx context.Context) error {
			return nil // will be mocked
		}
		wf := func(ctx Context) ([]int, error) {
			ctx, cancel := WithCancel(ctx)
			Go(ctx, func(ctx Context) {
				_ = Sleep(ctx, time.Minute)
				cancel()
			})

			// start some activities, which will not complete before the timeout cancels them
			ctx = WithActivityOptions(ctx, ActivityOptions{
				TaskQueue:              "",
				ScheduleToCloseTimeout: time.Hour,
				ScheduleToStartTimeout: time.Hour,
				StartToCloseTimeout:    time.Hour,
			})
			s := NewSelector(ctx)
			var result []int
			for i := 0; i < 10; i++ {
				i := i
				// need a child context, a future alone is not enough as it does not become a child
				cctx, ccancel := WithCancel(ctx)

				s.AddFuture(ExecuteActivity(cctx, act), func(f Future) {
					ccancel() // TODO: is this necessary to prevent leaks?  if it is, how can we make it not?
					err := f.Get(ctx, nil)
					if err == nil || !IsCanceledError(err) {
						// fail the test, this should not happen - activities must be canceled or it's not valid.
						t.Errorf("activity completion or failure for some reason other than cancel: %v", err)
					}
					result = append(result, i)
				})

				if separateStart {
					// yield so they are submitted one at a time, in case that matters
					_ = Sleep(ctx, time.Second)
				}
			}
			for i := 0; i < 10; i++ {
				if separateSelect {
					// yield so they are selected one at a time, in case that matters
					_ = Sleep(ctx, time.Second)
				}
				s.Select(ctx)
			}

			return result, nil
		}
		env.RegisterWorkflow(wf)
		env.RegisterActivity(act)

		// activities must not complete in time
		env.OnActivity(act, mock.Anything).After(5 * time.Minute).Return(nil)

		env.ExecuteWorkflow(wf)
		require.NoError(t, env.GetWorkflowError())
		var result []int
		require.NoError(t, env.GetWorkflowResult(&result))
		require.NotEmpty(t, result)
		assert.Equal(t, 0, result[0], "first activity to be created should be the first one canceled")
		assert.Equal(t, []int{1, 2, 3, 4, 5, 6, 7, 8, 9}, result[1:], "other activities should finish in a consistent (but undefined) order")
	}

	type variant struct {
		name           string
		separateStart  bool
		separateSelect bool
	}
	// all variants expose this behavior, but being a bit more exhaustive in the face
	// of decision-scheduling differences seems good.
	for _, test := range []variant{
		{"many in one decision", false, false},
		{"many started at once, selected slowly", false, true},
		{"started slowly, selected quickly", true, false},
		{"started and selected slowly", true, true},
	} {
		t.Run(test.name, func(t *testing.T) {
			check(t, test.separateStart, test.separateSelect)
		})
	}
}

Note: this test came from cadence go client where I noticed they had fixed this bug.

Quinn-With-Two-Ns avatar Aug 24 '23 15:08 Quinn-With-Two-Ns

Could this be fixed now using the new helper? https://github.com/temporalio/sdk-go/pull/1340

throwaway58383958484 avatar Mar 04 '24 16:03 throwaway58383958484

No since the keys of the map cannot be sorted. Likely we would just copy what Cadence did to resolve this problem https://github.com/uber-go/cadence-client/commit/dcaec7737070ebe0889f1c9bb57c2552c8bd7d86

Quinn-With-Two-Ns avatar Mar 04 '24 18:03 Quinn-With-Two-Ns