xk6-browser Flaky timeout in `Frame.waitForExecutionContext()`

We recently noticed test failures in CI for PRs #326 and #480 (jobs #1183 and #1197 respectively), where different tests timeout after 5m.

The most relevant stack traces seem to be:

From #1183:

goroutine 14087 [select, 4 minutes]:
github.com/grafana/xk6-browser/common.(*Frame).waitForExecutionContext(0xc0009dac40, 0x1b5da7d, 0x4)
   /home/runner/work/xk6-browser/xk6-browser/common/frame.go:543 +0x345
github.com/grafana/xk6-browser/common.(*Frame).Evaluate(0xc0009dac40, 0x1d80388, 0xc0007d7d10, 0x0, 0x0, 0x0, 0xc000799dc8, 0x1)
   /home/runner/work/xk6-browser/xk6-browser/common/frame.go:911 +0x1e5
github.com/grafana/xk6-browser/common.(*Page).Evaluate(0xc00071adc0, 0x1d80388, 0xc0007d7d10, 0x0, 0x0, 0x0, 0x1d80388, 0xc0007d7d10)
   /home/runner/work/xk6-browser/xk6-browser/common/page.go:506 +0x16b
github.com/grafana/xk6-browser/tests.TestPageGotoWaitUntilDOMContentLoaded(0xc0007cf800)
   /home/runner/work/xk6-browser/xk6-browser/tests/page_test.go:194 +0x317
testing.tRunner(0xc0007cf800, 0x1bbc8f0)
   /opt/hostedtoolcache/go/1.16.15/x64/src/testing/testing.go:1203 +0x203
created by testing.(*T).Run
   /opt/hostedtoolcache/go/1.16.15/x64/src/testing/testing.go:1248 +0x5d8

And from #1197:

goroutine 17203 [select, 4 minutes]:
github.com/grafana/xk6-browser/common.(*Frame).waitForExecutionContext(0xc004fcf180, {0x18e922f, 0x4})
   /home/runner/work/xk6-browser/xk6-browser/common/frame.go:543 +0x2d7
github.com/grafana/xk6-browser/common.(*Frame).document(0xc004fcf180)
   /home/runner/work/xk6-browser/xk6-browser/common/frame.go:317 +0x197
github.com/grafana/xk6-browser/common.(*Frame).waitForSelector(0xc004fcf180, {0x18e9594, 0x5}, 0x4a171b)
   /home/runner/work/xk6-browser/xk6-browser/common/frame.go:640 +0x1b2
github.com/grafana/xk6-browser/common.(*Locator).waitFor(0xc005e7e390, 0xc0009fbae0)
   /home/runner/work/xk6-browser/xk6-browser/common/locator.go:624 +0x94
github.com/grafana/xk6-browser/common.(*Locator).WaitFor(0xc005e7e390, {0x1b1af38, 0xc005d71d70})
   /home/runner/work/xk6-browser/xk6-browser/common/locator.go:617 +0x3cb
github.com/grafana/xk6-browser/tests.TestLocator.func17.1()
   /home/runner/work/xk6-browser/xk6-browser/tests/locator_test.go:177 +0x85
github.com/stretchr/testify/assert.didPanic(0xc005d71fb0)
   /home/runner/go/pkg/mod/github.com/stretchr/[email protected]/assert/assertions.go:1018 +0xc9
github.com/stretchr/testify/assert.NotPanics({0x1ae5920, 0xc000c06820}, 0xc005d71fb0, {0x0, 0x0, 0x0})
   /home/runner/go/pkg/mod/github.com/stretchr/[email protected]/assert/assertions.go:1089 +0x89
github.com/stretchr/testify/require.NotPanics({0x1af66a8, 0xc000c06820}, 0xc005941e90, {0x0, 0x0, 0x0})
   /home/runner/go/pkg/mod/github.com/stretchr/[email protected]/require/require.go:1486 +0x9e
github.com/grafana/xk6-browser/tests.TestLocator.func17(0x1af66a8, {0x1b2b220, 0xc00556c6e0})
   /home/runner/work/xk6-browser/xk6-browser/tests/locator_test.go:177 +0x150
github.com/grafana/xk6-browser/tests.TestLocator.func18(0x0)
   /home/runner/work/xk6-browser/xk6-browser/tests/locator_test.go:188 +0x186
testing.tRunner(0xc005106680, 0xc0050fe5c0)
   /opt/hostedtoolcache/go/1.17.12/x64/src/testing/testing.go:1259 +0x230
created by testing.(*T).Run
   /opt/hostedtoolcache/go/1.17.12/x64/src/testing/testing.go:1306 +0x727

The code path is different, but both cases are stuck here: https://github.com/grafana/xk6-browser/blob/00960bdc677348b983fedb8860ba9424522440c1/common/frame.go#L543-L550

It seems that it keeps waiting for an execution context which never arrives.

Here are the complete stack traces of both jobs, in case they're deleted from CI: waitforexecution-timeouts.zip

Aug 04 '22 13:08 imiric

Original conversation in slack: https://loadimpact.slack.com/archives/C033REJN4MR/p1660661072059919?thread_ts=1660658327.545839&cid=C033REJN4MR

I have found that the context is nullified which seems to occur on the back of a CDP event (either EventExecutionContextDestroyed or EventExecutionContextsCleared). This occurs before waitForExecutionContext is called :confused: So another race condition.

setContext main for url about:blank with ectxid 2
setContext utility for url about:blank with ectxid 1
requestStarted
Tried to setContext main for url http://127.0.0.1:52016/static/locators.html with ectxid 3 but main is already set, so returns early
onExecutionContextCreated
null context ectxid 3 -- nothing happens
null context ectxid 4 -- nothing happens
null context ectxid 2 -- it clears the corresponding context
null context ectxid 1 -- it clears the corresponding context
waitForExecutionContext deadlock

Two observations from @imiric:

One thing that I find confusing and think is a bug, is step 4 of what you outline above. I.e. this: https://github.com/grafana/xk6-browser/blob/c7b281b404de0dd21d31d3902791ecbb272c1b45/common/frame.go#L518-L522 Why do we return early if the context with that executionWorld is already set? Frame.setContext() is called when a new exec ctx is created before the frame navigates, so we should replace it with the newly created one, even if they're for the same executionWorld. I'm not sure why we haven't run into bugs because of this before, but I had a related issue during my refactoring and removing this fixed it.

I think that if waitForExecutionContext blocks, that means that all contexts were cleared, no new navigations happened, and the frame is effectively dead. So the frame context should be cancelled at that point, and this should unblock it, and we should return gracefully in some way.

So the next step is to work out why the context is not cancelled and cancel it when the browser context is cleared.

Aug 18 '22 11:08 ankur22

I have tried two fixes, but neither of them are suitable:

https://github.com/grafana/xk6-browser/pull/512: This adds a locking mechanism to ensure that the clearContext event is worked on before the createContext. I believe that this isn't maintainable and adds too much complexity to solve this one issue. There's a chance that this will not work if the events are seconds off from each other.
https://github.com/grafana/xk6-browser/pull/524: This hits at the root of the issue, which is trying to ensure that the order of the incoming events is kept throughout the system. It's a simpler solution and easier to reason about. The issues are:
- It requires a buffered channels, which basically hide other deadlocks until they fill up in complex tests.
- The ordering of the handling of events is changed and no one is sure how that will behave in the wild.
- Another option could have been to spin up goroutines in the handler themselves whenever they're likely to work with the event loop, but this isn't maintainable and again we don't fully understand how this could affect the behaviour of the system.

An alternative proposal is now being worked on which is to use a FSM within the handler that is handling the EventExecutionContextsCleared and EventExecutionContextCreated. This FSM will be used to check the current state of the system, validate the current event that is being handled, action it or store it for later use.

Sep 27 '22 14:09 ankur22

In the sequence diagram (png) I've outlined how the stateful/FSM handler system will intercept and deal with incoming events. Here are the details of the actors:

Init is a generalisation on a processes which involves creating a new StatefulHandler and a FSM, which will be created alongside the current handler.
Emit is the BaseEventEmitter.emit, and more specifically the emitEvent closure.
StatefulHandler is the new handler that will wrap the existing handler to intercept incoming events and hand them to the existing handler in the correct order.
FSM is currently undefined in how its internals will work, and may also become part of StatefulHandler, but in the diagram it:
- knows the valid states and transitions;
- keeps track of the current state;
- stores the out of order events;
- returns the stored events when asked for them.
The Handler is the current handler.

Sep 27 '22 14:09 ankur22

Thanks for the nice sequence diagram :blush::thumbsup:

I think your design looks good at a glance. It's difficult to analyze without seeing some PoC code, but here are some initial thoughts:

The state transition DAG should make a createContext -> createContext transition invalid. Yet we do get that scenario for all new frames; one createContext for the main execution world and one for utility.

So we'll either need to parametrize the event for each execution context, or use two separate FSMs. :thinking: I think separate ones wouldn't make sense, since it's possible to receive the ExecutionContextsCleared event, which should delete all execution contexts... Not sure, but we'll need to handle this somehow.
What would be the purpose of storing out of order events? Will we try to sort and reapply them in some way?

The only reason they would be out of order is if we internally call event handlers out of order (i.e. what you were trying to fix in #524), but otherwise we should receive them in the correct order from the browser. If we don't, then that's worth erroring/panicking about and aborting the iteration.

So assuming we can maintain the order from CDP, I think we should raise an error in this case. For createContext/clearContext it would be fairly trivial to reorder, but in more complex cases (and arguably even in this one) we shouldn't apply patches to known buggy behavior, but try to fix it if possible.

Sep 28 '22 12:09 imiric

@imiric thanks for reviewing it. Yep, I should have a PR ready soon for you to look at.

The state transition DAG should make a createContext -> createContext transition invalid. Yet we do get that scenario for all new frames; one createContext for the main execution world and one for utility.

Yes, good point. I didn't add that in the diagram, but well aware of it. I've added a TODO in my branch so that I don't forgot though.

What would be the purpose of storing out of order events? Will we try to sort and reapply them in some way?

We need to be able to detect that the event are out of order, and remember that this has occurred so that we can "replay" the out of order event later. Not sure how else I could do that 🤔

So assuming we can maintain the order from CDP, I think we should raise an error in this case. For createContext/clearContext it would be fairly trivial to reorder, but in more complex cases (and arguably even in this one) we shouldn't apply patches to known buggy behavior, but try to fix it if possible.

Not sure I understand what you're suggesting here. AFAICT, all out of order CDP events for this particular scenario (clearContext/createContext) are to do with how we're placing the events on the goroutine internally during emit, which https://github.com/grafana/xk6-browser/pull/524 tried to solve. So, this current implementation of a StatefulHandler and FSM will assume that all incoming events to the handler are out of order for that reason, and not because Chrome is sending out of order CDP events.

If we want to detect out of order CDP events then we will need to place a FSM in the event_emitter.go file, and yep, I agree we should error/panic in that case, but I think that's out of scope for this fix.

Sep 28 '22 13:09 ankur22

Yes, good point. I didn't add that in the diagram, but well aware of it. I've added a TODO in my branch so that I don't forgot though.

:thumbsup:

We need to be able to detect that the event are out of order, and remember that this has occurred so that we can "replay" the out of order event later. Not sure how else I could do that

OK, that's what I was wondering about; how we planned to handle these out of order events.

I was loosely suggesting that we should probably first address the issue that we internally receive events out of order, either with something like #524, or, since you mentioned that approach eventually leads to different deadlocks, something else. I.e. we should ensure that both CDP and internal events are received in the same order they're emitted. Intuitively, a simple global channel might do the trick, but we'll need to experiment and see what works best.

Once that issue is resolved, then we wouldn't need patches for out-of-order events, figuring out how to sort and reapply them, etc., and we could assume that any such scenario can raise an error and end the iteration.

I think this would address two critical issues, without adding workarounds we'll have to (hopefully) remove later, and simplifying this fix for addressing functional race conditions.

Sep 29 '22 08:09 imiric

Intuitively, a simple global channel might do the trick, but we'll need to experiment and see what works best.

Sounds like we're talking about different solutions. Let's discuss in the catch up and I'll update it here once we've come to an agreement.

Sep 29 '22 09:09 ankur22

Ok, WDYT?:

To ensure that the event publishing goroutines don't break the ordering of the emitted events, we could get the emitEvent goroutines to synchronise. What I mean by this is that the events could be stored in a queue per handler. So now the goroutine worker would pop from the handler's queue the job (event and data), and at any one time only one goroutine can publish to that one handler.

Related PR: https://github.com/grafana/xk6-browser/pull/555

Sep 29 '22 16:09 ankur22

We believe that this has been resolved (for now) with PR #555 (issue #553, which was created due to this and the deadlock issue #496). We have ran several tests locally, and in the cloud and it hasn't resurfaced yet. There's a chance that this could still occur though (obviously in case we missed anything) if the CDP events are sent out of order by the browser.

Oct 07 '22 15:10 ankur22

xk6-browser xk6-browser copied to clipboard

Flaky timeout in `Frame.waitForExecutionContext()`

xk6-browser
xk6-browser copied to clipboard