xk6-browser Intermittent test failures and flaky tests

For a few months now we've noticed an increase in intermittent test failures (e.g. #421), as well very frequent flaky tests and CI failures (#272, #254). This has become a major issue for both users and internal devs working on the extension.

From initial work explored in #281 that aimed to fix #272, it became apparent that these issues stem from the racy nature of how we process CDP events, which can trigger one or more internal events. This is explained in detail in this discussion, along with a proposed architecture to address these issues.

An ancillary problem with the current architecture is that these event flows and race conditions are difficult to trace and understand, and since they're intermittent and difficult to test we're never sure if we're introducing more of them as we make changes.

Example

To illustrate an example of a race condition that should be addressed, consider the following scenario from TestPageGoto (failure example in CI build #1123):

xk6-browser sends the Page.navigate CDP command:

-> {"id":20,"sessionId":"75F7E22B7AD1085646FE46C053DD6729","method":"Page.navigate","params":{"url":"http://127.0.0.1:40721/static/empty.html","frameId":"AF0723A7356FB62639D7BA483EBF615E"}}

Done here: https://github.com/grafana/xk6-browser/blob/6baeca02e7515dc5b39c05601078e27688ee0e88/common/frame_manager.go#L617

The browser responds with the load Page.lifecycleEvent:
```
<- {"method":"Page.lifecycleEvent","params":{"frameId":"E0C85D008DBA499F54F24A284F032911","loaderId":"B16AD01C43E1F176E80ED03C4F35081E","name":"load","timestamp":564.999827},"sessionId":"8CC5472C3955F483AFA9B01CA9ED3B2C"}
```
... which we process before we've had the chance to setup the waiter for it here: https://github.com/grafana/xk6-browser/blob/6baeca02e7515dc5b39c05601078e27688ee0e88/common/helpers.go#L156-L169

This leads to us missing the event, and waiting for a second load event which never arrives, ultimately timing out at the default 30s.

This is a very simple scenario, and other more complex ones are common as well. For example, the race condition between calling ElementHandle.click() and Page.waitForNavigation(), or calling Page.waitForSelector() while a hard navigation is in progress, which resets the ExecutionContext, so we end up evaluating the injected.waitForSelector() JS call with a stale ExecutionContext, causing the cryptic Cannot find context with specified id (-32000) error, etc.

Proposed Solution

In the #142 discussion we mentioned removing the internal event-based system in favor of an event-driven finite-state machine. This might be overkill and would introduce additional complexity, so we might want to look into simpler solutions by e.g. synchronizing CDP and internal event processing in such a way that waiters are setup before the CDP event is sent, ensuring we never have a chance to miss an incoming event. This solution was explored in #281, but we had other test issues with that approach and never merged it.

OTOH, targetted solutions like this would have to be done on a case-by-case basis, whereas a general removal of the internal event-based system and introduction of an FSM should work in all cases, so that might be preferable. Some investigation is needed to settle on the better approach.

Another solution (#428) that would improve things is exposing more async APIs. E.g. in the click()/waitForNavigation() example above, if both of these methods returned a Promise, we could document that the user must use Promise.all() to synchronize both calls, in order to setup waitForNavigation() before click() is called. You can see that Playwright mentions this approach in their documentation.

Jun 30 '22 12:06 inancgumus

...FSM... might be overkill and would introduce additional complexity, so we might want to look into simpler solutions by e.g. synchronizing CDP and internal event processing in such a way that waiters are setup before the CDP event is sent, ensuring we never have a chance to miss an incoming event.

It seems to me that the frame navigations cause most of these issues. What about making a frame-specific halt-the-world event that stops processing the events until preconditions are met (similar to the older Go garbage collector's stop-the-world mechanism—but for a specific frame).

For example, imagine a frame navigates, in that case, we're going to stop receiving events both from the browser and our internals until we set up the relevant event listeners and execution contexts in place. Then we'll unlock the world and let the system components of that frame start processing the events.

Hopefully, maybe we can do it using #428. Like Puppeteer and Playwright, we can put the relevant actions into the same promise. That would hopefully remove the need for a frame-specific barrier.

Jul 05 '22 09:07 inancgumus

I've been exploring two potential solutions to this issue. Here's a summary that we can discuss in a meeting I'll schedule for Monday:

A smaller, focused, fix for just the navigation race that involves returning a Promise from FrameManager.NavigateFrame(), which we're already working on as part of #428. This Promise could be fulfilled once FrameManager.frameNavigated() is called, and we wouldn't need to emit the internal EventFrameNavigation.

The current indirection in FrameManager.NavigateFrame() of calling FrameSession.navigateFrame() and waiting for EventFrameNavigation, which is actually emitted in the same FrameManager instance(!), is a bit silly...

I'm not entirely sure if this implementation would work without some experimentation, but I think it's feasible, and would definitely be less total effort than the second approach.
A major refactor of our internal event system, essentially getting rid of most of it, and a minor refactor of how we currently handle CDP events.

First of all, let's describe the problems with our internal event system:
- As mentioned in the GH discussion, trying to understand the flow of execution is confusing, since events can be emitted by anything that embeds BaseEventEmitter. As you can see from the table below, that's a lot of sources.
- In most cases, internal events are mapped to or wrap CDP events. See event_emitter.go. So what we do is 1) receive the CDP event, 2) run our CDP event handlers, 3) emit the internal event, 4) run our internal event handlers. To simplify things, we can just listen to the original CDP event instead and avoid the indirection.
- In many cases, events are emitted but not consumed by anything. Or they're used as constants only, and not emitted or consumed. Or not used at all, and just placeholders for future functionality. Or they're consumed by the same emitter that produced it. :facepalm: See the table below. Either that, or I wasn't able to find where they're produced or consumed, which is again part of the confusion.
- There are two ways of listening for events:
  1. Registering with on() and reading from the Event channel.
  2. Using createWaitForEventHandler() and waitForEvent().
  We know that the second way is racy—that is what started this issue—, and having two ways of doing things adds to the confusion, as it's often not easy to see where events are produced and consumed. The table below is an attempt of tracking emitters and consumers, but it might be wrong or incomplete because of this.

Internal event emitters (producers) and consumers

Producer	Event	Consumer(s)
None¹	`EventBrowserDisconnected`	`Browser`
None²	`EventBrowserContextClose`	`BrowserContext`
`BrowserContext`	`EventBrowserContextPage`	`BrowserContext`
`Connection`	`EventConnectionClose`	`Browser`
	`Target.attachedToTarget` (CDP)	`Browser`
	`Target.detachedFromTarget` (CDP)	`Browser`
`Frame`	`EventFrameAddLifecycle`	`Frame`, `FrameManager`, `Page`
	`EventFrameRemoveLifecycle`	None
	`EventPageDOMContentLoaded`	None
	`EventPageFrameNavigated`	`FrameSession`
	`EventPageLoad`	None
`FrameManager`	`EventFrameNavigation`	`Barrier`, `FrameManager`, `Page`
	`EventPageFrameAttached`	`FrameSession`
	`EventPageFrameDetached`	`FrameSession`
	`EventPageRequest`	None
	`EventPageRequestFailed`	None
	`EventPageRequestFinished`	None
	`EventPageResponse`	None
`FrameSession`	`EventPageError`	None
`Page`	`EventPageClose`	None
	`EventPageCrash`	None
`Session`	`EventSessionClosed`	None
	All session CDP events (`Page.lifecycleEvent`, `Network.requestWillBeSent`, etc.)	`FrameSession`, `NetworkManager`
`Worker`	`EventWorkerClose`	None
None	`EventPageConsole`, `EventPageDialog`, `EventPageDownload`, `EventPageFileChooser`, `EventPagePopup`, `EventPageWebSocket`, `EventPageWorker`	None

NOTES: 1: Not actually emitted by anything. We just use it as a constant value in Browser.On(). 2: Not actually emitted by anything. We just use it as a constant value in BrowserContext.WaitForEvent().

The general approach of tackling the major refactor would be:

Creation of a top-level cdp package that will have most of the current common.Connection, common.Session and anything else related to CDP communication.
Creation of a cdp.EventWatcher struct that will be in charge of adding listeners and notifying them with the CDP event data. This is inspired by Puppeteer's LifecycleWatcher, and the implementation would be similar to our current BaseEventEmitter. The EventWatcher could be initialized e.g. once per Frame, and each Frame instance would be notified only of events that are meant for it.
Getting rid of the internal event emission system (BaseEventEmitter, etc.). If we don't want to expose CDP events directly to the rest of the system (i.e. it would be good to not have common or other packages import cdproto and other CDP modules directly), we might want to keep our internal events and have a small translation layer between CDP and these internal events.

Even though it would be a much more laborious process than any focused fixes, and it might take us a few weeks to finish, possibly the entire cycle for one person, I think we'd end up with a simpler and overall easier to understand system. We don't need to go to extreme lengths of implementing something unique like an FSM, and considering this is not the approach taken by either Puppeteer nor Playwright, I think it would serve us well to not deviate too far from their work.

Jul 07 '22 15:07 imiric

@inancgumus I considered your suggestion, but I don't think we want to stop receiving events. You're describing the functionality of a queue, i.e. don't process new events until a previous one completes, which is what channels essentially give us. So, yes, we ultimately want to synchronize in that way, but we can do that by not blocking and using a Promise, as I described above.

That said, I don't think we should do any focused fixes like this, but instead take the time to work on a major refactor that will address this and other race conditions, while ending up with a more maintainable and understandable code base.

Jul 07 '22 15:07 imiric

I was suggesting something else, nevertheless, yes, I was also on the side of using Promises to do some part of it 👍

Hopefully, maybe we can do it using https://github.com/grafana/xk6-browser/issues/428. Like Puppeteer and Playwright, we can put the relevant actions into the same promise. That would hopefully remove the need for a frame-specific barrier.

And, as I explained in #142, +1 for major refactoring. Anyway, let's discuss the rest of the details in the architectural refactoring and async API meeting 🤝

Jul 07 '22 18:07 inancgumus

@grafana/k6browserdev What should we do with this issue? The main race conditions mentioned in the description have now been fixed, and the major refactoring approach discussed above is not a high priority to address this, and has been downscoped from our initial plans.

We do still have intermittent test failures and flaky tests, but it's not clear what the "definition of Done" is for this particular issue.

So I vote to close this, and open new issues for specific flaky behavior. Or we could reuse this issue as a general one for all such behavior, but there's a lot of noise in the discussion above, and we'd need to change the description and add a TODO list, so we might as well open a new issue even in that case.

Oct 12 '22 16:10 imiric

I also prefer to have issues per flaky test (or behaviour) as that way it could be discussed per case what hte solution can be

Oct 12 '22 16:10 mstoykov

So I vote to close this, and open new issues for specific flaky behavior.

I also prefer to have issues per flaky test (or behaviour) as that way it could be discussed per case what hte solution can be

Agreed with both of these 👍

Oct 12 '22 17:10 ankur22

IMO, we can close this one and give a reference to this from the new issues when needed.

Oct 12 '22 18:10 inancgumus

OK, closing this, and let's open one issue per flaky behavior we find. Though maybe not immediately if it happens once in CI, but if it really becomes a problem.

Oct 13 '22 09:10 imiric

xk6-browser xk6-browser copied to clipboard

Intermittent test failures and flaky tests

Example

Proposed Solution

Internal event emitters (producers) and consumers

xk6-browser
xk6-browser copied to clipboard