cloudflared Fix flaky tests

This PR should potentially eliminate (or at least greatly reduce) testActiveSessionNotClosed flakiness at the Github CI. As described in https://github.com/cloudflare/cloudflared/pull/766#issuecomment-1264874623, this is caused by an eager timeout due to insufficient CI CPU resources.

I didn't maintain the same ratio of activeTime = closeAfterIdle * 5 as it'd make tests run ~50% longer while having potentially no difference in behavior besides running for a longer period of time.

Side effect: fixed datagramsession tests are now ~1.5s longer - 1.026s previously vs ~2.513s with this PR.

This PR also fixes race condition in TestConnections caused by concurrent read/write operations to shared variable replayer by newly created goroutines, which I got on CI after trying to fix the testActiveSessionNotClosed, CI logs are here, Copy for the future reference is here, as CI logs have limited lifetime.

Oct 03 '22 10:10 moofMonkey

There are 2 flaky tests left that I see, but I don't understand their problems due to the lack of cloudflared internals knowledge.

TestManagerCtxDoneCloseSessions: here, here and here - 3 out of 10 tests
TestDatagram: here - 1 out of 10 tests, looks like CI environment specific

Tests fixed by this PR doesn't seem to be flaky anymore given 10 runs out of which they didn't fail once.

Oct 03 '22 11:10 moofMonkey

TestManagerCtxDoneCloseSessions had a race condition due to select not having guaranteed execution order in Golang. Citing the spec, https://go.dev/ref/spec#Select_statements

If one or more of the communications can proceed, a single one that can proceed is chosen via a uniform pseudo-random selection.

https://github.com/cloudflare/cloudflared/blob/bad2e8e812d898c32b0aacc66a6f947109eaf24a/datagramsession/manager_test.go#L196-L197 https://github.com/cloudflare/cloudflared/blob/bad2e8e812d898c32b0aacc66a6f947109eaf24a/datagramsession/session.go#L64-L68 https://github.com/cloudflare/cloudflared/blob/bad2e8e812d898c32b0aacc66a6f947109eaf24a/datagramsession/session.go#L86-L89 https://github.com/cloudflare/cloudflared/blob/bad2e8e812d898c32b0aacc66a6f947109eaf24a/datagramsession/manager.go#L69-L70 https://github.com/cloudflare/cloudflared/blob/bad2e8e812d898c32b0aacc66a6f947109eaf24a/datagramsession/manager.go#L88-L95

There are two possible ways to fix this - either remove require.False(t, closedByRemote) check, or check for context.Canceled in manager.shutdownSessions(). I did it the second way, but that could be changed if needed. I made a lot more CI runs and it's not flaky anymore.

Other tests that were addressed and didn't fail in at least last 15 CI runs:

TestLongSiteWithDictionaries was flaky, CI logs are here. That was caused by insufficient CPU resources. I attempted to fix it by using errGroup.SetLimit().
TestConnections was still flaky. CI logs are here. I attempted to fix it by adding a waitgroup for pipedRequestBody case, and the previous fix isn't needed anymore as goroutine should be done at the time of the test ending, and tests aren't being run in parallel.
TestDatagram flakiness still depends on the CI environment, fresh CI logs are here, this only occurs on the MacOS CI. I attempted to fix it by failing the test instead of timing it out on error and by waiting for the MTU discovery for a bit longer (300ms instead of 100ms).
TestQUICServer was timing out flaky, CI logs are here. I attempted to fix it by failing the test instead of timing it out on error in case the server doesn't finish, by waiting for the server to start.

Other tests that weren't addressed after the last fail:

TestGracefulShutdownHTTP2 is flaky, CI logs are here and here. This is probably caused by a race condition between Observer.dispatchEvents() and the test using eventCollectorSink.assertSawEvent() where Observer.dispatchEvents() doesn't get enough CPU time to trigger eventCollectorSink.OnTunnelEvent().
testActiveSessionNotClosed is still a bit flaky, CI logs are here and here. The cause is still CPU limits. It's pretty rare and closeAfterIdle has to be increased to avoid that, but I don't know if it's worth it.

mockOriginProxyWithRequest.ProxyHTTP() also sleeps for just 5ns in /slow_echo_body due to time.Sleep(5), and duration (such as time.Millisecond) is probably meant to be specified there.

Oct 04 '22 06:10 moofMonkey

TestGracefulShutdownHTTP2 didn't fail in last 30 CI runs, other tests didn't fail in last 46 CI runs. That's 0 failed CI runs out of 30 compared to 25 failed CI runs out of 30 most recent ones on master.

I think that's good enough, this PR is ready for review now.

Oct 04 '22 08:10 moofMonkey

TestQUICServer is still a bit flaky. CI logs are here. This is caused by partial read from stream. io.ReadFull should solve that.

Oct 05 '22 00:10 moofMonkey

30 more CI runs with 578888e3 and the only flaky test was TestProxySingleOrigin for once, CI logs are here. I don't understand what caused it to time out, but it doesn't seem that was caused by my changes. TestQUICServer shouldn't be flaky with this commit anymore.

Oct 05 '22 03:10 moofMonkey

@moofMonkey thank you for looking into this. We have improved the tests over the last 2 years. Feel free to rebase and open again if they are still flaky.

Mar 05 '24 17:03 chungthuang