eclipse.platform.swt icon indicating copy to clipboard operation
eclipse.platform.swt copied to clipboard

Browser tests: reduce timeout for instantiation to reasonable value

Open HeikoKlare opened this issue 1 year ago • 11 comments

The timeout for browser instantiation in the browser tests has been increased to avoid flaky tests (via #1677). However, browser instantiation taking more than 10 seconds (the previous timeout value) is an indicator for something being wrong with the instantiation. There is no reason why instantiation may take that long during test execution but the same issue should never arise in productive use. Thus, such a case should be captured by tests and, if necessary, be fixed productively.

In order to do so, this change reduces the test timeout value again to a more reasonable value.

Related to https://github.com/eclipse-platform/eclipse.platform.swt/issues/1676

Also see https://github.com/eclipse-platform/eclipse.platform.swt/pull/1677#issuecomment-2567437084

HeikoKlare avatar Jan 06 '25 13:01 HeikoKlare

Test Results

   545 files  +   480     545 suites  +480   35m 18s ⏱️ + 35m 17s  4 374 tests + 4 353   4 356 ✅ + 4 335   18 💤 + 18  0 ❌ ±0  16 638 runs  +16 533  16 498 ✅ +16 393  140 💤 +140  0 ❌ ±0 

Results for commit 05b9b28c. ± Comparison against base commit 8c937588.

:recycle: This comment has been updated with latest results.

github-actions[bot] avatar Jan 06 '25 13:01 github-actions[bot]

Builds sporadically show timeouts, e.g.:

Error:  Tests run: 380, Failures: 1, Errors: 0, Skipped: 12, Time elapsed: 193.3 s <<< FAILURE! -- in org.eclipse.swt.tests.junit.Test_org_eclipse_swt_browser_Browser
Error:  org.eclipse.swt.tests.junit.Test_org_eclipse_swt_browser_Browser.test_BrowserFunction_callback[browser flags: 0] -- Time elapsed: 66.79 s <<< FAILURE!
java.lang.AssertionError: creating browser took too long: 65267ms
	at org.junit.Assert.fail(Assert.java:89)
	at org.junit.Assert.assertTrue(Assert.java:42)
	at org.eclipse.swt.tests.junit.Test_org_eclipse_swt_browser_Browser.createBrowser(Test_org_eclipse_swt_browser_Browser.java:312)
	at org.eclipse.swt.tests.junit.Test_org_eclipse_swt_browser_Browser.setUp(Test_org_eclipse_swt_browser_Browser.java:185)
	at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)

Might be related to https://github.com/eclipse-platform/eclipse.platform.ui/issues/2734

HeikoKlare avatar Jan 19 '25 15:01 HeikoKlare

I rebased on master and solved the conflicts. Let's see if there are still (too many) test failures.

@HeikoKlare according to your comment (https://github.com/eclipse-platform/eclipse.platform.swt/pull/1752#issuecomment-2623933890) this PR might need to be postponed until #1752 is merged, correct?

fedejeanne avatar Jan 30 '25 14:01 fedejeanne

This PR should actually not need any other PR at all. The failing tests indicate that there is a bug in either the browser implementation or the test setup that are not aware of yet, so something we need to find out independent from any other PR (as I already mentioned in https://github.com/eclipse-platform/eclipse.platform.swt/pull/1677#issuecomment-2567437084).

HeikoKlare avatar Jan 30 '25 15:01 HeikoKlare

What I mean is that "if we merge this PR after the ones that are already open then it is to be expected that there are no/less test failures due to timeouts" . But I see that you want to use this PR as a basis to check if the other PRs actually help (reduce/get rid of the test failures because of timeouts and freezes).

So, shall we merge this one then?

fedejeanne avatar Jan 31 '25 07:01 fedejeanne

"if we merge this PR after the ones that are already open then it is to be expected that there are no/less test failures due to timeouts"

As we know, that's not the case. We still need to find the issue for the browser instantiation in tests taking so long. All the fixes we currently make regarding timeouts are only supposed to avoid that the actual errors (which we may not know yet) do not lead to a blocking UI. They do not properly solve the underlying issue.

So, shall we merge this one then?

I am not sure if people will be happy about that. Most Windows builds will fail with test failures then. So I wouldn't do that but rather find out the reasons for the failure first.

HeikoKlare avatar Jan 31 '25 07:01 HeikoKlare

Note: I just confirmed that it's always about the first browser testing being executed, no matter which test that is. I added an ignore to the two tests that are usually executed first and then the new first executed test fails.

Additional note: first test failing even happens when making the Tycho build on GH Actions only execute AllBrowserTests. So it is not about any leftover from teste executed before the browser tests.

HeikoKlare avatar Jan 31 '25 13:01 HeikoKlare

Note: the thread dump written after 60 seconds of test execution time show that the usually the main thread is stuck in display,sleep() / OS.WaitMessage() while waiting for browser instantiation:

[2025-02-01 17:15:07 +0000] org.eclipse.swt.tests.junit.Test_org_eclipse_swt_browser_Browser.test_BrowserFunction_multiprocess[browser flags: 0]() ran for more than 60 seconds
totalMemory:             146800640
freeMemory (before GC):   96621696
freeMemory (after GC):    58666000
"main" prio=5 Id=1 RUNNABLE (in native)
	at app//org.eclipse.swt.internal.win32.OS.WaitMessage(Native Method)
	at app//org.eclipse.swt.widgets.Display.sleep(Display.java:4803)
	at app//org.eclipse.swt.browser.Edge.processOSMessagesUntil(Edge.java:471)
	at app//org.eclipse.swt.browser.Edge$WebViewProvider.getWebViewWrapper(Edge.java:370)
	at app//org.eclipse.swt.browser.Edge$WebViewProvider.getWebView(Edge.java:381)
	at app//org.eclipse.swt.browser.Edge.getUrl(Edge.java:894)
	at app//org.eclipse.swt.browser.Browser.getUrl(Browser.java:771)
	at app//org.eclipse.swt.tests.junit.Test_org_eclipse_swt_browser_Browser.createBrowser(Test_org_eclipse_swt_browser_Browser.java:308)

One might think that there is some wake message, but the actual issue is that the OS message triggering the WebView instantiation callback did actually not arrive. When adding an explicit timed, repeated awakening, the display still goes into sleep repeatedly since no message arrived from the OS so far. In addition, adding debug output in the loop waiting for the OS message shows that the loop is spinned quite often.

Further finding: Once might also think that there is some message scheduled asynchronously on the display, which is not processed by the processOSMessagesUntil() (as it only processed OS messages) but may need to be processed (for whatever reason) to proceed with the processing the queue. In that case, whoever, display.sleep() would always returning immediately, as the display's synchronizer queue is not empty. Debug output also shows that this is usually not the case: the display properly goes to sleep in the scenarios where browser instantiation takes long, so there is actually nothing in the event queue (neither from the OS nor scheduled at the display by the application).

So the underlying issue is that the OS does not call the WebView instantiation callback for a long time (> 60 seconds). It is also interesting that usually after 65 seconds the callback is finally executed.

HeikoKlare avatar Feb 01 '25 19:02 HeikoKlare

I've also been looking at it and all I came up with is:

  1. It's true that the problem is the callback is simply being called after ~65s (the delay I've seen is pretty consistent, always between 65.2s and 65.7s).
  2. I've seen the callback take the ~65s delay "only" 1-2 times per CI-run (and if it does it twice, the 2nd time is usually in the next 30 seconds after the 1st one).

Regarding point nr. 1, I used this commit to log the request and the callback and I saw this output:

1st delay: 65.237.957.400 ns

1356349440500 - ICoreWebView2Environment.CreateCoreWebView2Controller()
...
1421587397900 - Edge.createControllerInitializationCallback()

**2nd delay: 65.262.087.700 ns **

1423716947100 - ICoreWebView2Environment.CreateCoreWebView2Controller()
...
1488979034800 - Edge.createControllerInitializationCallback()

A quick internet search revealed that Instantiating Edge can be slow, but what I still don't get is:

  • Why is it sometimes slow and not all the time ?
  • Why is the delay consistently something ~65s ?

@HeikoKlare let's discuss how to proceed with this PR in today's daily.

fedejeanne avatar Feb 04 '25 08:02 fedejeanne

FYI there is a 2nd commit in this PR, one is from https://github.com/eclipse-platform/eclipse.platform.swt/pull/1797

fedejeanne avatar Feb 05 '25 16:02 fedejeanne

Yes, that was intended for testing purposes. We still see issues with WebView intialization sometimes taking long, so this is not ready for being merged anyway. I thus changed it back to "draft".

HeikoKlare avatar Feb 05 '25 16:02 HeikoKlare

Do you plan to finish/submit this one?

akurtakov avatar Jul 29 '25 13:07 akurtakov

Unfortunately, we did not find any way to ensure that WebView2 instantiation on Windows always works in a timely manner, especially in test setups. We came across some issue reports that indiciate that this seems to be a common issue with WebView2. Since we did not face any issue with non-timely instantiation of WebView2 in production so far, I don't see any real need and any way forward here. Thus closing it.

HeikoKlare avatar Jul 29 '25 14:07 HeikoKlare