cef icon indicating copy to clipboard operation
cef copied to clipboard

chrome: Higher rate of "connection retry failed" renderer process crashes

Open magreenblatt opened this issue 1 year ago • 3 comments

Describe the bug Chrome runtime is seeing a higher rate of "connection retry failed" renderer process crashes (code here) compared to Alloy (same web content). We should revisit the discussion in #3260 and either (a) increase the timeouts for Chrome runtime, or (b) change the FATAL to ERROR to avoid crashing. Highlights from the previous discussion:

Original comment by Bernard de Champlain

In all cases that were logged either all of the connection retries for a specific frame eventually were acknowledged (some with well over 30 seconds of delay in rapid burst of ACKs from the browser process) or none of them were. So even though in theory the connection request message isn’t guaranteed to be delivered, I’ve seen no case where some of the request weren’t received (and I have an idea why, in some cases, none were ACKed). The ACK can take a very long time to be received in the browser process UI thread where they are processed. So, I don’t see any point to increase the number of retries, I’d even think that 2 retires are good enough to be sure the first request didn’t just get lost by some very rare occurrence. I do see value in being more patient for the timeout and possible even having an extra grace period for the last retry.

I’ve tried CEF builds with FATAL “Connection retry failure” changed to ERROR status and played with the number of retries, the timeout delays and also made the last retry timeout longer to see what logs I’d get while putting the pc under strain by linking libcef.dll, initial check out Chromium 101 and CEF-101 builds while running our application (on Windows 10). In all cases, the changing of the FATAL log to ERROR status had no detrimental or visual effect except for occasional hiccups.

Trials with only changing the timeout from 4 to 10 seconds have given me perfect results (while keeping the retries at 3). I also changed the FATAL exception to ERROR status for my build. My results with windows under cpu/memory strain that would have previously triggered FATAL multiple times, didn’t trigger it once within 5 hours of strain (while running our regression playlist). All of the frame disconnections ended up properly connected before failure would have been called.

Original comment by Dmitry Azaraev

Sorry, but timeout has nothing reasonable with broken logic. Response time depends only to given system load. Imagine what you overload system to 10x times: every thread will get 10x less time slices (approx), and in sealed cases this may lead to 30x be slower. Just noting what timeouts usually configurable exactly because it’s value reasonable according to system load, but not per se / not constant value.

magreenblatt avatar Mar 07 '24 20:03 magreenblatt

The vast majority of crashes are:

[ 00 ] logging::LogMessage::~LogMessage() ( logging.cc:937 )
[ 01 ] static void CefFrameImpl::OnDisconnect(CefFrameImpl::DisconnectReason) ( frame_impl.cc:609 )
[ 02 ] mojo::InterfaceEndpointClient::NotifyError(std::__Cr::optional<mojo::DisconnectReason> const &) ( interface_endpoint_client.cc:739 )
[ 03 ] static void base::internal::Invoker<base::internal::BindState<void (mojo::internal::MultiplexRouter::*)(bool),base::internal::UnretainedWrapper<mojo::internal::MultiplexRouter,base::unretained_traits::MayNotDangle,0>,bool>,void ()>::RunOnce(class base::internal::BindStateBase *) ( bind_internal.h:920 )
[ 04 ] static void base::internal::Invoker<base::internal::BindState<void (mojo::Connector::*)(const char *, unsigned int),base::internal::UnretainedWrapper<mojo::Connector,base::unretained_traits::MayNotDangle,0>,base::internal::UnretainedWrapper<const char,base::unretained_traits::MayNotDangle,0> >,void (unsigned int)>::Run(class base::internal::BindStateBase *, unsigned int) ( bind_internal.h:933 )
[ 05 ] void base::internal::Invoker<base::internal::BindState<void (*)(const base::RepeatingCallback<void (unsigned int)> &, unsigned int, const mojo::HandleSignalsState &),base::RepeatingCallback<void (unsigned int)> >,void (unsigned int, const mojo::HandleSignalsState &)>::Run(class base::internal::BindStateBase *, unsigned int, const struct mojo::HandleSignalsState & const) ( bind_internal.h:933 )
[ 06 ] static void mojo::SimpleWatcher::OnHandleReady(int, unsigned int, const struct mojo::HandleSignalsState & const) ( simple_watcher.cc:278 )

A small minority are:

[ 00 ] logging::LogMessage::~LogMessage() ( logging.cc:938 )
[ 01 ] static void CefFrameImpl::OnDisconnect(CefFrameImpl::DisconnectReason) ( frame_impl.cc:609 )
[ 02 ] static void CefFrameImpl::OnBrowserFrameTimeout() ( frame_impl.cc:529 )
[ 03 ] base::OneShotTimer::FireNow() ( timer.cc:163 )

magreenblatt avatar Mar 08 '24 17:03 magreenblatt

It looks like disconnects can occur after browser process shutdown or crash.

For example (at M123):

  1. Run cefclient.exe --vmodule=frame_*=1 --enable-logging=stderr --no-sandbox --enable-chrome-runtime --url=https://www.youtube.com
  2. Close the window
  3. Get the following input
[4720:12084:0308/144027.832:VERBOSE1:frame_host_impl.cc(516)] frame 5-B6DA8473C9D0540FEA38676D78C8420C (main) detached (reason=NEW_MAIN_FRAME, is_connected=1)
[6068:15572:0308/144028.426:VERBOSE1:frame_impl.cc(615)] frame 5-25DB837C3184E63A52DD1739C2CD2EAB disconnected (reason=DETACHED, current_state=CONNECTION_ACKED, FRAME_INVALID)
[6068:15572:0308/144028.426:VERBOSE1:frame_impl.cc(615)] frame 5-B6DA8473C9D0540FEA38676D78C8420C disconnected (reason=BROWSER_FRAME_DISCONNECT, current_state=CONNECTION_ACKED)
[6068:15572:0308/144028.426:VERBOSE1:frame_impl.cc(628)] frame 5-B6DA8473C9D0540FEA38676D78C8420C connection retry scheduled

Repeating the above using a Debug build under system load (for example, while building Chromium) may delay execution sufficiently for the retry to trigger. Also, potentially if the render process is left behind as a zombie (see #3614).

magreenblatt avatar Mar 08 '24 19:03 magreenblatt

Related chromium-dev question about detecting browser process termination from the renderer: https://groups.google.com/a/chromium.org/g/chromium-dev/c/tFJCeTzvsnc/m/lzYhPAfkAAAJ

magreenblatt avatar Mar 08 '24 20:03 magreenblatt