DeadBrowserError when remote Chrome session returns HTTP 408 Request Timed Out
The Bug
So, I've been hunting this bug for almost five years. It's tough to reproduce: it's random and only shows up under heavy-traffic parallel testing, so I haven't managed to set up a minimal reproduction yet. But it boils down to this: when running system specs in parallel, SOMETIMES your Cuprite/Ferrum browser setup will crash with the notorious DeadBrowserError, and your entire spec suite will likely fail with it.
After some digging, I found that the bug happens because of an attempt to write into a closed websocket. That websocket, in turn, was closed with the error "One or more reserved bits are on: reserved1 = 1, reserved2 = 0, reserved3 = 0". But that's still not the root cause. The underlying websocket driver closes the connection because it can't parse the frame's opcode. And the opcode doesn't come from the websocket exchange proper, but rather from this little snippet of code in Ferrum:
https://github.com/rubycdp/ferrum/blob/180f29297f2be7b770b32d8a2f73343e9530e67c/lib/ferrum/client/web_socket.rb#L95-L106
It would appear that the specific packet leading to that dreaded DeadBrowserError is a timeout packet:
HTTP/1.1 408 Request Timeout\r\nContent-Type: text/plain; charset=UTF-8\r\nContent-Encoding: UTF-8\r\nAccept-Ranges: bytes\r\nConnection: keep-alive\r\n\r\n\r\nRequest has timed out
So instead of a properly formed websocket frame, we get a plaintext HTTP 408. websocket-driver can't parse it, so it closes the connection, and the next attempt to send anything through that websocket raises DeadBrowserError.
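To see why the parser complains specifically about reserved bits, it helps to decode the very first byte of that 408 response as if it were a websocket frame header. In RFC 6455, the first byte of a frame packs FIN, RSV1–RSV3, and a 4-bit opcode. The leading 'H' of "HTTP/1.1" is 0x48, which happens to have RSV1 set — exactly the error quoted above. A quick sketch:

```ruby
# Decode the first byte of a plaintext HTTP response as a websocket
# frame header (RFC 6455, section 5.2): FIN | RSV1 | RSV2 | RSV3 | opcode.
byte = "HTTP/1.1 408 Request Timeout".bytes.first # 'H' => 0x48 => 0b01001000

fin    = (byte >> 7) & 1  # 0
rsv1   = (byte >> 6) & 1  # 1 -- "reserved1 = 1", the error we saw
rsv2   = (byte >> 5) & 1  # 0
rsv3   = (byte >> 4) & 1  # 0
opcode = byte & 0x0F      # 8 -- would be a Close frame if it were valid

puts format("FIN=%d RSV1=%d RSV2=%d RSV3=%d opcode=0x%X",
            fin, rsv1, rsv2, rsv3, opcode)
```

Since reserved bits must be zero unless an extension negotiated them, the driver rejects the "frame" immediately, before it even gets to the opcode.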
I'm really unsure what should happen in a situation like this. Is this even valid behavior on Chrome's part? Should we handle it in our specs by catching and retrying? One of the more annoying aspects is that it completely taints the whole Capybara setup: you can't Capybara.reset! your way out of it, your entire spec suite is bricked, which is kinda brutal for a suite that runs for 20+ minutes.
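For the catch-and-retry angle, the general shape would be a wrapper that rescues the dead-browser error and rebuilds the session before retrying. This is just a sketch of the pattern, not Ferrum's API: DeadBrowserError here is a local stand-in for Ferrum::DeadBrowserError, and the teardown comment marks where a real suite would recreate its Capybara driver.

```ruby
# Stand-in for Ferrum::DeadBrowserError, so the sketch is self-contained.
class DeadBrowserError < StandardError; end

# Run a block, retrying up to max_attempts times if the browser dies.
def with_browser_retry(max_attempts: 2)
  attempts = 0
  begin
    attempts += 1
    yield
  rescue DeadBrowserError
    raise if attempts >= max_attempts
    # In a real suite you'd tear down and rebuild the Capybara session
    # here, since the poisoned websocket can't be reused.
    retry
  end
end
```

Whether this belongs in specs or inside Ferrum itself is exactly the open question, since the whole driver is dead at that point, not just one command.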
Let me know if you have any ideas!
Environment
It's a remote Chrome configuration using the browserless/chromium:latest Docker image and ferrum 0.17.1.
Chrome /json/version response:
{
"Browser": "Chrome/136.0.7103.25",
"Protocol-Version": "1.3",
"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/136.0.0.0 Safari/537.36",
"V8-Version": "13.6.233.4",
"WebKit-Version": "537.36 (@97d495678dc307bfe6d6475901104e262ec7a487)",
"webSocketDebuggerUrl": "ws://chrome:3333",
"Debugger-Version": "97d495678dc307bfe6d6475901104e262ec7a487"
}
The Fix
Okay, for any future people googling this, I found the culprit: it turns out the browserless/chromium container has two vaguely documented settings:
- TIMEOUT: session timeout in milliseconds
- CONCURRENT: max browser concurrency
Hit either of these limits and you'll see these malformed HTTP/1.1 408 Request Timeout packets. What's not immediately obvious is that TIMEOUT applies per browser session, and your browser session persists for the whole spec run. That's why I could only reproduce the bug on long spec suites and never on a single spec: a single spec never runs long enough to hit the timeout.
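For anyone who wants to adjust these, the settings are passed as environment variables to the container. The values and port mapping below are illustrative, not recommendations; check the browserless docs for the defaults in your image version:

```shell
# TIMEOUT: per-session timeout in milliseconds (raise it for long suites).
# CONCURRENT: max parallel browser sessions (match your parallel workers).
docker run -d \
  -e TIMEOUT=600000 \
  -e CONCURRENT=10 \
  -p 3333:3000 \
  browserless/chromium:latest
```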
That said, for my project this issue is solved. Up to you guys whether you want to handle out-of-loop websocket channel closes, which could still happen, browserless container or not.