Ferrum hangs with DeadBrowserError for some reason other than running out of RAM
Hi,
We've been using Cuprite for a long time, and in the last year or so we've started to see an increasing rate of Ferrum::DeadBrowserError failures in our CI pipeline. At first we thought it was RAM, so we increased the amount of RAM on our CI containers, but the failures continued. We also started recording system RAM at the point of failure, and we typically have over 5GB of free RAM when the Ferrum::DeadBrowserError first appears. From that point forward, all subsequent system tests on that container will fail.
We tried resetting the browser with reset! and retrying the example with rspec-retry, but still no luck. We've finally gotten around to collecting logs, and I can't find anything conclusive in them.
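For reference, the rspec-retry setup was roughly along these lines (a sketch using the gem's documented options; our exact settings may have differed):

# spec/spec_helper.rb
require 'rspec/retry'

RSpec.configure do |config|
  config.verbose_retry = true                     # report each retry in the output
  config.default_retry_count = 2                  # run a failing example up to twice
  config.exceptions_to_retry = [Ferrum::DeadBrowserError]
end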
I've captured a log file (very long; I can truncate it, but I thought the whole log might be necessary) here, from one of our CI containers after it stopped being able to run specs and failed with:
1st Try error in ./spec/system/teacher/views/lessons_spec.rb:759:
--
| There's no context available
| Browser is dead or given window is closed
https://gist.github.com/geoffharcourt/b6315e4b7a20bea9c11a0aee755196a8
A few notes that might be helpful:
The dead browser error never happens mid-example; it's always on the first page visited in an example.
We use percy-capybara to grab screenshots of pages mid-spec and upload the DOM and assets to Percy, and it uses the installed Chrome. I suspect it might somehow be to blame for what's going on, but I've seen containers fail where Percy didn't take any snapshots (or hadn't yet taken any), and in failed runs the failure doesn't always follow an example where a snapshot was taken.
We use knapsack-pro to parallelize our test suite across multiple machines in CI. It runs RSpec several times during each machine's portion of the test suite, each time on a small subset of our specs.
The examples that fail aren't in the same files, nor, from what I can tell, do they consistently have the same preceding examples run before them. We're really struggling to understand what's going on here; if you have any ideas at all, it would be a huge help. Thanks for your work on this project, it's really great.
@geoffharcourt were you ever able to figure this out? I've started seeing the same thing. For me, it's only happening on GitHub Actions CI; it runs locally every time 😕
@lserman we're not on GitHub Actions. We're running these test steps in a Docker container on Buildkite. We kept this debugging on for several weeks, and in every failure there was a huge amount of RAM left, so while this failure has been linked to that problem in the past, I think this looks like a new cause.
After a lot of further testing, we're very sure this is related to opening new browser tabs and the wrong window handles being referenced in the following spec, but we're not 100% sure how to fix it. Our temporary solution is monkey-patching Cuprite like this:
module Capybara::Cuprite
  class Browser
    def visit(*args)
      goto(*args)
    rescue Ferrum::DeadBrowserError, Ferrum::TimeoutError => exception
      puts "Window handles at #{exception.class} (#{Time.current}): #{window_handles}"
      puts "Restarting browser..."
      Capybara.reset!
      puts "Window handles just after restart (#{Time.current}): #{window_handles}"
      raise exception
    end
  end
end
Since we added this change (and some other debugging to show the window handles in previous tests), we can see this happens after a spec where two window handles existed (from a popup or a second tab being opened). I suspect that the second tab might either not be getting closed or it might be getting closed late, but we're not sure.
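If the culprit is a stray second tab, one mitigation we're looking at is closing every window except the current one after each example. This is only a sketch using plain Capybara session APIs; we haven't confirmed it addresses the root cause:

RSpec.configure do |config|
  config.after :each, type: :system do
    # Close any extra windows/tabs the example left open
    (page.windows - [page.current_window]).each(&:close)
  end
end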
@geoffharcourt do you have a debug log for this test when it's failing?
We were running into these issues intermittently as well. We installed minitest-retry in our system test suite helper and provided those two exceptions to its configuration. It seems to be working OK right now, but it's certainly a bit of a hack.
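For anyone who wants to replicate that, the setup is roughly this (a sketch; the retry count is illustrative):

# test_helper.rb
require 'minitest/retry'

Minitest::Retry.use!(
  retry_count: 3,
  exceptions_to_retry: [Ferrum::DeadBrowserError, Ferrum::TimeoutError]
)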
@geoffharcourt @lancecarlson a log captured while the issue occurs would still be appreciated.
@route what is your recommended way to log the issue? Sorry, I don't have much experience with configuring Ferrum because I wasn't the one who set up the CI initially.
@lancecarlson You can pass in a file to write the debug log to when you create a browser.
So for example, you could do Ferrum::Browser.new(logger: $stderr)
or
log = File.open('ferrum.log', 'w')
at_exit { log.close }
browser = Ferrum::Browser.new(logger: log)
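If you're driving the browser through Cuprite rather than creating a Ferrum::Browser directly, the same option can go in the driver registration, since Cuprite forwards logger to Ferrum:

log = File.open('ferrum.log', 'w')
at_exit { log.close }

Capybara.register_driver :cuprite do |app|
  Capybara::Cuprite::Driver.new(app, logger: log)
end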
Hi @route, apologies for the wait. We had a run with several dead browser errors this morning. Here's a link to the log output. It's really long; I couldn't find what was relevant or useful, so I'm just putting it all here for now. https://gist.github.com/geoffharcourt/94f86f75d30b1e2ad90edc5e2f5cba30
No problem @geoffharcourt I'll take a look
FWIW, the workaround I posted above turned this from a pretty frustrating situation into one where a single retry nearly always gets past the failure. That seems like further evidence of an issue that isn't the usual out-of-RAM problem.
I went through the log file and didn't spot anything that points directly at the problem, but what I find weird are these lines:
7.329515521000076 {"method":"Runtime.callFunctionOn","params":{"returnByValue":true,"objectId":"-8742411280284681957.2.134","functionDeclaration":"function() { return _cuprite.isVisible(this) }","arguments":[]},"id":1366}
7.335774601999901 {"method":"Fetch.enable","params":{"handleAuthRequests":true,"patterns":[{"urlPattern":"*"}]},"id":1367}
7.3531246129998635 {"method":"Target.disposeBrowserContext","params":{"browserContextId":"6DDC3F570DD6804CE57039079ABFCC73"},"id":9}
So we're working with a node in the DOM, then Fetch.enable is suddenly called again to intercept requests, and after that the browser context is disposed.
@geoffharcourt Could you try experimenting with turning off the blacklist/whitelist options and request interception, and see if that makes the issue disappear?
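For example, if your driver registration currently sets those options, a registration without them (a sketch; option names as in Cuprite's README) would take interception out of the picture:

Capybara.register_driver :cuprite do |app|
  # No url_blacklist / url_whitelist, so Cuprite presumably won't need
  # Fetch.enable to filter requests
  Capybara::Cuprite::Driver.new(app)
end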
I'll reopen when you provide some info; without it I can't help.