
Ferrum hangs with DeadBrowserError for some reason other than running out of RAM

Open geoffharcourt opened this issue 2 years ago • 2 comments

Hi,

We've been using Cuprite for a long time, and in the last year or so we've started to see an increasing rate of DeadBrowserError failures in our CI pipeline. At first we thought it was RAM, so we increased the amount of RAM on our CI containers, but the failures continued. We also started recording system RAM whenever we got a failure, and we typically have over 5GB of free RAM when the Ferrum::DeadBrowserError first appears. From that point forward, every subsequent system test on that container fails.

We tried resetting the browser with reset! and retrying the example with rspec-retry, but still no luck. We've finally gotten around to collecting logs, and I can't find anything conclusive in them.
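For reference, our retry setup looks roughly like this (a sketch following the rspec-retry README; the retry count and the :system metadata filter are ours):

require "rspec/retry"

RSpec.configure do |config|
  config.verbose_retry = true
  # Only retry when the browser itself died, not on ordinary assertion failures.
  config.exceptions_to_retry = [Ferrum::DeadBrowserError]

  config.around(:each, type: :system) do |example|
    example.run_with_retry retry: 2
  end
end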

I've captured a log file (very long; I can truncate it, but I thought the whole log might be necessary) from one of our CI containers after it stopped being able to run specs and failed with:



1st Try error in ./spec/system/teacher/views/lessons_spec.rb:759:
  There's no context available
  Browser is dead or given window is closed

https://gist.github.com/geoffharcourt/b6315e4b7a20bea9c11a0aee755196a8

A few notes that might be helpful:

The dead browser error never happens mid-example; it always occurs on the first page visited in an example.

We use percy-capybara to grab screenshots of pages mid-spec and upload the DOM and assets to Percy, and it uses the installed Chrome. I suspect it might somehow be to blame, but I've seen containers fail where Percy didn't take any snapshots (or hadn't yet taken any), and in failed runs the failure doesn't always follow an example where a snapshot was taken.

We use knapsack-pro to parallelize our test suite across multiple machines in CI. It runs RSpec several times during each machine's portion of the suite, each run covering a small subset of our code.

The examples that fail aren't in the same files, and as far as I can tell they aren't consistently preceded by the same examples either. We're really struggling to understand what's going on here; if you have any ideas at all, it would be a huge help. Thanks for your work on this project, it's really great.

geoffharcourt avatar Jun 03 '22 23:06 geoffharcourt

@geoffharcourt were you ever able to figure this out? I've started seeing the same thing. For me it only happens on GitHub Actions CI; locally it runs every time 😕

lserman avatar Aug 18 '22 20:08 lserman

@lserman we're not on GitHub Actions; we're running these test steps in a Docker container on Buildkite. We kept the debugging on for several weeks, and every failure left a huge amount of free RAM, so while this failure has been linked to running out of memory in the past, this looks like a new cause to me.

geoffharcourt avatar Aug 22 '22 09:08 geoffharcourt

After a lot of further testing, we're confident this is related to new browser tabs being opened and the wrong window handles being referenced in the following spec, but we're not 100% sure how to fix it. Our temporary solution is to monkey-patch Cuprite like this:

module Capybara::Cuprite
  class Browser
    def visit(*args)
      goto(*args)
    rescue Ferrum::DeadBrowserError, Ferrum::TimeoutError => exception
      # Log the window handles so we can see whether a stray tab is still open.
      puts "Window handles at #{exception.class} (#{Time.current}): #{window_handles}"
      puts "Restarting browser..."
      Capybara.reset!
      puts "Window handles just after restart (#{Time.current}): #{window_handles}"

      # Re-raise so the retry mechanism still sees the failure and reruns the example.
      raise exception
    end
  end
end

Since we added this change (plus some other debugging to show the window handles in previous tests), we can see this happens after a spec in which two window handles existed (from a popup or a second tab being opened). I suspect the second tab either isn't getting closed or is getting closed late, but we're not sure.
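If that's what's happening, one mitigation we may try is force-closing any leftover windows between examples. A sketch (untested) using Capybara's window API:

RSpec.configure do |config|
  config.after(:each, type: :system) do
    # Close any popup or second tab a spec opened so its handle can't
    # leak into the next example; skip the window the session is using.
    page.windows.reject(&:current?).each(&:close)
  end
end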

geoffharcourt avatar Oct 13 '22 01:10 geoffharcourt

@geoffharcourt do you have a debug log for this test when it's failing?

route avatar Oct 13 '22 06:10 route

We were running into these issues intermittently as well. We installed minitest-retry in our system test suite helper and provided those two exceptions to the configuration, as shown below. It seems to be working OK right now, but it's certainly a bit of a hack.
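Roughly this, in our test helper (a sketch of our minitest-retry setup; I'm assuming "those two exceptions" are the ones rescued in the monkey-patch above, and the retry count is illustrative):

require "minitest/retry"

# Retry only when the browser died or timed out, not on ordinary failures.
Minitest::Retry.use!(
  retry_count: 3,
  exceptions_to_retry: [Ferrum::DeadBrowserError, Ferrum::TimeoutError]
)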

lancecarlson avatar Nov 04 '22 20:11 lancecarlson

@geoffharcourt @lancecarlson a log of the issue would still be appreciated.

route avatar Nov 05 '22 05:11 route

@route what is your recommended way to log the issue? Sorry, I don't have much experience with configuring Ferrum because I wasn't the one who set up the CI initially.

lancecarlson avatar Nov 05 '22 13:11 lancecarlson

@lancecarlson You can pass in a file to write the debug log to when you create a browser: link

So for example, you could do Ferrum::Browser.new(logger: $stderr)

or

log = File.open('ferrum.log', 'w')
at_exit { log.close } # flush and close the log when the process exits

browser = Ferrum::Browser.new(logger: log)

ttilberg avatar Nov 05 '22 13:11 ttilberg

Hi @route, apologies for the wait. We had a run with several dead browser errors this morning; here's a link to the log output. It's really long, and I couldn't tell what was relevant or useful, so I'm putting it all here for now. https://gist.github.com/geoffharcourt/94f86f75d30b1e2ad90edc5e2f5cba30

geoffharcourt avatar Nov 09 '22 16:11 geoffharcourt

No problem @geoffharcourt I'll take a look

route avatar Nov 09 '22 16:11 route

FWIW, the workaround I posted above turned this from a pretty frustrating situation into one where a single retry nearly always succeeds. That seems like evidence this isn't the usual out-of-RAM problem.

geoffharcourt avatar Nov 09 '22 16:11 geoffharcourt

I went through the log file and didn't spot anything that points to the problem, but what I do think is weird is this sequence:

7.329515521000076 {"method":"Runtime.callFunctionOn","params":{"returnByValue":true,"objectId":"-8742411280284681957.2.134","functionDeclaration":"function() { return _cuprite.isVisible(this) }","arguments":[]},"id":1366}

7.335774601999901 {"method":"Fetch.enable","params":{"handleAuthRequests":true,"patterns":[{"urlPattern":"*"}]},"id":1367}

7.3531246129998635 {"method":"Target.disposeBrowserContext","params":{"browserContextId":"6DDC3F570DD6804CE57039079ABFCC73"},"id":9}

So we're working with a node in the DOM, then Fetch.enable is suddenly called again to intercept requests, and right after that the browser context is disposed.

@geoffharcourt Could you experiment with turning off the blacklist/whitelist options and network interception, and see if that makes the issue disappear?
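For example, if you register the driver yourself, something like this (a sketch; the option names come from the Cuprite README, and the values are placeholders):

Capybara.register_driver(:cuprite) do |app|
  Capybara::Cuprite::Driver.new(
    app,
    window_size: [1200, 800]
    # Temporarily disable request filtering/interception to test this theory:
    # url_blacklist: [/example-tracker/],
    # url_whitelist: [/yourapp/]
  )
end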

route avatar Nov 11 '22 12:11 route

I'll reopen this when you provide some info; without it I can't help.

route avatar Nov 20 '22 16:11 route