
Recover from renderer / browser crashes

Open brian-mann opened this issue 8 years ago • 80 comments

Related to #348.

It is actually possible for Cypress to implement strategies when the renderer (or browser process) crashes during a test run - something like recoverFromRendererCrashes: true by default.

There is already a mechanism for Cypress to "reload" mid-run, rebuild the state of every previous run test, skip over previously run tests, and continue with the next one in line.

In fact this is exactly what cy.visit already does under the hood.

We can utilize this same process upon a renderer / browser process crashing to continue on with the run.

So it may look something like this:

(Running Tests)

✓ test 1 - foo
✓ test 2 - bar
✓ test 3 - baz

Oh noes the renderer process crashed... we will attempt to recover

...Restarting tests at 'test 4 - quux'

✓ test 4 - quux
✓ test 5 - ipsum
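The recovery flow in the sample output above can be sketched roughly as follows. This is only an illustration, not Cypress internals: `runWithRecovery`, `runTest`, `restartBrowser`, and `maxRecoveries` are hypothetical names, and the sketch assumes a crash is reported as a `"crashed"` outcome rather than an exception.

```typescript
// Hypothetical outcome type; "crashed" stands for a renderer/browser crash.
type Outcome = "passed" | "failed" | "crashed";

// Runs tests in order. After a crash the browser is restarted and the run
// resumes at the test that crashed; tests that already finished are skipped.
async function runWithRecovery(
  tests: string[],
  runTest: (name: string) => Promise<Outcome>, // injected, assumed helper
  restartBrowser: () => Promise<void>,         // injected, assumed helper
  maxRecoveries = 3                            // guard against crash loops
): Promise<Record<string, Outcome>> {
  const results: Record<string, Outcome> = {};
  let recoveries = 0;
  let next = 0;

  while (next < tests.length) {
    const name = tests[next];
    const outcome = await runTest(name);

    if (outcome === "crashed" && recoveries < maxRecoveries) {
      recoveries += 1;
      await restartBrowser();
      continue; // `next` is unchanged, so the crashed test is retried
    }

    results[name] = outcome;
    if (outcome === "crashed") break; // out of recoveries: stop the run
    next += 1;
  }
  return results;
}
```

The retry cap is a design choice of this sketch: without it, a test that crashes the renderer every time would restart the browser forever.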

Taking this a step further, we are starting to see several patterns emerge with how and why renderer processes crash - it is almost always related to extremely long test runs in a memory starved environment (such as Docker).

It may even be a good idea for us to always preemptively "break up" headless runs by spec file.

In other words, we could have an option like restartBrowserBetweenSpecFiles: true which would automatically kill the renderer / browser process before moving on to a different spec file (but still rebuild the state of the UI correctly, and still have a single contiguous video recording).

To the user it would look like nothing is really different, but internally the renderer process would be killed and then restarted.

This would forcefully purge primed memory from the process, which could keep environments like docker from ever crashing to begin with.
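Until something like this exists inside Cypress, the effect can be approximated from outside by giving every spec file its own run. The sketch below is an assumption-laden illustration: `runSpecsInIsolation`, `RunSpec`, and `overallExitCode` are hypothetical names, and the function that actually launches a run is injected rather than shown.

```typescript
// Hypothetical signature: resolves with the exit code of one isolated run.
type RunSpec = (specFile: string) => Promise<number>;

interface SpecSummary {
  spec: string;
  exitCode: number;
}

// Runs each spec file in its own process, one after another, so the browser
// (and whatever memory it has accumulated) goes away between spec files.
async function runSpecsInIsolation(
  specs: string[],
  runSpec: RunSpec
): Promise<SpecSummary[]> {
  const summaries: SpecSummary[] = [];
  for (const spec of specs) {
    // Sequential on purpose: parallel runs would compete for the same memory.
    const exitCode = await runSpec(spec);
    summaries.push({ spec, exitCode });
  }
  return summaries;
}

// Collapses the per-spec exit codes into one exit code for CI.
function overallExitCode(summaries: SpecSummary[]): number {
  return summaries.some(s => s.exitCode !== 0) ? 1 : 0;
}
```

In practice `runSpec` could spawn `cypress run --spec <file>` as a child process. Unlike the option proposed above, this external approach does not rebuild the UI state or keep one contiguous video recording.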

Depends on: #6170

brian-mann avatar Dec 16 '16 21:12 brian-mann

We actually have these crashes halfway in a single spec and we have stalling too. I tried debugging this with strace and it seems to be constantly trying to acquire some locks. Our app seems to make the browser allocate 400+ MB of memory fast and the whole suite can go up to 2 GB... So resetting between specs might not be enough. Maybe between it/test is also an option?

Setting the --ipc=host does fix this, but I wonder what happens if two instances of the test run simultaneously. Could a clash occur?

jheijkoop avatar Dec 28 '16 14:12 jheijkoop

How could two instances of the test run occur simultaneously? If you wanted to parallelize you would do it over two different docker containers.

brian-mann avatar Dec 28 '16 17:12 brian-mann

Yes, two Docker instances. It might be an unfounded fear of a clash; I simply don't know what two Docker instances actually share with --ipc=host.

jheijkoop avatar Dec 28 '16 21:12 jheijkoop

Hi, I am running test cases on a small AWS EC2 instance and I am hitting this issue (https://on.cypress.io/renderer-process-crashed). Is there any way to avoid it?

muslim-niche avatar Jul 14 '17 09:07 muslim-niche

Did you try the --ipc=host fix?

jheijkoop avatar Jul 14 '17 09:07 jheijkoop

But I am not using Docker

muslim-niche avatar Jul 14 '17 10:07 muslim-niche

If not sandboxed, you might have multiple Chrome instances fighting over resources. What is your setup? Any concurrency? Are you open to a different setup?

jheijkoop avatar Jul 14 '17 13:07 jheijkoop

Any update on this? I'm now getting the error too: Chromium usually crashes when running a number of test suites.

khiettran avatar Nov 01 '17 04:11 khiettran

This issue has been superseded by this: https://github.com/cypress-io/cypress/issues/681

That will remove the need to recover since it fixes the problem at its core

brian-mann avatar Nov 01 '17 05:11 brian-mann

We've started hitting this fairly frequently now too

tizmagik avatar Mar 22 '18 15:03 tizmagik

I'm having this happen randomly on travis-ci with cypress 3.0.2 (I just recently started using cypress so no clue if it happened in a previous version). It might be good to add this flag even with #681 resolved.

Edit: I was able to resolve my issue by only calling .visit() once and resetting the state of the application between tests. I know that's not ideal, but it works for now.

dsherret avatar Jul 08 '18 20:07 dsherret

In hindsight my fix with --ipc=host might be related to the shared memory issue I described in https://github.com/cypress-io/cypress/issues/350 and giving the container more shared memory might resolve crashes.

jheijkoop avatar Jul 09 '18 08:07 jheijkoop

I'm also getting this issue now with cypress v3.1.0. Any updates?

jdtzmn avatar Sep 24 '18 23:09 jdtzmn

Hi cypress team!

We are also getting this error when we use cypress run as well as cypress open

We noticed that it happens more when we use cy.wait. We can consistently reproduce it when we use cy.wait with a value greater than 20000. This is on our circle-ci linux containers fyi.

mechanical-turk avatar Sep 28 '18 22:09 mechanical-turk

Hi, I'm currently trying to use Cypress in GitLab CI. I figured out most parts, except the browser crashing.

my current gitlab CI test job is the following:

test_dev:
  only:
    - dev
  stage: test
  image: cypress/base:10
  script:
    - npm i --save-dev cypress
    - $(npm bin)/cypress run --reporter junit --reporter-options "mochaFile=results_[hash].xml,toConsole=true"
  artifacts:
    paths:  
      - cypress/videos
    reports:
      junit: results_*.xml
    expire_in: 1 week

This works great when the browser doesn't crash, including test reporting in GitLab's merge requests. However, it fails about 50% of the time. As far as I know, using the --ipc=host flag is not an option in GitLab CI.

gwaihir8 avatar Oct 30 '18 15:10 gwaihir8

Have you tried increasing the shared memory instead, like I describe in https://github.com/cypress-io/cypress/issues/350 ?

jheijkoop avatar Oct 30 '18 20:10 jheijkoop

I am using shared runners on gitlab ci, and shm-size doesn't seem to be an option for shared runners. Thanks anyway

gwaihir8 avatar Oct 31 '18 07:10 gwaihir8

I think you can configure it using this documentation https://docs.gitlab.com/runner/configuration/advanced-configuration.html#the-runners-docker-section

jheijkoop avatar Oct 31 '18 13:10 jheijkoop

Hi, please provide a fix for or an explanation of this issue. It always happens on one test case (and only one test case). I do not think it has to do with memory, but there is no way to know. I was able to reproduce it locally without Docker. I think it has to do with origin (subdomain) changes. Thanks

EDIT: I just ran in debug mode; unfortunately there is no way of knowing what is causing this problem.

egucciar avatar Nov 13 '18 22:11 egucciar

Hi, we're also experiencing this issue in Kubernetes (using Jenkins as our CI engine). Would be happy to provide additional information if helpful.

ajcann avatar Jan 30 '19 19:01 ajcann

I've recently started running into the issue, as our codebase starts to acquire more dependencies. It's intermittent and unpredictable. Sometimes I get a passing test, sometimes it fails the moment it begins.

After more experimentation, I've found that using the cypress/browsers:chrome69 image instead of the cypress/base:10 made the issue go away. This issue is likely to be tied to an older version of electron being unable to handle a larger codebase, and I think more effort should go into updating electron.

jpike88 avatar Feb 16 '19 07:02 jpike88

One useful thing in the meantime would be for Cypress to have some way to tell the caller that the browser failed. Then I could re-run the test inside CI automatically. Maybe the exit code from the npm call could be different? Or there could be some other way to determine that the run failed because Chrome crashed and not because tests failed. Could this be added in the meantime? Recovery could then be done outside of Cypress.
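A CI wrapper along those lines might look like the sketch below. Everything here is hypothetical: Cypress does not provide `classifyRun` or `shouldRetry`, and `crashMarker` is whatever text the caller chooses to treat as evidence of a browser crash in the captured output, since no dedicated exit code is assumed to exist.

```typescript
type RunVerdict = "passed" | "tests-failed" | "browser-crashed";

// Classifies one finished run from its exit code and captured output.
// crashMarker is caller-supplied (an assumption of this sketch).
function classifyRun(
  exitCode: number,
  output: string,
  crashMarker: string
): RunVerdict {
  // The marker is checked before the exit code so that a run which logged a
  // crash is never reported as a pass, whatever exit code it produced.
  if (output.indexOf(crashMarker) !== -1) return "browser-crashed";
  return exitCode === 0 ? "passed" : "tests-failed";
}

// Retry only crashes, and only a bounded number of times; genuine test
// failures should surface immediately.
function shouldRetry(
  verdict: RunVerdict,
  attempt: number,
  maxAttempts: number
): boolean {
  return verdict === "browser-crashed" && attempt < maxAttempts;
}
```

Matching on output text is brittle because the wording can change between versions, which is part of why a distinct exit code would be the cleaner signal.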

mitar avatar Feb 22 '19 00:02 mitar

I think that since this issue has been made there is now a better fix for the problem by asking Chrome not to use /dev/shm. I opened #3633 for more details about this.
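For anyone who wants to try this before it lands in Cypress itself, here is a minimal sketch of adding Chromium's `--disable-dev-shm-usage` flag from the plugins file. The `BrowserInfo` and `LaunchOptions` interfaces are simplified stand-ins for what Cypress passes to `before:browser:launch`, and the sketch assumes the Cypress 4+ shape of that hook (in 3.x the second argument was a plain array of arguments).

```typescript
// Simplified stand-ins for the hook's arguments (assumed shape).
interface BrowserInfo {
  name: string;
  family: string;
}
interface LaunchOptions {
  args: string[];
}

// Real Chromium switch: use a temp directory instead of /dev/shm.
const DISABLE_DEV_SHM = "--disable-dev-shm-usage";

// Adds the switch for Chromium-family browsers, without duplicating it.
function withoutDevShm(
  browser: BrowserInfo,
  launchOptions: LaunchOptions
): LaunchOptions {
  const isChromium =
    browser.family === "chromium" && browser.name !== "electron";
  if (isChromium && launchOptions.args.indexOf(DISABLE_DEV_SHM) === -1) {
    launchOptions.args.push(DISABLE_DEV_SHM);
  }
  return launchOptions;
}

// Assumed wiring in the plugins file:
// module.exports = (on) => {
//   on("before:browser:launch", (browser, launchOptions) =>
//     withoutDevShm(browser, launchOptions))
// }
```

Electron is skipped on purpose here, because it is launched differently from Chrome and this sketch makes no claim that pushing the flag into `args` affects it.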

mitar avatar Mar 03 '19 22:03 mitar

I'm hitting this issue on a small digital ocean droplet (no docker / container). The test runs perfectly a dozen or so times and then starts crashing with this error. If I reboot the droplet it starts working again then eventually dies. Looks like a memory leak to me.

itslenny avatar Mar 26 '19 19:03 itslenny

There appears to be plenty of memory in my docker container

df -h /dev/shm
Filesystem      Size  Used Avail Use% Mounted on
shm              30G  8.0K   30G   1% /dev/shm

I'm also unable to figure out how to add the --ipc=host flag for my CircleCI build... Doesn't appear to be an option.

ccorcos avatar Mar 27 '19 20:03 ccorcos

> I am using shared runners on gitlab ci, and shm-size doesn't seem to be an option for shared runners. Thanks anyway

I have this same issue.

hhudson avatar May 26 '19 19:05 hhudson

A few days ago I started facing the same issue even though no changes were made. It's running on Travis without Docker and against a separate app that is not installed in the same code base. Interestingly, switching to --browser chrome seems to help, so it looks like the problem is related to Electron: it fails whether Electron runs headless or not. However, with Chrome, you lose the video recording. Any progress on this topic? @brian-mann

RockChild avatar May 28 '19 13:05 RockChild

I've become very impatient waiting for the Cypress folks to fix these crashing issues. In the meantime, I've created a very similar API using selenium and am having no memory issues. There's no recording of tests, but at least it's reliable. Here's a code snippet for you if you want to try it out.

import { Builder, ThenableWebDriver, By, WebElement, Key, Condition } from "selenium-webdriver"

/**
 * Wrap any promise coming from the Selenium driver so that we can
 * get stack traces that point to our code.
 */
async function wrapError<T>(p: Promise<T>) {
	const e = new Error()
	e["__wrapError"] = true
	try {
		const result = await p
		// Wait just a little bit in case the browser is about to navigate
		// or something.
		await new Promise(resolve => setTimeout(resolve, 20))
		return result
	} catch (error) {
		if (error["__wrapError"]) {
			throw error
		}
		e.message = error.message
		throw e
	}
}

async function waitFor(
	driver: ThenableWebDriver,
	fn: () => Promise<boolean | object>,
	timeout = 2000
) {
	await driver.wait(
		new Condition("wait", async () => {
			try {
				const result = await fn()
				return Boolean(result)
			} catch (error) {
				return false
			}
		}),
		timeout
	)
}

class Element {
	private promise: Promise<WebElement>
	then: Promise<WebElement>["then"]
	catch: Promise<WebElement>["catch"]

	constructor(
		public driver: ThenableWebDriver,
		promise: Promise<WebElement> | WebElement
	) {
		this.promise = Promise.resolve(promise)
		this.then = this.promise.then.bind(this.promise)
		this.catch = this.promise.catch.bind(this.promise)
	}

	/** Map in the monadic sense. */
	map(fn: (elm: WebElement) => Promise<WebElement | undefined | void>) {
		return new Element(
			this.driver,
			wrapError(
				this.promise.then(async elm => {
					const result = await fn(elm)
					if (result) {
						return result
					} else {
						return elm
					}
				})
			)
		)
	}

	waitFor(fn: (elm: WebElement) => Promise<boolean | object>) {
		return this.map(elm => waitFor(this.driver, () => fn(elm)))
	}

	mapWait(fn: (elm: WebElement) => Promise<WebElement>) {
		return this.waitFor(fn).map(fn)
	}

	click() {
		return this.map(elm => elm.click())
	}

	clear() {
		return this.map(elm => elm.clear())
	}

	type(text: string) {
		return this.map(elm => elm.sendKeys(text))
	}

	enter() {
		return this.map(elm => elm.sendKeys(Key.RETURN))
	}

	backspace() {
		return this.map(elm => elm.sendKeys(Key.BACK_SPACE))
	}

	find(selector: string) {
		return this.mapWait(elm => {
			return elm.findElement(By.css(selector))
		})
	}

	findAll(selector: string) {
		return new Elements(
			this.driver,
			this.promise.then(elm => {
				return waitFor(this.driver, () =>
					elm.findElements(By.css(selector))
				).then(() => {
					return elm.findElements(By.css(selector))
				})
			})
		)
	}

	contains(text: string) {
		return this.mapWait(elm => {
			// TODO: escape text.
			// https://stackoverflow.com/questions/12323403
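			// ".//" keeps the search relative to elm; a leading "//" would
			// search the whole document instead.
			return elm.findElement(By.xpath(`.//*[contains(text(), '${text}')]`))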
		})
	}

	clickText(text: string) {
		return this.contains(text).click()
	}
}

class Elements {
	private promise: Promise<Array<WebElement>>
	then: Promise<Array<WebElement>>["then"]
	catch: Promise<Array<WebElement>>["catch"]

	constructor(
		public driver: ThenableWebDriver,
		promise: Promise<Array<WebElement>> | Array<WebElement>
	) {
		this.promise = Promise.resolve(promise)
		this.then = this.promise.then.bind(this.promise)
		this.catch = this.promise.catch.bind(this.promise)
	}

	/** Map in the monadic sense. */
	map(
		fn: (
			elm: Array<WebElement>
		) => Promise<Array<WebElement> | undefined | void>
	) {
		return new Elements(
			this.driver,
			wrapError(
				this.promise.then(async elms => {
					const result = await fn(elms)
					if (Array.isArray(result)) {
						return result
					} else {
						return elms
					}
				})
			)
		)
	}

	waitFor(fn: (elm: Array<WebElement>) => Promise<boolean | object>) {
		return this.map(elm => waitFor(this.driver, () => fn(elm)))
	}

	mapWait(fn: (elm: Array<WebElement>) => Promise<Array<WebElement>>) {
		return this.waitFor(fn).map(fn)
	}

	clickAll() {
		return this.map(async elms => {
			await Promise.all(elms.map(elm => elm.click()))
		})
	}

	atIndex(index: number) {
		return new Element(
			this.driver,
			wrapError(
				this.promise.then(elms => {
					const elm = elms[index]
					if (!elm) {
						throw new Error("Element not found!")
					}
					return elm
				})
			)
		)
	}
}

export class Browser {
	private promise: Promise<void>
	then: Promise<void>["then"]
	catch: Promise<void>["catch"]

	constructor(public driver: ThenableWebDriver, promise?: Promise<void>) {
		this.promise = Promise.resolve(promise)
		this.then = this.promise.then.bind(this.promise)
		this.catch = this.promise.catch.bind(this.promise)
	}

	visit(route: string) {
		return new Browser(
			this.driver,
			wrapError(
				this.promise.then(async () => {
					await this.driver.get(route)
				})
			)
		)
	}

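	// NOTE: rerender() and flushTransactions() call free functions that are
	// not defined in this snippet; supply your own helpers or drop these methods.
	rerender() {
		return new Browser(this.driver, wrapError(rerender(this.driver)))
	}

	flushTransactions() {
		return new Browser(this.driver, wrapError(flushTransactions(this.driver)))
	}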

	find(selector: string) {
		return new Element(
			this.driver,
			wrapError(
				this.promise
					.then(() => {
						return waitFor(this.driver, async () =>
							this.driver.findElement(By.css(selector))
						)
					})
					.then(() => {
						return this.driver.findElement(By.css(selector))
					})
			)
		)
	}

	getClassName(className: string) {
		return this.find("." + className)
	}

	getTitle() {
		return this.driver.getTitle()
	}

	waitFor(fn: () => Promise<boolean>, timeout = 2000) {
		// Pass the timeout through; it was previously accepted but ignored.
		return new Browser(this.driver, waitFor(this.driver, fn, timeout))
	}

	waitToLeave(url: string) {
		return new Browser(
			this.driver,
			wrapError(
				waitFor(
					this.driver,
					async () => {
						const currentUrl = await this.driver.getCurrentUrl()
						return url !== currentUrl
					},
					10000
				)
			)
		)
	}

	waitForRoute(url: string) {
		return new Browser(
			this.driver,
			wrapError(
				waitFor(
					this.driver,
					async () => {
						const currentUrl = await this.driver.getCurrentUrl()
						return url === currentUrl
					},
					10000
				)
			)
		)
	}
}

ccorcos avatar May 28 '19 17:05 ccorcos

We're seeing this issue crop up on Drone, which also doesn't support the --ipc=host option. Our containers already have 16GB memory. Some notes on the behavior:

  1. Electron logs an error message when it crashes, but doesn't actually fail the test run. Our build is green despite the fact that half the tests caused a renderer crash.

  2. Chrome doesn't even log a message—it dies silently and the test run hangs forever.

  3. The crash does appear to happen at the exact same time on every run, but it's not clear what we're doing to cause it. Rearranging our test code or skipping certain tests resolves the problem temporarily, but it always creeps back in.

I haven't contributed to Cypress before, but I'd be willing to take a stab at fixing the problem if someone (@brian-mann ?) can show me where to start. My team has lost a ton of time troubleshooting this and I'd love to put it to bed.

beepboopitschloe avatar Jun 03 '19 15:06 beepboopitschloe

@nmuth Please see our contributing guide on how to start: https://github.com/cypress-io/cypress/blob/develop/CONTRIBUTING.md

Are you using version 3.3.1?

jennifer-shehane avatar Jun 04 '19 07:06 jennifer-shehane