GitHub Actions: "Re-run failed jobs" will run the entire test suite
We have our e2e tests configured to run in parallel on the Cypress Dashboard.
I was following this thread, adding a custom build id to the command to distinguish different runs by build id. Everything worked fine until GitHub Actions rolled out the ability to "Re-run failed jobs".
If I just set the custom build id to `${{ github.run_id }}`, the second attempt always marks the tests as passing with 'Run Finished', but no tests are actually triggered.
So I set the custom build id to `${{ github.run_id }}-${{ github.run_attempt }}`; now it runs the entire test suite instead of only the originally allocated subset of tests.
```yaml
E2E_tests:
  runs-on: ubuntu-latest
  name: E2E tests
  strategy:
    fail-fast: false
    matrix:
      ci_node_total: [6]
      ci_node_index: [0, 1, 2, 3, 4, 5]
  timeout-minutes: 45
  steps:
    - uses: actions/checkout@v2
    - name: Use Node.js
      uses: actions/setup-node@v2
    - name: Install Dependencies
      run: npm ci
    - name: Start app
      run: make start-app-for-e2e
      timeout-minutes: 5
    - name: Cypress Dashboard - Cypress run
      run: |
        npm run cypress
```
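For reference, the build-id variants discussed above would be passed to `cypress run` via its `--ci-build-id` flag. A sketch of the last step, assuming the `cypress` npm script forwards extra arguments to `cypress run` and that the record key is stored as a repository secret:

```yaml
# Sketch only: recording in parallel with an attempt-scoped build id.
# Assumes `npm run cypress` forwards extra args to `cypress run`.
- name: Cypress Dashboard - Cypress run
  env:
    CYPRESS_RECORD_KEY: ${{ secrets.CYPRESS_RECORD_KEY }}
  run: |
    npm run cypress -- --record --parallel \
      --ci-build-id "${{ github.run_id }}-${{ github.run_attempt }}"
```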
I'm facing the same issue: on "re-run failed jobs", Cypress runs all the tests again without the parallel setup it used in the first run.
I'm having this same issue as well -- anyone found any workarounds to this?
+1 👍
That's definitely something critical given how the billing works (both Cypress's and GitHub's); it sounds like we're getting billed for tests that already passed.
@BioCarmen @ninasarabia do you have any links to runs that you could share where this is happening?
Is there any update on this? Getting charged for an entire test suite re-run when one test fails on one parallel job is really upsetting, given the size of our test suite.
I'm experiencing this also. Any updates on a fix or workaround?
We are seeing the same issue - here is our configuration
```yaml
- name: Run integration tests
  timeout-minutes: 20
  uses: cypress-io/github-action@v4
  env:
    CYPRESS_RECORD_KEY: ${{ secrets.CYPRESS_RECORD_KEY }}
    GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
  with:
    ci-build-id: ${{ needs.prepare.outputs.uuid }}
    config: baseUrl=${{ format('https://pr{0}-www.build.{1}', github.event.number, env.CBR_PROJECT_DOMAIN) }}
    wait-on: ${{ format('https://pr{0}-www.build.{1}', github.event.number, env.CBR_PROJECT_DOMAIN) }}
    wait-on-timeout: 120
    browser: chrome
    record: true
    parallel: true
    group: merge
    install: false
    working-directory: tests/web
```
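The `prepare` job referenced by `needs.prepare.outputs.uuid` is not shown; a minimal sketch of such a job (the step id and output name are assumptions) could be:

```yaml
# Hypothetical `prepare` job producing the shared build id that the
# test step above consumes via `needs.prepare.outputs.uuid`.
prepare:
  runs-on: ubuntu-latest
  outputs:
    uuid: ${{ steps.uuid.outputs.value }}
  steps:
    - id: uuid
      run: echo "value=$(uuidgen)" >> "$GITHUB_OUTPUT"
```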
Reading up on https://docs.cypress.io/guides/guides/parallelization, I think this may be a side effect of how Cypress load-balances things. Tests aren't split up evenly in a deterministic way; Cypress looks for un-run tests and distributes them to whichever workers are free. When you re-run only the failures, you are essentially just reducing the total number of workers available to process all the jobs.
so does this mean the Cypress code base might need more work to address that issue?
we're still getting over-billed because of an internal implementation detail, is that what you're saying?
I can't comment since I don't work for Cypress, but this part https://docs.cypress.io/guides/guides/parallelization#CI-parallelization-interactions makes me think that each individual container won't always run the same tests. The re-run is just interpreted as another parallel test run, but with fewer containers to run it.
it's slightly worse than this if you use the standard GitHub "Re-run failed jobs" feature (the title of this issue):
- it reruns literally every test suite
- it doesn't parallelize the rerun as it did originally
- if you had a 5 min timeout per parallel run and 5 runs, it will then fail 100% of the time because of both previous points: work that needs 25 min has to finish in under 5 min

so you're down to re-running the whole matrix (not the title of this issue), and you're billed for 25 min instead of 5 min even if only one job failed
@tebeco is it that it doesn't parallelize, or that only one of the containers failed, so all the tests get run in that one container? I'd have to try a few more times to know for certain, but I think if you have multiple containers fail it will parallelize across those containers, just with more tests per container since the "passing" containers don't do anything.
both are bad; think about the billing: one failed test should cost 1-2 min, but in my previous example you're now billed about 25x more
and that holds regardless of whether the parallel matrix is respected, since all test/run minutes are counted
i think not parallelizing the rerun would be less critical if:
- it only re-ran the failed tests
- or there were a threshold on reruns to split the work
- or it reran the same container count but only the failures in each, so that "job 1" would still be "job 1"

for now it's unpredictable, with a full rerun and full billing
There were recently some changes in our services repo that may have taken care of this issue. Can someone retest with 10.7.0 or later and post results? Thanks!
@admah I just tested this after upgrading to 10.8.0 and still saw all of the tests run in a single job when one of the parallelized containers had a failed test.
To give some more detail, the codebase I am working on uses the Cypress parallelization feature, attached to Cypress dashboard, to split our test suite into 5 different jobs. In this situation, one test failed in one of the parallelized jobs. To retry this test, I clicked the "re-run failed jobs" button in GitHub and that kicked off the Cypress tests again in the same job containing the failed test. But, instead of running the same set of tests, it re-ran all of the tests in the single job. I have included a screenshot that should hopefully illustrate this a little better.
Thanks for looking into this, it would be a huge improvement to our CI pipeline if this issue was resolved!
Agree. This fix is very much needed to optimise CI run time and avoid unnecessarily re-triggering tests that already passed in a previous attempt, thereby reducing the billing.
Yes, I also tried to replicate this last night and saw this same behavior:
> I clicked the "re-run failed jobs" button in GitHub and that kicked off the Cypress tests again in the same job containing the failed test. But, instead of running the same set of tests, it re-ran all of the tests in the single job. I have included a screenshot that should hopefully illustrate this a little better.
This by default will fail the job, because one single worker can't possibly run all of the tests before the job timeout kicks in (which is why it is parallelized in the first place). We are on 10.7.0.
I believe what would need to happen is for Cypress to remember which tests get allocated to which workers so that if there is a failure on worker 3 of 5, and "re-run failed jobs" is selected on the GHA side, the same set of tests will get re-run on that worker.
I was able to get some more clarity on this from our Cloud team. Issue #574 also has some additional context.
Here is the current status:
- Before, there was an issue where all re-runs got a PASS, regardless of actual status. This issue has been fixed.
- Currently, if a re-run is initiated, all specs get run on the machines available. That is not optimal. The Cloud team is looking into the connection between GH Actions and Cypress in order to set up re-runs to be accurate and efficient.
I will be updating this issue as new information is available.
For the issue I wrote that is linked above, it turns out that any re-run job skips every single test, regardless of whether it is re-run from start or from failed, when run against Cypress 10.9.0 and the cypress: cypress-io/[email protected] orb
@admah Any news? Thanks
All those problems could be fixed if the Dashboard worked this way for the same Dashboard run KEY.

Say 7 tests, 3 workers.

First run:
- run all tests and load-balance them across all workers
- 5/7 tests green, 2 workers failed

Next runs with the same Cypress KEY:
- check the Dashboard result for the given KEY and collect the failed tests
- run all failed tests and load-balance them on the two available runners (two workers failed, so on rerun GitHub provides only those two)

At least this would work fine with GitHub, imho.
Not sure how hard it is to implement, but it is the Dashboard side that orchestrates and sends tests to workers, so my guess would be that this should not be very hard. Unless somehow finished test suite runs cannot be updated...
> I was able to get some more clarity on this from our Cloud team. Issue #574 also has some additional context.
> Here is the current status:
> * Before, there was an issue where all re-runs got a PASS, regardless of actual status. This issue has been fixed.
@admah Does this mean that if I rerun failed workers with the same Cypress run KEY (the ci-build-id param in the cypress-io/github-action@v4 action), the Cypress action will now fail? Previously it was returning success after a few seconds without running any actual tests on the workers. I needed to build workarounds to fail in that case myself, before even triggering the Cypress action.
> * Currently, if a re-run is initiated, all specs get run on the machines available. That is not optimal. The Cloud team is looking into the connection between GH Actions and Cypress in order to set up re-runs to be accurate and efficient.
@admah You mean a rerun with the same Cypress run KEY? If so, this seems to contradict the first point.
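The kind of workaround mentioned above (failing manually before the Cypress action is even triggered on a re-run) could be sketched as a guard step like the following; the step name is made up and the exact condition is an assumption, since re-run behaviour depends on your workflow:

```yaml
# Hypothetical guard: abort re-run attempts early instead of letting
# a stale Dashboard run recorded under the same ci-build-id be
# reported as a pass without any tests actually running.
- name: Block "re-run failed jobs" attempts
  if: ${{ github.run_attempt != 1 }}
  run: |
    echo "Re-runs reuse the same ci-build-id; please re-run all jobs instead."
    exit 1
```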
@admah is there a planned release version for this yet?
> Currently, if a re-run is initiated, all specs get run on the machines available. That is not optimal. The Cloud team is looking into the connection between GH Actions and Cypress in order to set up re-runs to be accurate and efficient.
The ability to re-run failed tests is becoming more and more necessary as we scale; it's making us consider alternatives to Cypress cloud
@admah Any news on this?
@Git2DaChoppaa
According to https://www.linkedin.com/in/amu/, Adam Murray (@admah) doesn't work for Cypress.io any more.
Looking at the commit history, maybe @jaffrepaul is working for Cypress and could answer about any news? Thanks
@piotrekkr
> Looking at the commit history, maybe @jaffrepaul is working for Cypress

You are right. Just hover your mouse over the link to his username and it shows "Member of Cypress.io", which means he belongs to that organization (company). He wrote in https://github.com/cypress-io/github-action/issues/648#issuecomment-1341325209:

> There is a new team of two aiming to get ALL Cypress tools and plugins back up to date. There are a number of things needing attention in the GHA.

@MikeMcC399 Okay, good to know. Seems like they plan to do some updates to the actions, so we need to wait. Thanks