site-kit-wp
Reduce running time of the VRT workflow.
Feature Description
As the VRT test suite has grown, it's taking longer to run and its runtime can currently exceed 20 minutes.
We should see if we can reduce the time it takes to run, potentially by running tests in parallel, pruning unneeded scenarios, or any other suitable measure.
Do not alter or remove anything below. The following sections will be managed by moderators only.
Acceptance criteria
- The runtime duration of the VRT test suite should be reduced.
- There is no specific target duration that it should be reduced to; what's achievable here should be informed by the IB.
Implementation Brief
- Update the backstop config properties `asyncCaptureLimit` and `asyncCompareLimit` to have much higher values that will let us run VRT tests faster during GHA runs (see the config sketch after this list).
  - Unfortunately, it is hard to say exactly which values we should use without running a number of tests on GitHub. I suggest starting with something like `100` for the capture limit and `1000` for the compare limit. If these values are too high and the VRT task runs slower than before, use smaller values. If it runs faster, try higher values to see if we can make it even faster.
  - When considering the most appropriate value for the compare limit, take into account that backstop requires approximately 100MB + 5MB/image of memory. The standard GitHub-hosted runner for public repositories has 16GB of RAM, which in theory means we could use a compare limit of up to `3180`, but some of that RAM is used by Ubuntu itself and the GHA runner. Realistically, the final value for this limit should be below `3000`.
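For reference, a minimal sketch of the relevant part of the backstop config using the starting values suggested above (the rest of the configuration is elided):

```js
// tests/backstop/config.js (sketch; only the two tuning knobs are shown).
module.exports = {
	// ...existing config: scenarios, viewports, paths, engineOptions, etc...
	asyncCaptureLimit: 100, // screenshots captured in parallel; tune against CI runs.
	asyncCompareLimit: 1000, // image comparisons run in parallel; bounded by runner RAM.
};
```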
Test Coverage
- N/A
QA Brief
Changelog entry
With the current settings, the test suite takes between 17-19 minutes.
I experimented with different numbers of parallel captures and compares; so far 5 seems to be the optimal value for asyncCaptureLimit. 4 reduces the runtime locally by 2-3 minutes, but not on CI, where it actually increases it by about 1 minute; in the best case it runs in the same time as with 5 parallel captures.
Increasing asyncCaptureLimit to 6 is the upper limit for parallel captures without introducing flakiness, but the runtime is not reduced; on the contrary, it adds 1-3 minutes, most likely due to resource constraints. Everything above 6 affects stability, introducing flaky behaviour with random test failures and longer runtimes.
The top safe value for asyncCompareLimit is 60, which has no effect on the runtime, and everything above it introduces flakiness as well. So the safe number is the current default of 50.
I checked the current reference images, and I would not recommend removing scenarios for the sake of saving a few minutes, as having them around is more beneficial.
I tested adding several Chromium flags via engineOptions, but no noticeable results were achieved.
So far, based on my experiments and a discussion with @aaemnnosttv, the most efficient solution would be splitting the scenarios into 2 or 3 batches and running them in separate jobs, but this is not something we can pursue due to the limits on the number of actions for the organization, which was an issue in the past and most likely still is.
Unassigning myself in case someone else has an idea to try something that goes beyond what I tried.
Based on what @zutigrm found out, I don't think there is a way to improve the performance just by tweaking configs.
Most VRT runs take about 12 minutes, but some (like this one) take almost half an hour for no obvious reason.
To improve the performance, I tested some alternatives to Backstop, like Playwright, but while they perform much better than Backstop locally (about half the time), on CI they are actually worse.
However, Playwright recommends sharding the test job using a matrix, so the tests would not run in a single job but across multiple jobs, let's say 4 workers in parallel.
I think this could be an option for us with Backstop, too.
This approach would require the following tasks:
- In the `visual-regression.yml` workflow, create a matrix like:
  ```yaml
  fail-fast: false
  matrix:
    shard: [1/4, 2/4, 3/4, 4/4]
  ```
- In the job, pass the matrix value to `npm run test:visualtest` and make sure it is passed through to `/bin/backstop` as an argument.
- In `/bin/backstop`, read the argument and pass it as an env var to the docker run.
- In `tests/backstop/config.js`, read this env var, slice the scenarios into X chunks (maybe 4?) and, based on the env var, pass the corresponding chunk to the backstop config (a sketch of this slicing step follows this list).
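To illustrate the last step, here is a rough sketch of the slicing logic, assuming the shard spec (e.g. `1/4`) is exposed to the config via an environment variable; the variable name `VRT_SHARD` and the `./scenarios` path are illustrative:

```js
// Hypothetical slicing helper for tests/backstop/config.js.
const scenarios = require( './scenarios' ); // existing scenario list (path assumed).

function sliceScenarios( allScenarios, shardSpec ) {
	if ( ! shardSpec ) {
		// No sharding requested; behave exactly as before.
		return allScenarios;
	}
	const [ index, total ] = shardSpec.split( '/' ).map( Number );
	const chunkSize = Math.ceil( allScenarios.length / total );
	const start = ( index - 1 ) * chunkSize;
	return allScenarios.slice( start, start + chunkSize );
}

module.exports = {
	// ...rest of the existing backstop configuration...
	scenarios: sliceScenarios( scenarios, process.env.VRT_SHARD ),
};
```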
In theory this reduces the time needed for the actions to complete, but it won't reduce the billed time and will likely even increase it a bit.
@aaemnnosttv do you have any thoughts on this?
Other observations while working on this:
- We could get rid of scenarios.js by using the buildStoriesJson Storybook option, but this would require us to migrate all stories to the Component Story Format (see the sketch after this list). The old format is still used in 35 places.
- We have a high number of delays configured that in total add up to more than one minute of just waiting. I'm unsure if all of them are needed; it seems that most of them are a workaround for backstop bugs. However, even if we could get rid of this one minute, it would not make a huge difference.
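As an illustration of the first observation, a hedged before/after sketch of migrating one of the remaining old-format stories to the Component Story Format (the component and story names are made up, and both forms are shown together only for comparison):

```js
import React from 'react';
import { storiesOf } from '@storybook/react';
import SetupBanner from '../SetupBanner'; // hypothetical component.

// Old format (storiesOf API), still used in ~35 places:
storiesOf( 'Global/Banners', module ).add( 'Setup Banner', () => (
	<SetupBanner title="Connect your account" />
) );

// Component Story Format equivalent, which buildStoriesJson can index:
export default {
	title: 'Global/Banners',
	component: SetupBanner,
};

export function SetupBannerStory() {
	return <SetupBanner title="Connect your account" />;
}
```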
@derweili Since we're an open source project, CI minutes aren't an issue; open source projects have unlimited CI minutes, I believe 😄. Either way, it shouldn't be a concern here.
Parallelising the work seems like a fine idea here, but what happens if there's a test failure regarding BackstopJS's output? Right now we output the contents of the VRT tests as an artifact of the test. Would this make four artifacts? That'll be quite annoying to deal with when debugging failures in CI, and will mean the way they're run in CI will be quite different from how they're run locally.
An easy win could be investigating whether or not we need that minute or so of "waiting", but I would suspect we do, e.g. the delays are there because the tests were otherwise flaky in CI. 🤔
What would the artifacts/outputs of this sharded test look like? A proof-of-concept PR might be a good idea here 🙂
@tofumatt I don't think it's feasible to remove the waiting time. The tests already seem to be flaky from time to time, and the wait time was added to reduce flakiness.
I created a PoC PR. Every sharded test has its own artifacts/outputs. See here https://github.com/google/site-kit-wp/actions/runs/7016701764?pr=7918
There is no "official" way to merge backstop reports, but I looked into the report structure and it seems possible to merge them manually:
- The `tests/backstop/tests/*` folders of all shards have to be merged.
- The `tests/backstop/tests/**/report.json` JSON files have to be merged (a rough sketch of this step follows this list).
- Using the merged JSON file, a custom `tests/backstop/html_report/config.js` has to be created.
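For illustration, a sketch of what merging the per-shard JSON reports could look like, assuming each report.json exposes a top-level `tests` array of per-scenario results (worth verifying against an actual shard's output); the function name and paths are hypothetical:

```js
// Hypothetical helper to merge the report.json files produced by each shard.
const fs = require( 'fs' );

function mergeReports( reportPaths, outputPath ) {
	let merged = { tests: [] };

	for ( const reportPath of reportPaths ) {
		const report = JSON.parse( fs.readFileSync( reportPath, 'utf8' ) );
		// Keep the report metadata and concatenate the test results from every shard.
		merged = { ...report, tests: merged.tests.concat( report.tests || [] ) };
	}

	fs.writeFileSync( outputPath, JSON.stringify( merged, null, 2 ) );
}

// Example usage (paths are illustrative):
// mergeReports(
//     [ 'shard-1/report.json', 'shard-2/report.json' ],
//     'tests/backstop/tests/report.json'
// );
```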
However, this needs some thought on how to merge the tests without naming conflicts. I'm wondering if implementing the report merge is worth it; if VRT fails in CI, you will likely run it locally anyway.
What do you think?
Maybe @aaemnnosttv also has an opinion on this.
@derweili That's largely true, but there have been times where a developer had issues with VRTs and being able to download the entire report was useful. It also just makes doing comparisons of failures a bit more work if they can't be reproduced locally—these have happened in the past, so I'm just wary of having the output of a test run in CI be different than local. 🤔
It even makes things like seeing how many test failures there were take more effort, because if multiple shards fail you need to go through each one and count the issues… like this https://github.com/google/site-kit-wp/actions/runs/7016701764/job/19088447317?pr=7918#step:5:1341
Also: from that proof-of-concept PR, it looks like it ends up taking over double the amount of time to run the VRTs with the sharding approach than running them as one job.
It's about 12 minutes as one job.
The sharded approach takes 25 minutes.
(I'm not sure if the 50% failure rate is related to the change or not.)
Given 2 minutes per shard is spent doing npm install and other setup tasks, the sharding approach would need to save more than two minutes per shard.
The issue was filed in August, and it looks like the tests took a lot longer to run back then.
But this is no longer the case. I don't think this is as urgent as before so I don't think we should resort to an approach as drastic as the sharding one… but in this case it looks like that's not faster anyway 🤔
The runtime is a bit misleading, because in the PoC each run builds its own Storybook, which takes about 1.5 minutes.
Because this initial setup time is incurred by each job, it is effectively doubled. I think we could improve this by building Storybook once in a build job and then sharing the dist folder with the matrix VRT jobs, but uploading and downloading artifacts isn't fast either.
So the benefit is:
- The pipeline passes faster
The drawbacks:
- In case of errors, debugging is harder as we have multiple artifacts
- Longer total run time
As @tofumatt wrote, the VRT tests are currently faster than when this ticket was created. Maybe in the future, if we have longer VRT times again, it's worth revisiting this approach. But for now I don't have another idea on how to decrease the 12-minute VRT time without introducing drawbacks for debugging.
Unassigning myself
I tried this approach in #8233, but the recommended values seem to cause instability and test failures. There are no out-of-memory issues, just test instability. 🤔
I'll try some lower values and see if it improves test run speed in CI, but I'm not sure this will work.
I tried a bunch of these but they all seem to cause test failures: https://github.com/google/site-kit-wp/pull/8233/commits
That's a pity... At least your VRT tests ran much faster than the current setup; your VRT run took only about 3 minutes.
The current VRT tests take approximately 10 minutes.
@eugene Are you planning to work on this soon? If not, can you please unassign yourself so someone else can pick it up? Thank you!
We could also audit the use of delay values set in many stories, as a lot of these are unnecessary and they add up quite quickly with the number of tests we have.
I'm taking a look at this again as it relates to my Q4 hackathon. I'm planning to work on upgrading the backstop package versions and refactoring legacy stories. Once this is done, if I have time, I can experiment and see if the VRTs can be run more reliably in parallel on newer versions.
RE my above point, #8538 still requires some other package upgrades, so it's unlikely to be updated in the short term.
Following a catch-up with @zutigrm today about what we can do here: as the approach of changing the Storybook config appears to have stalled with minimal improvement in performance, we could instead refocus this ticket on the following:
- In this ticket (and follow-ups), merge any story that has multiple variants into a single `VRT Story` that contains all variants, and only add the scenario prop on this combined story (a rough sketch follows this list).
- Add a new parameter to disable viewports, which can be passed on components such as `Badge`, `Button`, `Link`, etc., to disable the duplicate images for mobile, tablet, and desktop.
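A rough sketch of what such a combined story and viewport parameter could look like, assuming the Component Story Format; the `Badge` props, the `scenario` property, and the `vrtViewports` parameter name are illustrative rather than the project's actual API:

```js
import React from 'react';
import Badge from '../Badge'; // illustrative import path.

export default {
	title: 'Components/Badge',
	component: Badge,
};

// Single "VRT Story" rendering all variants, so only one scenario is captured.
export function VRTStory() {
	return (
		<div>
			<Badge label="Default" />
			<Badge label="With left spacing" hasLeftSpacing />
		</div>
	);
}
// Only this combined story opts into a VRT scenario.
VRTStory.scenario = {};
VRTStory.parameters = {
	vrtViewports: [ 'desktop' ], // hypothetical parameter to skip mobile/tablet captures.
};
```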
See also
- https://github.com/google/site-kit-wp/issues/10798
@aaemnnosttv Considering we have an issue for revisiting and pruning the unnecessary VRT scenarios, and there is a long list of replies from people who tried several approaches to reducing the VRT running time on this issue without reaching a workable solution, I'm thinking we can move this issue to Stalled in case we want to give it another run after the existing VRT scenarios have been cleaned up/reduced. WDYT?
cc: @benbowler
Based on my work on stability testing for backstop/E2E testing, I think I can bring some data to bear here to help make this decision. I will update the ticket early next week with my findings and take over the assignment for now.
I have now performed a detailed analysis of these attributes so we can make a decision on moving this forward based on data (details on how I did this below).
Starting with asyncCompareLimit: this attribute had a negligible effect on the run time of the suite.
The value with the lowest run time in my analysis was 90; I have added this to the IB.
For asyncCaptureLimit, values from 10 to 50 increased the run time, before the speed improved slightly again to a plateau after 80.
Digging in around the low point, I found 5 simultaneous captures to be the minimum, so I've set this in the IB.
The Analysis
In order to measure this, I created a GH action within my fork which creates and runs a test matrix for each of these arguments, running backstop three times with each configuration and storing the results in CSV artifacts for later review. You can see the analysis of the results in this sheet.
IB ✔
This doesn't need to be blocked by the node upgrade anymore so removing this.
Moving directly to approval; no QA is needed for this issue.