site-kit-wp
Reduce running time of the VRT workflow.
Feature Description
As the VRT test suite has grown, it's taking longer to run and its runtime can currently exceed 20 minutes.
We should see if we can reduce the time it takes to run, potentially by running tests in parallel, pruning unneeded scenarios, or any other suitable measure.
Do not alter or remove anything below. The following sections will be managed by moderators only.
Acceptance criteria
- The runtime duration of the VRT test suite should be reduced.
- There is no specific target duration that it should be reduced to; what's achievable here should be informed by the IB.
Implementation Brief
- Update the backstop config properties `asyncCaptureLimit` and `asyncCompareLimit` to have much higher values that will let us run VRT tests faster during GHA runs (see the config sketch after this list).
  - Unfortunately, it is hard to say exactly which values we should use without running a number of tests on GitHub. I suggest starting with something like `100` for the capture limit and `1000` for the compare limit. If these values are too high and the VRT task runs slower than before, use smaller values. If it runs faster, try higher values to see if we can make it even faster.
  - When considering the most appropriate value for the compare limit, take into account that backstop requires approximately 100MB + 5MB/image of memory. The standard GitHub-hosted runner for public repositories has 16GB of RAM, which in theory means we could use a compare limit of up to `3180`, but some of that RAM is used by Ubuntu itself and the GHA runner. Realistically, the final value for this limit should be below `3000`.
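For reference, a minimal sketch of the relevant part of the backstop config using the starting values suggested above (the rest of the configuration is elided):

```js
// tests/backstop/config.js (sketch; only the two tuning knobs are shown).
module.exports = {
	// ...existing config: scenarios, viewports, paths, engineOptions, etc...
	asyncCaptureLimit: 100, // screenshots captured in parallel; tune against CI runs.
	asyncCompareLimit: 1000, // image comparisons run in parallel; bounded by runner RAM.
};
```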
Test Coverage
- N/A
QA Brief
Changelog entry
With the current settings, the test suite takes between 17-19 minutes.
I experimented with different numbers of parallel captures and compares; so far 5 seems to be the optimal value for asyncCaptureLimit. 4 reduces the runtime locally by 2-3 minutes, but not on CI, where it actually increases it by about 1 minute; in the best case it runs in the same time as with 5 parallel captures.
Increasing asyncCaptureLimit to 6 is the upper limit for parallel captures without introducing flakiness, but the runtime is not reduced; on the contrary, it adds 1-3 minutes, most likely due to resource constraints. Everything above 6 affects stability, introducing flaky behaviour with random test failures and longer runtimes.
The top safe value for asyncCompareLimit is 60, which has no effect on the runtime, and everything above it introduces flakiness as well. So the safe number is the current default of 50.
I checked the current reference images, and I would not recommend removing scenarios for the sake of saving a few minutes, as having them around is more beneficial.
I tested adding several Chromium flags via engineOptions, but no noticeable results were achieved.
So far, based on my experiments and a discussion with @aaemnnosttv, the most efficient solution would be splitting the scenarios into 2 or 3 batches and running them in separate jobs, but this is not something we can pursue due to the limits on the number of actions for the organization, which was an issue in the past and most likely still is.
Unassigning myself in case someone else has an idea to try something that goes beyond what I tried.
Based on what @zutigrm found out, I don't think there is a way to improve the performance just by tweaking configs.
Most VRT runs take about 12 minutes, but some (like this one) take almost half an hour for no obvious reason.
To improve the performance, I tested some alternatives to Backstop, like Playwright, but while they perform much better than Backstop locally (about half the time), on CI they are actually worse.
However, Playwright recommends sharding the test job using a matrix, so the tests would not run in a single job but across multiple jobs, let's say 4 workers in parallel.
I think this could be an option for us with Backstop, too.
This approach would require the following tasks:
- In the `visual-regression.yml` workflow, create a matrix like:
  ```yaml
  fail-fast: false
  matrix:
    shard: [1/4, 2/4, 3/4, 4/4]
  ```
- In the job, pass the matrix value to `npm run test:visualtest` and make sure it is passed through to `/bin/backstop` as an argument.
- In `/bin/backstop`, read the argument and pass it as an env var to the docker run.
- In `tests/backstop/config.js`, read this env var, slice the scenarios into X chunks (maybe 4?) and, based on the env var, pass the corresponding chunk to the backstop config (a sketch of this slicing step follows this list).
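To illustrate the last step, here is a rough sketch of the slicing logic, assuming the shard spec (e.g. `1/4`) is exposed to the config via an environment variable; the variable name `VRT_SHARD` and the `./scenarios` path are illustrative:

```js
// Hypothetical slicing helper for tests/backstop/config.js.
const scenarios = require( './scenarios' ); // existing scenario list (path assumed).

function sliceScenarios( allScenarios, shardSpec ) {
	if ( ! shardSpec ) {
		// No sharding requested; behave exactly as before.
		return allScenarios;
	}
	const [ index, total ] = shardSpec.split( '/' ).map( Number );
	const chunkSize = Math.ceil( allScenarios.length / total );
	const start = ( index - 1 ) * chunkSize;
	return allScenarios.slice( start, start + chunkSize );
}

module.exports = {
	// ...rest of the existing backstop configuration...
	scenarios: sliceScenarios( scenarios, process.env.VRT_SHARD ),
};
```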
In theory this reduces the time needed for the actions to complete, but it won't reduce the billed time and will likely even increase it a bit.
@aaemnnosttv do you have any thoughts on this?
Other observations while working on this:
- We could get rid of scenarios.js by using the buildStoriesJson Storybook option, but this would require us to migrate all stories to the Component Story Format (see the sketch after this list). The old format is still used in 35 places.
- We have a high number of delays configured that in total add up to more than one minute of just waiting. I'm unsure if all of them are needed; it seems that most of them are a workaround for backstop bugs. However, even if we could get rid of this one minute, it would not make a huge difference.
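As an illustration of the first observation, a hedged before/after sketch of migrating one of the remaining old-format stories to the Component Story Format (the component and story names are made up, and both forms are shown together only for comparison):

```js
import React from 'react';
import { storiesOf } from '@storybook/react';
import SetupBanner from '../SetupBanner'; // hypothetical component.

// Old format (storiesOf API), still used in ~35 places:
storiesOf( 'Global/Banners', module ).add( 'Setup Banner', () => (
	<SetupBanner title="Connect your account" />
) );

// Component Story Format equivalent, which buildStoriesJson can index:
export default {
	title: 'Global/Banners',
	component: SetupBanner,
};

export function SetupBannerStory() {
	return <SetupBanner title="Connect your account" />;
}
```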
@derweili Since we're an open source project, CI minutes aren't an issue; open source projects have unlimited CI minutes, I believe 😄. Either way, it shouldn't be a concern here.
Parallelising the work seems like a fine idea here, but what happens if there's a test failure regarding BackstopJS's output? Right now we output the contents of the VRT tests as an artifact of the test. Would this make four artifacts? That'll be quite annoying to deal with when debugging failures in CI, and will mean the way they're run in CI will be quite different from how they're run locally.
An easy win could be investigating whether or not we need that minute or so of "waiting", but I would suspect we do, e.g. the delays are there because the tests were otherwise flaky in CI. 🤔
What would the artifacts/outputs of this sharded test look like? A proof-of-concept PR might be a good idea here 🙂
@tofumatt I don't think it's feasible to remove the waiting time. The tests already seem to be flaky from time to time, and the wait time was added to reduce flakiness.
I created a PoC PR. Every sharded test has its own artifacts/outputs. See here https://github.com/google/site-kit-wp/actions/runs/7016701764?pr=7918
There is no "official" way to merge backstop reports, but I looked into the report structure and it seems possible to merge them manually:
- The `tests/backstop/tests/*` folders of all shards have to be merged.
- The `tests/backstop/tests/**/report.json` JSON files have to be merged (a rough sketch of this step follows this list).
- Using the merged JSON file, a custom `tests/backstop/html_report/config.js` has to be created.
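For illustration, a sketch of what merging the per-shard JSON reports could look like, assuming each report.json exposes a top-level `tests` array of per-scenario results (worth verifying against an actual shard's output); the function name and paths are hypothetical:

```js
// Hypothetical helper to merge the report.json files produced by each shard.
const fs = require( 'fs' );

function mergeReports( reportPaths, outputPath ) {
	let merged = { tests: [] };

	for ( const reportPath of reportPaths ) {
		const report = JSON.parse( fs.readFileSync( reportPath, 'utf8' ) );
		// Keep the report metadata and concatenate the test results from every shard.
		merged = { ...report, tests: merged.tests.concat( report.tests || [] ) };
	}

	fs.writeFileSync( outputPath, JSON.stringify( merged, null, 2 ) );
}

// Example usage (paths are illustrative):
// mergeReports(
//     [ 'shard-1/report.json', 'shard-2/report.json' ],
//     'tests/backstop/tests/report.json'
// );
```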
However, this needs some thought on how to merge the tests without naming conflicts. I'm wondering if implementing the report merge is worth it; if VRT fails in CI, you will likely run it locally anyway.
What do you think?
Maybe @aaemnnosttv also has an opinion on this.
@derweili That's largely true, but there have been times where a developer had issues with VRTs and being able to download the entire report was useful. It also just makes doing comparisons of failures a bit more work if they can't be reproduced locally—these have happened in the past, so I'm just wary of having the output of a test run in CI be different than local. 🤔
It even makes things like seeing how many test failures there were take more effort, because if multiple shards fail you need to go through each one and count the issues… like this https://github.com/google/site-kit-wp/actions/runs/7016701764/job/19088447317?pr=7918#step:5:1341
Also: from that proof-of-concept PR, it looks like it ends up taking over double the amount of time to run the VRTs with the sharding approach than running them as one job.
It's about 12 minutes as one job.
The sharded approach takes 25 minutes.
(I'm not sure if the 50% failure rate is related to the change or not.)
Given 2 minutes per shard is spent doing npm install and other setup tasks, the sharding approach would need to save more than two minutes per shard.
The issue was filed in August, and it looks like the tests took a lot longer to run back then.
But this is no longer the case. I don't think this is as urgent as before so I don't think we should resort to an approach as drastic as the sharding one… but in this case it looks like that's not faster anyway 🤔
The runtime is a bit misleading, because in the PoC each run builds its own Storybook, which takes about 1.5 minutes.
Because this initial setup time is incurred by each job, it is effectively doubled. I think we could improve this by building Storybook once in a build job and then sharing the dist folder with the matrix VRT jobs, but uploading and downloading artifacts isn't fast either.
So the benefit is:
- The pipeline passes faster
The drawbacks:
- In case of errors, debugging is harder as we have multiple artifacts
- Longer total run time
As @tofumatt wrote, the VRT tests are currently faster than when this ticket was created. Maybe in the future, if we have longer VRT times again, it's worth revisiting this approach. But for now I don't have another idea on how to decrease the 12-minute VRT time without introducing drawbacks for debugging.
Unassigning myself
I tried this approach in #8233, but the recommended values seem to cause instability and test failures. There are no out-of-memory issues, just test instability. 🤔
I'll try some lower values and see if it improves test run speed in CI, but I'm not sure this will work.
I tried a bunch of these but they all seem to cause test failures: https://github.com/google/site-kit-wp/pull/8233/commits
That's a pity... At least your VRT tests ran much faster than the current setup; your VRT run took only about 3 minutes.
The current VRT tests take approximately 10 minutes.
@eugene Are you planning to work on this soon? If not, can you please unassign yourself so someone else can pick it up? Thank you!
We could also audit the use of delay values set in many stories, as a lot of these are unnecessary and they add up quite quickly with the number of tests we have.
I'm taking a look at this again as it relates to my Q4 hackathon. I'm planning to work on upgrading the backstop package versions and refactoring legacy stories. Once this is done, if I have time, I can experiment and see if the VRTs can be run more reliably in parallel on newer versions.
RE my above point, #8538 still requires some other package upgrades, so it's unlikely to be updated in the short term.
Following a catch-up with @zutigrm today about what we can do here: as the approach of changing the Storybook config appears to have stalled with minimal improvement in performance, we could instead refocus this ticket on the following:
- In this ticket (and follow-ups), merge any story that has multiple variants into a single `VRT Story` that contains all variants, and only add the scenario prop on this combined story (a rough sketch follows this list).
- Add a new parameter to disable viewports, which can be passed on components such as `Badge`, `Button`, `Link`, etc., to disable the duplicate images for mobile, tablet, and desktop.
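A rough sketch of what such a combined story and viewport parameter could look like, assuming the Component Story Format; the `Badge` props, the `scenario` property, and the `vrtViewports` parameter name are illustrative rather than the project's actual API:

```js
import React from 'react';
import Badge from '../Badge'; // illustrative import path.

export default {
	title: 'Components/Badge',
	component: Badge,
};

// Single "VRT Story" rendering all variants, so only one scenario is captured.
export function VRTStory() {
	return (
		<div>
			<Badge label="Default" />
			<Badge label="With left spacing" hasLeftSpacing />
		</div>
	);
}
// Only this combined story opts into a VRT scenario.
VRTStory.scenario = {};
VRTStory.parameters = {
	vrtViewports: [ 'desktop' ], // hypothetical parameter to skip mobile/tablet captures.
};
```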
See also
- https://github.com/google/site-kit-wp/issues/10798
@aaemnnosttv Considering we have an issue for revisiting and pruning the unnecessary VRT scenarios, and there is a long list of replies from people who tried several approaches to reducing the VRT running time on this issue without reaching a workable solution, I'm thinking we can move this issue to Stalled in case we want to give it another run after the existing VRT scenarios have been cleaned up/reduced. WDYT?
cc: @benbowler
Based on my work on stability testing for backstop/E2E testing, I think I can bring some data to bear here to help make this decision. I will update the ticket early next week with my findings and take over the assignment for now.
I have now performed a detailed analysis of these attributes so we can make a decision on moving this forward based on data (details on how I did this below).
Starting with asyncCompareLimit: this attribute had a negligible effect on the run time of the suite.
The value with the lowest run time in my analysis was 90; I have added this to the IB.
For asyncCaptureLimit, values from 10 to 50 increased the run time, before the speed improved slightly again to a plateau after 80.
Digging in around the low point, I found 5 simultaneous captures to be the minimum, so I've set this in the IB.
The Analysis
In order to measure this, I created a GH action within my fork which creates and runs a test matrix for each of these arguments, running backstop three times with each configuration and storing the results in CSV artifacts for later review. You can see the analysis of the results in this sheet.
IB ✔
This doesn't need to be blocked by the node upgrade anymore so removing this.
Moving directly to approval; no QA is needed for this issue.