[Feature] Split shards via test timing data
Playwright has great default behavior around sharding tests across multiple workers. It would be very helpful if sharding could take a --timing-data file, similar to CircleCI's split-tests logic, so that Playwright can internally split tests along timing boundaries. This would let Playwright finish as fast as possible by distributing tests more optimally across multiple machines or workers.
This is the one thing that has made me hesitate switching from Cypress to Playwright despite the many advantages.
Cypress Cloud solves this with their Smart Orchestration (specifically, the "Load Balancing" strategy).
This is one of the major limitations of Playwright.
I would love to see this implemented in Playwright. It is a fairly common feature among other test platforms and tooling.
I'd also love to see this feature. I thought of creating a PR with the following additions:
- Add a '--timing-file' flag, e.g. --timing-file="report.json"
- The file holds the duration and ID of each test
- Based on the durations, we build groups with totals as close to each other as possible, then assign tests to the groups by test ID
So far I've tried using the JSON reporter and extracting this data from it, which seems to work, but it's a bit messy because of the arbitrarily nested suites you need to traverse. I'm not sure whether I'd want to generate a smaller timing file from the JSON reporter, or just use the JSON report file and extract the data in the Playwright runner. I also considered creating a completely new reporter that only records the test name, test ID, and duration.
Based on my testing so far, we can create a pretty good balance for any number of shards. More details are probably needed, but please let me know whether this is something that could be implemented; otherwise I'll create a custom external solution instead.
Thanks.
A small-scale example, current implementation:
New balance based on json report:
In most cases, since the shards run in parallel, what matters is when the last shard finishes. Having one shard finish in 35 seconds and another in 1.5 minutes is as good as having two shards both run for 1.5 minutes.
How the new shard balancing works:
- You point to a JSON report file
- From the report we extract the duration of each test by its ID
- Based on the total number of shards, we group the tests by duration
- Test groups are created from the new mapping
- New tests that are not recorded in the report are split evenly, using the current implementation
This currently works for any number of shards.
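The steps above amount to a greedy "longest test first, lightest shard next" assignment. Here is an illustrative sketch of that balancing, not the actual PR code; the { id, duration } entry shape is assumed from the --timing-file proposal above:

```typescript
// Sketch of duration-based shard balancing (greedy longest-processing-time).
// Timing entries (id + duration) are assumed to come from a recorded report.
interface TimingEntry {
  id: string;
  duration: number; // milliseconds
}

function balanceShards(tests: TimingEntry[], shardCount: number): TimingEntry[][] {
  const shards: TimingEntry[][] = Array.from({ length: shardCount }, () => []);
  const totals: number[] = new Array(shardCount).fill(0);
  // Longest tests first; each goes to the currently lightest shard.
  for (const test of [...tests].sort((a, b) => b.duration - a.duration)) {
    const lightest = totals.indexOf(Math.min(...totals));
    shards[lightest].push(test);
    totals[lightest] += test.duration;
  }
  return shards;
}
```

New tests with no recorded duration could then be appended round-robin, as the proposal suggests for unrecorded tests.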
We have implemented something similar for Currents https://docs.currents.dev/guides/pw-parallelization/playwright-orchestration
@ofirpardo-artlist how did you end up doing this? We would greatly benefit from this.
@sfrique Since Playwright closed my PR without any real explanation, I just use patch-package (https://www.npmjs.com/package/patch-package) to make the changes on the Playwright side: https://pastebin.com/vTy3k0uU
I can just upload the patch file if it will be easier to read perhaps.
I've created a custom reporter: https://pastebin.com/HpFxQKmZ
This gives me a JSON file entry for each passed test with the following: test ID, test name, and test duration (technically this could be changed to include failed/flaky tests, but I don't think that's a good approach).
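For reference, the core of such a reporter can be sketched with Playwright's reporter hooks (onTestEnd receives the per-test result, whose duration is in milliseconds). This is an illustrative sketch, not the linked pastebin code, and the output filename is made up:

```typescript
// Sketch of a timing reporter (illustrative, not the pastebin code).
// Duck-types Playwright's Reporter interface: onTestEnd(test, result)
// is called once per finished test with result.duration in ms.
import * as fs from 'fs';

class TimingReporter {
  private entries: { id: string; name: string; duration: number }[] = [];

  onTestEnd(test: { id: string; title: string }, result: { status: string; duration: number }) {
    // Record only passed tests, matching the approach described above.
    if (result.status === 'passed') {
      this.entries.push({ id: test.id, name: test.title, duration: result.duration });
    }
  }

  onEnd() {
    fs.writeFileSync('timing-report.json', JSON.stringify(this.entries, null, 2));
  }
}

export default TimingReporter;
```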
Then you can just point the timing file to that report, e.g:
npx playwright test --timing-file=/results.json
Personally, I also merge multiple reports every few CI runs for a better average, so the timing data keeps updating itself to stay accurate: https://pastebin.com/6xe0uHbk
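Merging several timing reports into a rolling average can be as simple as averaging durations per test ID. A sketch under the assumed { id, duration } entry shape (not the linked pastebin code):

```typescript
// Sketch: average durations per test ID across several timing reports,
// so occasional slow or fast runs are smoothed out.
interface Timing {
  id: string;
  duration: number;
}

function mergeTimingReports(reports: Timing[][]): Timing[] {
  const sums = new Map<string, { total: number; count: number }>();
  for (const report of reports) {
    for (const { id, duration } of report) {
      const s = sums.get(id) ?? { total: 0, count: 0 };
      s.total += duration;
      s.count += 1;
      sums.set(id, s);
    }
  }
  // Emit the mean duration per test ID.
  return [...sums].map(([id, s]) => ({ id, duration: s.total / s.count }));
}
```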
And I also use multiple shards so my merge.config.ts file looks like this:
export default {
  reporter: [
    ['./custom-reporter.ts', { outputFile: 'report.json', outputPath: 'e2e/blob-report' }],
  ],
};
Feel free to ask any questions.
@ofirpardo-artlist I've also created multiple PRs trying to improve sharding over the past couple of months, but all of them were either closed or reverted after being merged.
- https://github.com/microsoft/playwright/pull/30962
- https://github.com/microsoft/playwright/pull/33049
- https://github.com/microsoft/playwright/pull/30817
- https://github.com/microsoft/playwright/pull/31260
Let me collect the responses given here so we can have a discussion.
@pavelfeldman said: We've discussed it at length during the team meeting. The consensus was that we don't see a low-maintenance solution to the problem that would cover a meaningful number of use cases. Even the mature solutions that track execution timing and reuse this information in subsequent runs are brittle and have a tendency to rot. We would go for a lower-level solution that would allow users to take over the scheduling (test list files or API calls), but we don't want to see a new extensive API surface or file formats there. I'm following up for transparency - we are thinking about the problem, but we don't see a good maintainable solution atm.
@dgozman said: We would really like a low level API to also be able to control the order of test execution. For example, one could imagine a strategy that would "run last failures first, and only if all of them pass - run all other tests". Ideally, we want strategies like this to be implementable on top of the proposed API.
@pavelfeldman said: We are looking for the ways to cover more use cases (with hooks) with a smaller api surface / maintenance cost. For example, allow test lists for shards so that users could use third party solutions to tune their exact configuration. Or allow a callback that would take over scheduling tests. Developing those to be easy in maintenance requires consideration and time that we can't currently allocate to the problem. But we are very open to keeping this communication in case a nice proposal comes up.
@pavelfeldman said: We know that people want test lists, but we are struggling with committing to a persistent test id that would be used in those (it becomes a part of our contract). Lists are also sub-optimal for those interested in sharding, as shards have different lists. But many more customers are interested in custom failure retries where test lists are very useful, so it might be that committing to it is worth it. […] I like having lower-level primitives that allow for greater flexibility for power users to tune Playwright to their definition of perfection. Much more than having a handful of suboptimal presets that will only work for a couple of customers.
@dgozman @pavelfeldman I have to admit that I had not seen @ofirpardo-artlist's https://github.com/microsoft/playwright/pull/30388 when I started working on my own solution. I really like that PR for its simplicity…
- https://github.com/microsoft/playwright/pull/30388
Let's continue the discussion with the goal of splitting shards based on timing data.
There is a new proposal which could be helpful to allow users to implement their own sharding logic…
- https://github.com/microsoft/playwright/issues/33386
Really hope you'll manage to get it through. I personally gave up on it since they don't seem to be really interested in the feature, and I quite like my own solution using patch-package and a custom reporter. It makes everything very easy to upload to S3 and reuse from S3. It's now even simpler than the original PR I created, which took the whole JSON report and parsed it. Hope to see something merged soon 👍
After reading a lot of the comments in this repository around randomness and sharding, there are two solutions that people are asking for:
- Randomness, to verify that tests are isolated and order-independent.
- Shard balancing based on test timing.
Both are real needs, but they are very distinct; shard balancing based on test timing is, in fact, the opposite of randomness. Underneath, however, both rely on taking the test files and turning them into a nested array describing which tests run in which shard. If an interface can be agreed on, the randomness option would be a quick way to prove out the interface for balancing.
It feels as though randomness could be achieved in a much simpler way, though, e.g. by a command-line argument specifying a seed value.
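Such a seeded ordering is indeed small to implement. A sketch using a Fisher-Yates shuffle driven by a simple LCG (the --seed flag itself is hypothetical; Playwright has no such option today):

```typescript
// Sketch: deterministic Fisher-Yates shuffle driven by a numeric seed,
// e.g. taken from a hypothetical --seed CLI flag. Same seed => same order.
function seededShuffle<T>(items: T[], seed: number): T[] {
  let state = seed >>> 0;
  // Minimal linear congruential generator as the PRNG.
  const next = (): number => {
    state = (state * 1664525 + 1013904223) >>> 0;
    return state / 0x100000000; // uniform in [0, 1)
  };
  const out = [...items];
  for (let i = out.length - 1; i > 0; i--) {
    const j = Math.floor(next() * (i + 1));
    [out[i], out[j]] = [out[j], out[i]];
  }
  return out;
}
```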
Hi, I’m among those hoping this feature will eventually be integrated into Playwright. I’ve seen a lot of activity and PRs around this, and I want to express my respect for all the hard work that’s gone into it.
On my end, I've been building a tool that implements this feature without requiring any modifications to Playwright core. It also supports Vitest and Jest, not just Playwright. Repo: https://github.com/nissy-dev/tenbin
So far, I’ve only tested it with some simple cases, but it’s been working well. Please feel free to try it out if you're interested!
You might want to comment or up-vote comments on what Playwright should concentrate on in 2025:
- https://github.com/microsoft/playwright/issues/33955#issuecomment-2562679995
Hey all, I wanted to update this thread with an option that doesn't embed this functionality into Playwright itself, which we use today to provide the functionality within our projects. The general workflow is:
- Remove any usage of Playwright's internal sharding in favor of the CI provider's parallelism strategy
- Ensure all Playwright executions export results in a standard format (e.g. JUnit XML)
- Store the reports for each run, particularly against the main branch
- Invoke the CI provider's test-splitting tool
While I understand the trade-off that moving between CI providers becomes less trivial, this gives you all the functionality you're looking for now, without waiting for Playwright to support it internally. Even if sharding gained better timing support, in most cases we would still need the CI provider's test-result storage to re-gather this data. Plus, as mentioned in https://github.com/microsoft/playwright/issues/17969#issuecomment-2309619576, any given team might want to apply its own heuristic to the test results, such as using the most recent 5 results to smooth out timing data.
I would also note that, for any shops that have multiple languages, consolidating to a single report format (and tool) for test reporting is very advantageous.
Example: CircleCI
For CircleCI, you can make use of circleci tests glob and circleci tests run:
playwright:
  executor: base-executor
  resource_class: medium+
  working_directory: app
  parallelism: 4
  steps:
    - run:
        name: Run playwright against << parameters.environment >>
        command: |
          PLAYWRIGHT_COMMAND="yarn e2e:preprod ..."
          TESTFILES=$(circleci tests glob "playwright/tests/**/*.test.ts")
          echo $TESTFILES | circleci tests run --command="xargs $PLAYWRIGHT_COMMAND" --verbose --split-by=timings
    - store_test_results:
        path: app/reports
NOTE: our Playwright configuration explicitly enables JUnit reporting to a fixed output location.
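For completeness, pinning the JUnit reporter to a fixed output file looks like this in playwright.config.ts (the output path here is illustrative and must match what your CI step stores):

```typescript
// playwright.config.ts: emit JUnit XML to a fixed location so the CI
// provider can pick it up (e.g. via CircleCI's store_test_results).
import { defineConfig } from '@playwright/test';

export default defineConfig({
  reporter: [
    ['junit', { outputFile: 'reports/junit-results.xml' }],
  ],
});
```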
Example: GitHub Actions
This example action (shoutout @r7kamura) makes use of the split-test CLI to always be able to Glob a set of test files and cross-reference JUnit XML data outward. I'm sure there are a myriad of ways to replicate the same functionality.
@ffluk3 thanks for sharing your use case, which is a really helpful workaround for the current limitations of the Playwright test runner.
Am I correct to assume this implies tests are split per test file instead of per test case?
Playwright's configuration allows multiple projects (e.g. different browser versions, locales, test user accounts, etc.), which can have different globs for test files, and of course test files can contain varying numbers of tests. With current Playwright sharding we can say that 1/4 of all test cases should run on a given shard, and it does that across all project configurations and discovered tests… It looks like you lose that to some extent with your proposed workaround. WDYT?
@muhqu Yes that is totally fair - the approach we take imposes some limitations depending on the structure used, and for our use cases we stick to a single browser. I would argue that one could structure a set of projects to also run as different jobs and each use the test splitting, if that option were available, but that further removes the ability to use playwright's built-in orchestration tooling. Thus far for us, we have not seen issue with having our own orchestration on top of running single-project configurations per job, but I could see the use case.
@ffluk3 but I think what Playwright could do, to work even better the way you're using it, would be to improve the playwright test --list command to produce output that can later be consumed by playwright test. Then you would not need to split based on files and could instead split based on actual test cases. 🤔
Playwright does have a --list flag, I'll see if I can fit it better into my approach.
Here are a few options people might find helpful. "Run e2e tests split by timing (per testcase)" is the one really relevant here.
However, in this example each test case is identified by its line number in the test file, which is not ideal: if you modify your tests, the line numbers change and previous timing data is lost.
Additionally, tests are not split across browsers: each test runs against all configured browsers on the same machine rather than being distributed separately.
CircleCI Configuration
- run:
    name: Run e2e tests split by timing (per testcase)
    command: npx playwright test --list | awk -F' › ' '{print $2}' | sort -u | sed 's/:[0-9]\+$//' | circleci tests run --command="xargs npx playwright test" --verbose --split-by=timings
- run:
    name: Run e2e tests split by timing (per test file)
    command: circleci tests glob "playwright/**/*.spec.ts" | circleci tests run --command="xargs npx playwright test" --verbose --split-by=timings
- run:
    name: Run e2e tests split by shards (per testcase)
    command: SHARD="$((${CIRCLE_NODE_INDEX}+1))"; npx playwright test --shard=${SHARD}/${CIRCLE_NODE_TOTAL}
Update
The above "Run e2e tests split by timing (per testcase)" doesn't actually work, because CircleCI matches the input of the circleci tests run command against the attributes in the JUnit report, and you won't find line numbers there.
Here's what we ended up doing.
- List the Playwright tests with npx playwright test --list
- Extract the test names (everything after the second ›)
- Write these into a file (we had problems passing them to circleci tests split because of the spaces; an intermediate file worked for us)
- Split the file via circleci tests split --split-by=timings --timings-type=testname testnames.txt
- List the test cases again via npx playwright test --list, match the names from the split back to filename + line number, and pass those to npx playwright test
Here is an example with processing in node:
- run:
    name: Prepare temporary list of all playwright tests
    command: node tools/playwright/list-test-names.js | tee testnames.txt
- run:
    name: Run e2e tests split by timing (per testcase)
    command: circleci tests split --split-by=timings --timings-type=testname testnames.txt | node tools/playwright/extract-testcase-lines-by-test-names.js | xargs npx playwright test
list-test-names.js
const { execSync } = require('child_process');

const rawTestList = execSync('npx playwright test --list', { encoding: 'utf8' });

// keep only test lines and drop the first two '›'-separated segments
// (project and file:line:column), leaving just the test name
const testList = rawTestList
  .split('\n')
  .filter((line) => line.includes('›'))
  .map((line) => line.split(' › ').slice(2).join(' › '));

const finalList = testList.join('\n');
console.log(finalList);
extract-testcase-lines-by-test-names.js
const listOfAllTests = require('child_process')
  .execSync('npx playwright test --list', { encoding: 'utf8' })
  .split('\n')
  .filter((line) => line.includes('›'))
  .map((line) => line.trim());

// read the full input stream (the test names selected for this shard)
const stdinBuffer = require('fs').readFileSync(0).toString().trim();

stdinBuffer.split('\n').forEach((testCaseLine) => {
  // find the test-list entries containing this test name
  const relevantTestcases = listOfAllTests.filter((test) => test.includes(testCaseLine));
  if (relevantTestcases.length === 0) {
    throw new Error(`No test found for: ${testCaseLine}`);
  }
  if (relevantTestcases.length > 1) {
    console.log(relevantTestcases);
    throw new Error(`Multiple tests found for: ${testCaseLine}... Choose a different name`);
  }
  const relevantTestcase = relevantTestcases[0];
  // extract the "file:line:column" part of the matched entry
  const regex = /\s(\S+:\d+:\d+)/;
  const testFileWithLineFromFirstMatchgroup = relevantTestcase.match(regex)[1];
  console.log(testFileWithLineFromFirstMatchgroup);
});
I hope this workaround helps someone.
@dgozman any chance to get any further with this?
It's been over a year since I created a PR that supports duration-round-robin sharding, and several teams at our company (Adobe) are using it via patch-package.
- https://github.com/microsoft/playwright/pull/30962
Just now, I created an updated patch for playwright 1.54.1 as your users asked for it…
@dgozman Sad that the --filter implementation has been rolled back. Is there an existing alternative, or why was it removed?
@sizzle168 We've decided to replace it with --last-run-file option instead. See #37209. You can specify filterTests property there and the filter will be applied.
We assume this should be easy enough to generate with a script, by having a custom reporter and running npx playwright test --list --reporter=./myreporter to generate a list of test ids according to any criteria you'd like.
It would be great if you could give it a try by installing a canary release and share your feedback before the next stable release. Thank you!
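A minimal sketch of such a list-generating reporter might look like this. The shape of the written file (a filterTests array of test IDs) is an assumption based on the description above; check the --last-run-file documentation for the exact format:

```typescript
// Sketch: a reporter intended for `npx playwright test --list --reporter=./myreporter`
// that writes a last-run-style file containing a filterTests list of test IDs.
// The file shape is an assumption, not confirmed against the actual feature.
import * as fs from 'fs';

class FilterFileReporter {
  // Duck-types Playwright's Reporter interface: onBegin receives the root
  // suite, and suite.allTests() yields every discovered TestCase.
  onBegin(_config: unknown, suite: { allTests(): { id: string }[] }) {
    const ids = suite.allTests().map((t) => t.id);
    fs.writeFileSync('last-run.json', JSON.stringify({ filterTests: ids }, null, 2));
  }
}

export default FilterFileReporter;
```

Any selection criteria (e.g. only previously slow tests, or a per-shard subset) would go into the mapping step before writing the file.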
I am finding this a major obstacle with Playwright as well. I have seen shards unbalanced by as much as 10x, and am having to do a lot of custom work to rebalance them. I know the playwright team wants to make this pluggable so people can define their own algorithms, but in the meantime Playwright seriously underperforms. I tried adapting @muhqu's patch to 1.55.0 but ran into errors on running merge-reports, so am now forced to consider a downgrade to 1.54.1.
I have seen this recently but haven't tried it: https://docs.currents.dev/guides/ci-optimization/playwright-parallelization#playwright-orchestration. Paid service though.
@dgozman as I understand it, to implement timing-based shard balancing we essentially have to script the balancing ourselves by writing different tests into a last-run file for each shard, rather than using built-in sharding.
Does this have any implications for report merging? I'm using Playwright in GitHub Actions and upload the merged report to GitHub Pages, which is really handy. Will this still work if I switch to --last-run-file-based sharding instead of --shard=n/m?
If yes, I think I'll package up a time-balancing implementation so that it's easy to just npm install and we don't all have to write the same script :)
The solutions mentioned above for CircleCI do not work correctly (other than patching Playwright).
This is because CircleCI depends on the filename attribute in the JUnit output, but for some reason Playwright seems to consider this entirely standard field a "CI provider specific" thing and won't support it in the built-in reporter.
I think Cypress actually has a similar limitation, which we fixed using this package: https://github.com/ksocha/cypress-circleci-reporter
I've ported that reporter to Playwright here: https://npmjs.com/settings/alexstapleton/packages and it does get the bin-packing working with the CircleCI tools. I have to run it like this to get the paths to line up, but YMMV:
TESTFILES=$(circleci tests glob "packages/main/tests/e2e/**/*.spec.ts" | sed "s|^|$(pwd)/|" | circleci tests split --split-by=timings | sed "s|^$(pwd)/packages/main/||")
NODE_ENV=test pnpm --dir packages/main run playwright:run --max-failures 5 $TESTFILES
@gpaciga, I've created a new patch for playwright 1.56.1 with the changes from:
- https://github.com/microsoft/playwright/pull/30962
Complete diff: https://github.com/microsoft/playwright/compare/v1.56.1...muhqu:playwright:sharding-algorithm-v1.56.1