Add Playwright chart screenshot tests
Work-in-progress, following up from #261
Without any max pixel differences set we get the following errors in CI (screenshots generated locally on my machine):
/example-ssr/MultiLine    11 pixels (ratio 0.01 of all image pixels)
/example-ssr/AreaStacked 516 pixels (ratio 0.01 of all image pixels)
/example/AreaStacked     274 pixels (ratio 0.01 of all image pixels)
I was previously guessing that the ratio of pixels might be related to 'visible' (or non-background) pixels. That is not the case.
The ratio is calculated in Playwright here:
const ratio = Math.ceil(count / (expected.width * expected.height) * 100) / 100;
https://github.com/microsoft/playwright/blob/1924b51d3f536a51cc2386c119ab922250dbf783/packages/playwright-core/src/server/utils/comparators.ts#L90
For both 11 and 516 pixels the formula gives 0.01 for an image size of 800x201:
> Math.ceil(11 / (800 * 201) * 100) / 100
0.01
> Math.ceil(516 / (800 * 201) * 100) / 100
0.01
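The rounding step explains why such different counts collapse to the same value: the ratio is rounded *up* to two decimal places, so any diff count up to 1% of all pixels reports as 0.01. As a standalone sketch of Playwright's formula (the `diffRatio` name is mine, not Playwright's):

```javascript
// Recreating Playwright's ratio rounding (see comparators.ts linked above).
// The ceil-to-two-decimals step means any diff count up to 1% of the
// image's pixels is reported as a ratio of 0.01.
const diffRatio = (count, width, height) =>
  Math.ceil((count / (width * height)) * 100) / 100;

console.log(diffRatio(11, 800, 201));   // 0.01
console.log(diffRatio(516, 800, 201));  // 0.01
console.log(diffRatio(2000, 800, 201)); // 0.02
```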
With a threshold of 0.25 the pixel differences are:
/example-ssr/MultiLine    11 pixels (ratio 0.01 of all image pixels)
/example-ssr/AreaStacked 327 pixels (ratio 0.01 of all image pixels)
/example/AreaStacked      90 pixels (ratio 0.01 of all image pixels)
For a 0.3 threshold (Artifact Download):
/example-ssr/MultiLine    11 pixels (ratio 0.01 of all image pixels)
/example-ssr/AreaStacked 272 pixels (ratio 0.01 of all image pixels)
/example/AreaStacked      44 pixels (ratio 0.01 of all image pixels)
0.3 was suggested in a Playwright issue thread.
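For reference, a minimal sketch of where these knobs would go, assuming a standard `playwright.config.js` (values illustrative, not necessarily what this PR uses):

```javascript
// playwright.config.js — sketch; only the screenshot-related options shown
import { defineConfig } from '@playwright/test';

export default defineConfig({
  expect: {
    toHaveScreenshot: {
      // pixelmatch per-pixel color threshold, 0 (strict) to 1 (lax)
      threshold: 0.3,
      // absolute number of pixels allowed to differ
      maxDiffPixels: 100,
      // relative share of pixels allowed to differ
      maxDiffPixelRatio: 0.01,
    },
  },
});
```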
Trying Firefox, there are lots of failing screenshot comparisons (compared with locally generated ones), many of them minor font differences.
There are lots of discussions suggesting Chrome flags to minimise rendering differences, but these didn't change the diffs (the differing values are half pixels or percentages with multiple digits, e.g. grid line positions).
Here is an example discussion: https://stackoverflow.com/questions/79094715/disable-hardware-influence-on-playwright-tests-using-chromium-driver
Trying different headless modes or a newer version didn't solve the problem either.
This is the next one on the list, right?
Unfortunately not ready yet - I'll write more on this soon.
Currently, I am not sure it's possible to have these tests running in CI with a good developer experience: either you get incorrectly reported test failures (e.g. from slightly shifted axis labels or lines), or larger changes might be missed.
I have not found a combination of pixelmatch threshold and relative or absolute pixel counts that works well.
For example, when I merge this into #281 to test the components I have to lower the pixelmatch threshold to 0.1 to get a failing test for a clearly off Sankey chart.
I think the very light color is considered too visually similar by the algorithm.
For other components with problems, like CirclePack, the failure is detected even with a higher threshold.
However, with a low threshold of 0.1 (or even 0) there are then all the failures reported where there is a half-pixel shift like for AreaStacked and others.
Having read a lot of threads on this and tried the strategies suggested there (e.g. hiding some parts with CSS; others have applied blurring to minimize the effect of half-pixel shifts or minor browser differences), I am not sure whether this would cause too much friction.
It's annoying if PRs are failing due to unrelated changes and/or are failing locally and one has to click through to check. The workflow in a PR with having to download the report from an artifact URL is also cumbersome.
I still think that these screenshots can be extremely useful, especially when working on larger scale changes in Layer Cake.
I thought it might be useful to have the failing screenshots attached as a PR comment so that they could quickly be checked, but I'm not sure if it's possible to have this running without using an external service for hosting the images.
Another option might be to simply skip the tests when running in GitHub CI. They still would be very useful when running and developing locally.
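If we go the skip-in-CI route, one option (a sketch, relying on the `CI` environment variable that GitHub Actions sets) would be Playwright's `ignoreSnapshots` config option, which skips `toHaveScreenshot()`/`toMatchSnapshot()` assertions while still running the rest of each test:

```javascript
// playwright.config.js — sketch: run the tests everywhere, but ignore
// screenshot/snapshot assertions in CI (GitHub Actions sets CI=true)
import { defineConfig } from '@playwright/test';

export default defineConfig({
  ignoreSnapshots: !!process.env.CI,
});
```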
It might also be worth adding a screenshot test for https://layercake.graphics/components or https://layercake.graphics, as these show all (?) charts on a single page. This would speed up testing and one could probably still skim a diff quickly.
Thanks for the write-up and for wrestling with it. That makes sense to me. Let's keep it as a gut-check to make sure that the svelte-5 migration doesn't introduce glaring errors, but it seems too brittle to introduce as a CI step.
Should we wrap this up as a local test function and merge it?
I started looking into it; I think they need to go into a separate directory.
I came across jest-image-snapshot and its Playwright port and saw their usage of glur for blurring to facilitate image comparison. They also use a quite low pixelmatch threshold.
This made me think about CSS options for blurring with Playwright and that was actually quite straightforward. With some blur applied all breaking pages from #281 get correctly detected with this PR. The minor anti-aliasing pixel shifts in axis tick labels or grid lines are blurred away.
So I think this might actually work in CI as well. Give it a spin on your machine, it might require further tweaking of the blur or threshold values.
I've experimented with generating the chart screenshots in CI (if they are missing). This should minimize differences and would enable a workflow where one would delete the current screenshots if there are desired changes and they would automatically be generated.
I'm not sure if this works from a pull request (even with allowed edits), so I'd probably not directly include this in this PR, but it could be useful for tweaking the thresholds, I think.
See https://github.com/rgieseke/layercake/pull/1
I've cherry-picked the CI-generated screenshots into this PR. When I locally run the tests they pass, but fail with a lower threshold of 0.05, so the values are probably ok.
Sorry for my slowness on this one. I checked it out and it looks good! I had to add this network idle check to get the download test to pass.
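For context, a sketch of what such a network idle check typically looks like in Playwright (exact placement may differ in the actual diff): wait until no network requests are in flight before taking the screenshot, so the test doesn't race the page's resources.

```javascript
// Sketch: settle the page before the screenshot comparison
await page.goto('/example/AreaStacked'); // illustrative page
await page.waitForLoadState('networkidle');
await expect(page).toHaveScreenshot();
```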
The first time I ran it, the tests failed saying there are no tests, because there is nothing committed for darwin. Probably best for me to commit darwin Playwright screenshots. Otherwise I think this is good to merge!
@rgieseke let me know if there's anything else you want to add to this one. Otherwise, we'll merge!
I think once this is merged, I can finish the rest of the migration this week. I'd love to finish it before I'm set to give a tutorial on Friday.
Just went through it. Looks good to me! We can tweak if need be.
Awesome, and "tweak if need be" would have been my reply!