layercake icon indicating copy to clipboard operation
layercake copied to clipboard

Add Playwright chart screenshot tests

Open rgieseke opened this issue 8 months ago • 15 comments

Work-in-progress, following up from #261

rgieseke avatar Apr 28 '25 09:04 rgieseke

Without any max pixel differences set we get the following errors in CI (screenshots generated locally on my machine):

/example-ssr/MultiLine 11 pixels (ratio 0.01 of all image pixels) /example-ssr/AreaStacked 516 pixels (ratio 0.01 of all image pixels /example/AreaStacked 274 pixels (ratio 0.01 of all image pixels)

rgieseke avatar Apr 29 '25 08:04 rgieseke

I was previously guessing that the ratio of pixels might be related to 'visible' (or non-background) pixels. That is not the case.

The ratio is calculated in Playwright here:

  const ratio = Math.ceil(count / (expected.width * expected.height) * 100) / 100;

https://github.com/microsoft/playwright/blob/1924b51d3f536a51cc2386c119ab922250dbf783/packages/playwright-core/src/server/utils/comparators.ts#L90

For both 11 and 561 pixels the formula gives 0.01 for an image size of 800x201:

> Math.ceil(11 / (800 * 201) * 100) / 100
0.01
> Math.ceil(516 / (800 * 201) * 100) / 100
0.01

rgieseke avatar Apr 29 '25 10:04 rgieseke

With a threshold of 0.25 the pixel differences are:

/example-ssr/MultiLine 11 pixels (ratio 0.01 of all image pixels) /example-ssr/AreaStacked 327 pixels (ratio 0.01 of all image pixels) /example/AreaStacked 90 pixels (ratio 0.01 of all image pixels)

rgieseke avatar Apr 29 '25 11:04 rgieseke

For a 0.3 threshold (Artifact Download):

/example-ssr/MultiLine 11 pixels (ratio 0.01 of all image pixels) /example-ssr/AreaStacked 272 pixels (ratio 0.01 of all image pixels) /example/AreaStacked 44 pixels (ratio 0.01 of all image pixels)

0.3 was suggested in this Playwright issue thread

rgieseke avatar Apr 29 '25 11:04 rgieseke

Trying Firefox, it has lots of failing screenshot comparisons (compared with locally generated ones). Many minor font differences.

rgieseke avatar May 04 '25 08:05 rgieseke

There are lots of discussions to use chrome flags to minimise rendering differences but this didn't change the differences (being set as half-pixel values or percentages with multiple digits, e.g. grid line position).

Here is an example discussion: https://stackoverflow.com/questions/79094715/disable-hardware-influence-on-playwright-tests-using-chromium-driver

Trying different headless modes or a newer version didn't solve the problem either.

rgieseke avatar May 04 '25 12:05 rgieseke

This is the next one on the list right?

mhkeller avatar May 13 '25 22:05 mhkeller

This is the next one on the list right?

Unfortunately not ready yet - I'll write more on this soon.

rgieseke avatar May 14 '25 17:05 rgieseke

Unfortunately not ready yet - I'll write more on this soon.

Currently, I am not sure if it's possible to have these test running in CI with a good developer experience, either one gets incorrectly reported test failures (like from slightly shifted axis labels or lines) or larger changes might be missed. I have not found a combination of pixelmatch threshold or relative or absolute pixel counts that works well.

For example, when I merge this into #281 to test the components I have to lower the pixelmatch threshold to 0.1 to get a failing test for a clearly off Sankey chart.

image

I think the very light color is considered too visually similar by the algorithm.

For other components with problems, like with CirclePack, the failure is detected, even with a higher threshold.

image

However, with a low threshold of 0.1 (or even 0) there are then all the failures reported where there is a half-pixel shift like for AreaStacked and others.

Having read a lot of threads on this and strategies tried, e.g. hiding some parts with CSS or people have tried to apply blurring to minimize the effect of half-pixel shifts or minor browser differences, I am not sure if this would cause too much friction.

It's annoying if PRs are failing due to unrelated changes and/or are failing locally and one has to click through to check. The workflow in a PR with having to download the report from an artifact URL is also cumbersome.

I still think that these screenshots can be extremely useful, especially when working on larger scale changes in Layer Cake.

I thought it might be useful to have the failing screenshots attached as a PR comment so that they could quickly be checked, but I'm not sure if it's possible to have this running without using an external service for hosting the images.

Another option might be to simply skip the tests when running in GitHub CI. They still would be very useful when running and developing locally.

It might also be worth adding a screenshot test for https://layercake.graphics/components or https://layercake.graphics as there all (?) charts on a single page. This would speed up the testing and one probably could still skim a diff quickly.

rgieseke avatar May 15 '25 12:05 rgieseke

Thanks for the write-up and for wrestling with it. That makes sense to me. Let's keep it as a gut-check to make sure that the svelte-5 migration doesn't introduce glaring errors but, it seems to brittle to introduce as a CI step.

mhkeller avatar May 15 '25 15:05 mhkeller

Should we wrap this up as a local test function and merge it?

mhkeller avatar May 23 '25 16:05 mhkeller

Should we wrap this up as a local test function and merge it?

I started looking into it, they need to go into a separate directory I think.

rgieseke avatar May 25 '25 20:05 rgieseke

I came across jest-image-snapshot and its Playwright port and saw their usage of glur for blurring to facilitate image comparison. They also use a quite low pixelmatch threshold.

This made me think about CSS options for blurring with Playwright and that was actually quite straightforward. With some blur applied all breaking pages from #281 get correctly detected with this PR. The minor anti-aliasing pixel shifts in axis tick labels or grid lines are blurred away.

So I think this might actually work in CI as well. Give it a spin on your machine, it might require further tweaking of the blur or threshold values.

rgieseke avatar May 26 '25 20:05 rgieseke

I've experimented with generating the chart screenshots in CI (if they are missing). This should minimize differences and would enable a workflow where one would delete the current screenshots if there are desired changes and they would automatically be generated.

I'm not sure if this works from a pull request (even with allowed edits), so I'd probably not directly include this in this PR, but this could be useful for tweaking the thresholds I think .

See https://github.com/rgieseke/layercake/pull/1

rgieseke avatar May 28 '25 11:05 rgieseke

I've cherry-picked the CI-generated screenshots into this PR. When I locally run the tests they pass, but fail with a lower threshold of 0.05, so the values are probably ok.

rgieseke avatar May 28 '25 11:05 rgieseke

Sorry for my slowness on this one. I checked it out and it looks good! I had to add this network idle check to get the download test to pass.

The first time I run it, they fail saying there no tests bc there is nothing committed for darwin. Probably best for me to commit darwin playwright screenshots. Otherwise I think this is good to merge!

mhkeller avatar Jul 12 '25 14:07 mhkeller

@rgieseke let me know if there's anything else you want to add to this one. Otherwise, we'll merge!

mhkeller avatar Jul 12 '25 15:07 mhkeller

I think once this is merged, I can finish the rest of the migration this week. I'd love to finish it before I'm set to give a tutorial on Friday.

mhkeller avatar Jul 13 '25 03:07 mhkeller

Just went through it. Looks good to me! We can tweak if need be.

mhkeller avatar Jul 13 '25 15:07 mhkeller

Just went through it. Looks good to me! We can tweak if need be.

Awesome, and "tweak if need be" would have been my reply!

rgieseke avatar Jul 13 '25 16:07 rgieseke