RenderEngineVtk unit test failure on Mac M1
In our Mac M1 CI testing, the //geometry/render_vtk:internal_render_engine_vtk_test unit test is failing.
We need to figure out why it's failing and then decide the best way to make it pass (or whether to disable it).
See here https://github.com/RobotLocomotion/drake/issues/17009#issuecomment-1113582278 for the full error message.
To +@BetsyMcPhail for delegation.
In #17568, I've disabled the test case so that we can turn on M1 in Production. As part of resolving this issue, we should re-enable this test.
I initially attempted to disable only the two unit test cases that were failing, but when I did, other cases started failing.
To do: move M1 into Production now. Then also start trying to debug this unit test.
One debugging idea: see TEST_UNDECLARED_OUTPUTS_DIR documented at https://bazel.build/reference/test-encyclopedia. If the unit test saves the rendered images to that folder, they would remain on disk after the test concludes, in the bazel-testlogs directory.
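For example, a sketch along those lines (the helper name is made up for illustration; `SaveToPng` comes from drake/systems/sensors/image_writer.h, assuming that's available to the test):

```cpp
#include <cstdlib>
#include <string>

#include "drake/systems/sensors/image.h"
#include "drake/systems/sensors/image_writer.h"

// Hypothetical debugging helper: write a rendered image into Bazel's
// undeclared-outputs directory, where it survives the test run and ends
// up under bazel-testlogs/.../test.outputs/.
void DumpForDebugging(const drake::systems::sensors::ImageRgba8U& color,
                      const std::string& name) {
  const char* dir = std::getenv("TEST_UNDECLARED_OUTPUTS_DIR");
  if (dir == nullptr) {
    return;  // Not running under `bazel test`; nothing to do.
  }
  drake::systems::sensors::SaveToPng(
      color, std::string(dir) + "/" + name + ".png");
}
```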
It is not at all obvious what the underlying cause is here; we know the tests work on a non-virtualized M1 Mac.
https://github.com/RobotLocomotion/drake/blob/08ca165f78e02be3a81e05f708a4c975c3120cef/geometry/render_vtk/test/internal_render_engine_vtk_test.cc#L521
The TerrainTest reliably fails when the camera is set to depth=4.999f; a value of 4.4 does not fail. (Maybe we are just hitting a subtle threshold problem; there are comments about different thresholds elsewhere for Linux and Mac.)
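To make the threshold concrete, the knob in question looks roughly like this (a sketch using `drake::geometry::render::DepthRange`; the near-plane value of 0.1 is an assumption, not the test's actual setting):

```cpp
#include "drake/geometry/render/render_camera.h"

using drake::geometry::render::DepthRange;

// Reliably fails on the virtualized runner: the terrain comes back as
// the background color instead of being rendered.
const DepthRange failing_range(0.1, 4.999);

// Does not fail -- only the far plane differs, which is what suggests
// we may be sitting right on a precision/threshold boundary.
const DepthRange passing_range(0.1, 4.4);
```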
The confusing part is that in CI on Orka it gets the background color rather than rendering the terrain (the background is orange; the terrain is always black).
*(Side-by-side comparison images: Linux (good) renders the terrain; Orka (bad) shows only the background color.)*
The terrain "closer than near" test also fails for the same reason: on Linux the image is black, and in Orka CI it is orange.
> I initially attempted to disable only the two unit test cases that were failing, but when I did, other cases started failing.
I haven't experienced this exactly, but in repeated testing I have occasionally been able to get a different test to fail. The failures all seem to relate to spheres and opacity (possibly a red herring), but my impression is that there's something special about the OpenGL setup or environment virtualization in the Orka CI that is causing this.
One thought we had: would it make sense to
- filter out just the terrain and horizon tests, since they're known bad, rather than all of them (see the sketch after this list), and
- if needed, mark some tests as flaky, which Bazel supposedly supports?
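A sketch of the first idea: skip only the known-bad cases at runtime. The `DRAKE_ON_ORKA_CI` environment-variable name and the test names here are made up for illustration; in practice this could equally be done with a `--gtest_filter` argument in the BUILD file.

```cpp
#include <cstdlib>

#include <gtest/gtest.h>

// Hypothetical guard: skip a known-bad case only on the virtualized M1
// runner, leaving it enabled everywhere else. The environment variable
// name is illustrative, not something Drake's CI actually sets.
#define MAYBE_SKIP_ON_ORKA()                                        \
  if (std::getenv("DRAKE_ON_ORKA_CI") != nullptr) {                 \
    GTEST_SKIP() << "Known failure under virtualized M1 CI; "       \
                    "see issue #17009.";                            \
  }

TEST(TerrainTest, RendersTerrainNotBackground) {
  MAYBE_SKIP_ON_ORKA();
  // ... the actual rendering assertions would go here ...
}
```

On the second idea: Bazel's test rules do have a `flaky = True` attribute, which reruns a failing test up to three times and only reports failure if every attempt fails; that would paper over the problem rather than explain it.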
A handful of the rendering tests seem to work reliably, but it's very unclear what could be causing this test failure. I don't believe Drake or VTK are to blame here.
Files for myself to help debug this:
> ... we know the tests work on a non-virtualized M1 Mac.
Interesting. Running while ssh'd into a native (non-virtualized) M1 Mac, I recall that they were failing for me.
If they really do pass locally, then I'm fine with just disabling them for CI under the umbrella of "cloud problems".
And yes, in that case, we should change the gtest filter to only skip the failing test cases, instead of all test cases.