//tutorials:py/hydroelastic_contact_basics_test failure in CI build

Open ggould-tri opened this issue 1 year ago • 19 comments

What happened?

//tutorials:py/hydroelastic_contact_basics_test failed on a PR build, but succeeded in two rebuilds.

https://drake-jenkins.csail.mit.edu/job/linux-jammy-gcc-bazel-experimental-everything-release/5204/consoleFull

This is a first occurrence so I am closing this issue for now.

Version

No response

What operating system are you using?

Ubuntu 22.04

What installation option are you using?

No response

Relevant log output

[12:03:45 PM]  ==================== Test output for //tutorials:py/hydroelastic_contact_basics_test:
[12:03:45 PM]  [IPKernelApp] WARNING | debugpy_stream undefined, debugging will not be enabled
[12:03:45 PM]  Running notebook as a test (non-interactive)
[12:03:45 PM]  Traceback (most recent call last):
[12:03:45 PM]    File "/usr/lib/python3/dist-packages/nbclient/client.py", line 626, in _async_poll_for_reply
[12:03:45 PM]      msg = await ensure_async(self.kc.shell_channel.get_msg(timeout=new_timeout))
[12:03:45 PM]    File "/usr/lib/python3/dist-packages/nbclient/util.py", line 96, in ensure_async
[12:03:45 PM]      result = await obj
[12:03:45 PM]    File "/usr/lib/python3/dist-packages/jupyter_client/channels.py", line 224, in get_msg
[12:03:45 PM]      ready = await self.socket.poll(timeout)
[12:03:45 PM]  asyncio.exceptions.CancelledError
[12:03:45 PM]  
[12:03:45 PM]  During handling of the above exception, another exception occurred:
[12:03:45 PM]  
[12:03:45 PM]  Traceback (most recent call last):
[12:03:45 PM]    File "/usr/lib/python3/dist-packages/nbclient/client.py", line 838, in async_execute_cell
[12:03:45 PM]      exec_reply = await self.task_poll_for_reply
[12:03:45 PM]  asyncio.exceptions.CancelledError
[12:03:45 PM]  
[12:03:45 PM]  During handling of the above exception, another exception occurred:
[12:03:45 PM]  
[12:03:45 PM]  Traceback (most recent call last):
[12:03:45 PM]    File "/media/ephemeral0/ubuntu/workspace/linux-jammy-gcc-bazel-experimental-everything-release/_bazel_ubuntu/d5e9f70b55234713aa722cc9d6b555d5/sandbox/linux-sandbox/24199/execroot/_main/bazel-out/k8-opt/bin/tutorials/py/hydroelastic_contact_basics_test.runfiles/_main/tutorials/hydroelastic_contact_basics_jupyter_py_main.py", line 11, in <module>
[12:03:45 PM]      main()
[12:03:45 PM]    File "/media/ephemeral0/ubuntu/workspace/linux-jammy-gcc-bazel-experimental-everything-release/_bazel_ubuntu/d5e9f70b55234713aa722cc9d6b555d5/sandbox/linux-sandbox/24199/execroot/_main/bazel-out/k8-opt/bin/tutorials/py/hydroelastic_contact_basics_test.runfiles/_main/tutorials/hydroelastic_contact_basics_jupyter_py_main.py", line 7, in main
[12:03:45 PM]      _jupyter_bazel_notebook_main("drake/tutorials/hydroelastic_contact_basics.ipynb", sys.argv[1:])
[12:03:45 PM]    File "/media/ephemeral0/ubuntu/workspace/linux-jammy-gcc-bazel-experimental-everything-release/_bazel_ubuntu/d5e9f70b55234713aa722cc9d6b555d5/sandbox/linux-sandbox/24199/execroot/_main/bazel-out/k8-opt/bin/tutorials/py/hydroelastic_contact_basics_test.runfiles/_main/tools/jupyter/jupyter_bazel.py", line 80, in _jupyter_bazel_notebook_main
[12:03:45 PM]      ep.preprocess(nb, resources={'metadata': {'path': notebook_dir}})
[12:03:45 PM]    File "/usr/lib/python3/dist-packages/nbconvert/preprocessors/execute.py", line 84, in preprocess
[12:03:45 PM]      self.preprocess_cell(cell, resources, index)
[12:03:45 PM]    File "/media/ephemeral0/ubuntu/workspace/linux-jammy-gcc-bazel-experimental-everything-release/_bazel_ubuntu/d5e9f70b55234713aa722cc9d6b555d5/sandbox/linux-sandbox/24199/execroot/_main/bazel-out/k8-opt/bin/tutorials/py/hydroelastic_contact_basics_test.runfiles/_main/tools/jupyter/jupyter_bazel.py", line 34, in preprocess_cell
[12:03:45 PM]      return super().preprocess_cell(*args, **kwargs)
[12:03:45 PM]    File "/usr/lib/python3/dist-packages/nbconvert/preprocessors/execute.py", line 105, in preprocess_cell
[12:03:45 PM]      cell = self.execute_cell(cell, index, store_history=True)
[12:03:45 PM]    File "/usr/lib/python3/dist-packages/nbclient/util.py", line 84, in wrapped
[12:03:45 PM]      return just_run(coro(*args, **kwargs))
[12:03:45 PM]    File "/usr/lib/python3/dist-packages/nbclient/util.py", line 62, in just_run
[12:03:45 PM]      return loop.run_until_complete(coro)
[12:03:45 PM]    File "/usr/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
[12:03:45 PM]      return future.result()
[12:03:45 PM]    File "/usr/lib/python3/dist-packages/nbclient/client.py", line 842, in async_execute_cell
[12:03:45 PM]      raise DeadKernelError("Kernel died")
[12:03:45 PM]  nbclient.exceptions.DeadKernelError: Kernel died
[12:03:45 PM]  ================================================================================
[12:03:45 PM]  ==================== Test output for //tutorials:py/hydroelastic_contact_basics_test:
[12:03:45 PM]  [IPKernelApp] WARNING | debugpy_stream undefined, debugging will not be enabled
[12:03:45 PM]  Running notebook as a test (non-interactive)
[12:03:45 PM]  Traceback (most recent call last):
[12:03:45 PM]    File "/usr/lib/python3/dist-packages/nbclient/client.py", line 626, in _async_poll_for_reply
[12:03:45 PM]      msg = await ensure_async(self.kc.shell_channel.get_msg(timeout=new_timeout))
[12:03:45 PM]    File "/usr/lib/python3/dist-packages/nbclient/util.py", line 96, in ensure_async
[12:03:45 PM]      result = await obj
[12:03:45 PM]    File "/usr/lib/python3/dist-packages/jupyter_client/channels.py", line 224, in get_msg
[12:03:45 PM]      ready = await self.socket.poll(timeout)
[12:03:45 PM]  asyncio.exceptions.CancelledError
[12:03:45 PM]  
[12:03:45 PM]  During handling of the above exception, another exception occurred:
[12:03:45 PM]  
[12:03:45 PM]  Traceback (most recent call last):
[12:03:45 PM]    File "/usr/lib/python3/dist-packages/nbclient/client.py", line 838, in async_execute_cell
[12:03:45 PM]      exec_reply = await self.task_poll_for_reply
[12:03:45 PM]  asyncio.exceptions.CancelledError
[12:03:45 PM]  
[12:03:45 PM]  During handling of the above exception, another exception occurred:
[12:03:45 PM]  
[12:03:45 PM]  Traceback (most recent call last):
[12:03:45 PM]    File "/media/ephemeral0/ubuntu/workspace/linux-jammy-gcc-bazel-experimental-everything-release/_bazel_ubuntu/d5e9f70b55234713aa722cc9d6b555d5/sandbox/linux-sandbox/24235/execroot/_main/bazel-out/k8-opt/bin/tutorials/py/hydroelastic_contact_basics_test.runfiles/_main/tutorials/hydroelastic_contact_basics_jupyter_py_main.py", line 11, in <module>
[12:03:45 PM]      main()
[12:03:45 PM]    File "/media/ephemeral0/ubuntu/workspace/linux-jammy-gcc-bazel-experimental-everything-release/_bazel_ubuntu/d5e9f70b55234713aa722cc9d6b555d5/sandbox/linux-sandbox/24235/execroot/_main/bazel-out/k8-opt/bin/tutorials/py/hydroelastic_contact_basics_test.runfiles/_main/tutorials/hydroelastic_contact_basics_jupyter_py_main.py", line 7, in main
[12:03:45 PM]      _jupyter_bazel_notebook_main("drake/tutorials/hydroelastic_contact_basics.ipynb", sys.argv[1:])
[12:03:45 PM]    File "/media/ephemeral0/ubuntu/workspace/linux-jammy-gcc-bazel-experimental-everything-release/_bazel_ubuntu/d5e9f70b55234713aa722cc9d6b555d5/sandbox/linux-sandbox/24235/execroot/_main/bazel-out/k8-opt/bin/tutorials/py/hydroelastic_contact_basics_test.runfiles/_main/tools/jupyter/jupyter_bazel.py", line 80, in _jupyter_bazel_notebook_main
[12:03:45 PM]      ep.preprocess(nb, resources={'metadata': {'path': notebook_dir}})
[12:03:45 PM]    File "/usr/lib/python3/dist-packages/nbconvert/preprocessors/execute.py", line 84, in preprocess
[12:03:45 PM]      self.preprocess_cell(cell, resources, index)
[12:03:45 PM]    File "/media/ephemeral0/ubuntu/workspace/linux-jammy-gcc-bazel-experimental-everything-release/_bazel_ubuntu/d5e9f70b55234713aa722cc9d6b555d5/sandbox/linux-sandbox/24235/execroot/_main/bazel-out/k8-opt/bin/tutorials/py/hydroelastic_contact_basics_test.runfiles/_main/tools/jupyter/jupyter_bazel.py", line 34, in preprocess_cell
[12:03:45 PM]      return super().preprocess_cell(*args, **kwargs)
[12:03:45 PM]    File "/usr/lib/python3/dist-packages/nbconvert/preprocessors/execute.py", line 105, in preprocess_cell
[12:03:45 PM]      cell = self.execute_cell(cell, index, store_history=True)
[12:03:45 PM]    File "/usr/lib/python3/dist-packages/nbclient/util.py", line 84, in wrapped
[12:03:45 PM]      return just_run(coro(*args, **kwargs))
[12:03:45 PM]    File "/usr/lib/python3/dist-packages/nbclient/util.py", line 62, in just_run
[12:03:45 PM]      return loop.run_until_complete(coro)
[12:03:45 PM]    File "/usr/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
[12:03:45 PM]      return future.result()
[12:03:45 PM]    File "/usr/lib/python3/dist-packages/nbclient/client.py", line 842, in async_execute_cell
[12:03:45 PM]      raise DeadKernelError("Kernel died")
[12:03:45 PM]  nbclient.exceptions.DeadKernelError: Kernel died
[12:03:45 PM]  ================================================================================
[12:03:45 PM]  ==================== Test output for //tutorials:py/hydroelastic_contact_basics_test:
[12:03:45 PM]  [IPKernelApp] WARNING | debugpy_stream undefined, debugging will not be enabled
[12:03:45 PM]  Running notebook as a test (non-interactive)
[12:03:45 PM]  Traceback (most recent call last):
[12:03:45 PM]    File "/usr/lib/python3/dist-packages/nbclient/client.py", line 626, in _async_poll_for_reply
[12:03:45 PM]      msg = await ensure_async(self.kc.shell_channel.get_msg(timeout=new_timeout))
[12:03:45 PM]    File "/usr/lib/python3/dist-packages/nbclient/util.py", line 96, in ensure_async
[12:03:45 PM]      result = await obj
[12:03:45 PM]    File "/usr/lib/python3/dist-packages/jupyter_client/channels.py", line 224, in get_msg
[12:03:45 PM]      ready = await self.socket.poll(timeout)
[12:03:45 PM]  asyncio.exceptions.CancelledError
[12:03:45 PM]  
[12:03:45 PM]  During handling of the above exception, another exception occurred:
[12:03:45 PM]  
[12:03:45 PM]  Traceback (most recent call last):
[12:03:45 PM]    File "/usr/lib/python3/dist-packages/nbclient/client.py", line 838, in async_execute_cell
[12:03:45 PM]      exec_reply = await self.task_poll_for_reply
[12:03:45 PM]  asyncio.exceptions.CancelledError
[12:03:45 PM]  
[12:03:45 PM]  During handling of the above exception, another exception occurred:
[12:03:45 PM]  
[12:03:45 PM]  Traceback (most recent call last):
[12:03:45 PM]    File "/media/ephemeral0/ubuntu/workspace/linux-jammy-gcc-bazel-experimental-everything-release/_bazel_ubuntu/d5e9f70b55234713aa722cc9d6b555d5/sandbox/linux-sandbox/24256/execroot/_main/bazel-out/k8-opt/bin/tutorials/py/hydroelastic_contact_basics_test.runfiles/_main/tutorials/hydroelastic_contact_basics_jupyter_py_main.py", line 11, in <module>
[12:03:45 PM]      main()
[12:03:45 PM]    File "/media/ephemeral0/ubuntu/workspace/linux-jammy-gcc-bazel-experimental-everything-release/_bazel_ubuntu/d5e9f70b55234713aa722cc9d6b555d5/sandbox/linux-sandbox/24256/execroot/_main/bazel-out/k8-opt/bin/tutorials/py/hydroelastic_contact_basics_test.runfiles/_main/tutorials/hydroelastic_contact_basics_jupyter_py_main.py", line 7, in main
[12:03:45 PM]      _jupyter_bazel_notebook_main("drake/tutorials/hydroelastic_contact_basics.ipynb", sys.argv[1:])
[12:03:45 PM]    File "/media/ephemeral0/ubuntu/workspace/linux-jammy-gcc-bazel-experimental-everything-release/_bazel_ubuntu/d5e9f70b55234713aa722cc9d6b555d5/sandbox/linux-sandbox/24256/execroot/_main/bazel-out/k8-opt/bin/tutorials/py/hydroelastic_contact_basics_test.runfiles/_main/tools/jupyter/jupyter_bazel.py", line 80, in _jupyter_bazel_notebook_main
[12:03:45 PM]      ep.preprocess(nb, resources={'metadata': {'path': notebook_dir}})
[12:03:45 PM]    File "/usr/lib/python3/dist-packages/nbconvert/preprocessors/execute.py", line 84, in preprocess
[12:03:45 PM]      self.preprocess_cell(cell, resources, index)
[12:03:45 PM]    File "/media/ephemeral0/ubuntu/workspace/linux-jammy-gcc-bazel-experimental-everything-release/_bazel_ubuntu/d5e9f70b55234713aa722cc9d6b555d5/sandbox/linux-sandbox/24256/execroot/_main/bazel-out/k8-opt/bin/tutorials/py/hydroelastic_contact_basics_test.runfiles/_main/tools/jupyter/jupyter_bazel.py", line 34, in preprocess_cell
[12:03:45 PM]      return super().preprocess_cell(*args, **kwargs)
[12:03:45 PM]    File "/usr/lib/python3/dist-packages/nbconvert/preprocessors/execute.py", line 105, in preprocess_cell
[12:03:45 PM]      cell = self.execute_cell(cell, index, store_history=True)
[12:03:45 PM]    File "/usr/lib/python3/dist-packages/nbclient/util.py", line 84, in wrapped
[12:03:45 PM]      return just_run(coro(*args, **kwargs))
[12:03:45 PM]    File "/usr/lib/python3/dist-packages/nbclient/util.py", line 62, in just_run
[12:03:45 PM]      return loop.run_until_complete(coro)
[12:03:45 PM]    File "/usr/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
[12:03:45 PM]      return future.result()
[12:03:45 PM]    File "/usr/lib/python3/dist-packages/nbclient/client.py", line 842, in async_execute_cell
[12:03:45 PM]      raise DeadKernelError("Kernel died")
[12:03:45 PM]  nbclient.exceptions.DeadKernelError: Kernel died
[12:03:45 PM]  ================================================================================

ggould-tri avatar Feb 04 '25 22:02 ggould-tri

3/27 https://drake-jenkins.csail.mit.edu/job/linux-jammy-gcc-bazel-continuous-debug/2622/console

Aiden2244 avatar Mar 27 '25 13:03 Aiden2244

3/28 https://drake-jenkins.csail.mit.edu/job/linux-jammy-gcc-bazel-experimental-everything-release/5866/consoleFull

jwnimmer-tri avatar Mar 28 '25 15:03 jwnimmer-tri

3/28 https://drake-jenkins.csail.mit.edu/job/linux-jammy-gcc-bazel-experimental-everything-release/5869/consoleFull (twice in a row, for that PR)

jwnimmer-tri avatar Mar 28 '25 15:03 jwnimmer-tri

3/28 https://drake-jenkins.csail.mit.edu/job/linux-jammy-gcc-bazel-experimental-debug/12161/consoleFull
3/30 https://drake-jenkins.csail.mit.edu/job/linux-jammy-gcc-bazel-experimental-debug/12163/consoleFull

jwnimmer-tri avatar Mar 29 '25 18:03 jwnimmer-tri

Something has caused this test to become flaky in the past few days.

@DamrongGuoy please see if you can figure out the problem, e.g., by reproducing the failure locally.
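One way to attempt a local reproduction is to rerun the test many times under Bazel. The following is a minimal sketch, assuming Bazel's `--runs_per_test` and `--test_output=errors` flags; the helper name `bazel_repeat_command` and the run count of 50 are illustrative only, and nothing in this thread confirms that repetition alone triggers the failure.

```python
import shlex

# Test target named in this issue.
TARGET = "//tutorials:py/hydroelastic_contact_basics_test"


def bazel_repeat_command(target: str, runs: int) -> list[str]:
    """Build a `bazel test` command line that runs one target `runs` times.

    Hypothetical helper for illustration; it only assembles the arguments.
    """
    return [
        "bazel",
        "test",
        target,
        f"--runs_per_test={runs}",   # repeat the test to expose flakiness
        "--test_output=errors",      # print logs only for failing runs
    ]


# Print the command so it can be pasted into a shell in a Drake checkout.
print(shlex.join(bazel_repeat_command(TARGET, 50)))
```

The printed command can be run from the root of a Drake checkout. The run count is arbitrary and may need to be much higher for a rare failure.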

jwnimmer-tri avatar Mar 31 '25 00:03 jwnimmer-tri

3/28: https://drake-jenkins.csail.mit.edu/job/linux-jammy-gcc-bazel-experimental-debug/12154/consoleFull
3/28: https://drake-jenkins.csail.mit.edu/job/linux-jammy-gcc-bazel-experimental-everything-release/5875/consoleFull
3/28: https://drake-jenkins.csail.mit.edu/job/linux-jammy-clang-bazel-experimental-everything-release/7835/consoleFull

jwnimmer-tri avatar Mar 31 '25 13:03 jwnimmer-tri

3/29: https://drake-jenkins.csail.mit.edu/job/linux-jammy-gcc-bazel-continuous-debug/2627/
3/29: https://drake-jenkins.csail.mit.edu/job/linux-jammy-gcc-bazel-nightly-everything-debug/403/ (Timeout)
3/30: https://drake-jenkins.csail.mit.edu/job/linux-jammy-clang-bazel-nightly-everything-debug/406/

BetsyMcPhail avatar Mar 31 '25 14:03 BetsyMcPhail

3/31: https://drake-jenkins.csail.mit.edu/job/linux-jammy-clang-bazel-continuous-everything-release/1558
3/31: https://drake-jenkins.csail.mit.edu/job/linux-jammy-gcc-bazel-continuous-everything-release/1136

BetsyMcPhail avatar Mar 31 '25 14:03 BetsyMcPhail

3/31 https://drake-jenkins.csail.mit.edu/job/linux-jammy-gcc-bazel-continuous-debug/2629/
3/31 https://drake-jenkins.csail.mit.edu/job/linux-jammy-gcc-bazel-continuous-debug/2630/

Aiden2244 avatar Mar 31 '25 15:03 Aiden2244

Looking at CDash history, the earliest instance of this error message I can find was on 3/25 at https://drake-cdash.csail.mit.edu/tests/1763068472 on behalf of the PR #22778 build of https://github.com/RobotLocomotion/drake/commit/934be006075f90c31fe18e45e684e9ad06d80d30. See https://drake-jenkins.csail.mit.edu/job/linux-jammy-gcc-bazel-experimental-debug/12102/consoleFull. That puts master at #22806 as the baseline.

jwnimmer-tri avatar Mar 31 '25 15:03 jwnimmer-tri

Actually, we hit it on 3/20 at https://drake-cdash.csail.mit.edu/tests/1758481279 on https://github.com/RobotLocomotion/drake/pull/22785, with a master baseline of #22772.

jwnimmer-tri avatar Mar 31 '25 15:03 jwnimmer-tri

William shared a helpful CDash summary:

https://drake-cdash.csail.mit.edu/queryTests.php?project=Drake&begin=2024-01-01&end=2025-03-31&filtercount=2&showfilters=1&filtercombine=and&field1=testname&compare1=61&value1=%2F%2Ftutorials%3Apy%2Fhydroelastic_contact_basics_test&field2=status&compare2=61&value2=Failed

BetsyMcPhail avatar Mar 31 '25 16:03 BetsyMcPhail

Filtered with test output:

https://drake-cdash.csail.mit.edu/queryTests.php?project=Drake&begin=2024-01-01&end=2025-03-31&filtercount=3&showfilters=1&filtercombine=and&field1=testname&compare1=61&value1=%2F%2Ftutorials%3Apy%2Fhydroelastic_contact_basics_test&field2=status&compare2=61&value2=Failed&field3=testoutput&compare3=95&value3=nbclient.exceptions.DeadKernelError%3A%20Kernel%20died

  • 1/24 - 1/30: there were quite a few cases.
  • 2/4: one failure (this issue was opened).
  • 3/18 - now: lots of failures again.

BetsyMcPhail avatar Mar 31 '25 16:03 BetsyMcPhail

I have tried a few things to repro locally, to no avail. Perhaps someone else will have better luck.

Note that the CI failures take only about 5 seconds to happen, whereas a passing test takes around 20-25 seconds in release builds and much longer in debug builds. That indicates the crash is happening pretty early during startup, possibly even during notebook boot-up before import pydrake, or during import pydrake itself.
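To separate "crash during import" from "crash during notebook boot-up", one option is to import the module in a fresh interpreter and record the exit status and elapsed time. This is only a sketch: `probe_import` and `describe` are hypothetical helper names, and `json` is a stand-in so the snippet runs anywhere; in a Drake environment the module of interest would be `pydrake`.

```python
import subprocess
import sys
import time


def probe_import(module: str, timeout_s: float = 120.0) -> tuple[int, float]:
    """Import `module` in a fresh interpreter; return (returncode, seconds)."""
    start = time.monotonic()
    proc = subprocess.run(
        [sys.executable, "-c", f"import {module}"],
        capture_output=True,
        text=True,
        timeout=timeout_s,
    )
    return proc.returncode, time.monotonic() - start


def describe(returncode: int) -> str:
    """Summarize a subprocess return code in words."""
    if returncode == 0:
        return "import succeeded"
    if returncode < 0:
        # On POSIX, a negative return code means the child died from a signal.
        return f"killed by signal {-returncode}"
    return f"exited with status {returncode}"


# Stand-in module; replace "json" with "pydrake" where pydrake is installed.
code, seconds = probe_import("json")
print(f"{describe(code)} after {seconds:.2f}s")
```

A "killed by signal" result for the bare import would point at the import itself rather than at the Jupyter kernel machinery.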

Besides trying to repro, possible next steps are:

  • Disable the test on master (to prevent it interfering with unrelated PRs).
  • Try to add more information to the error / error handling to get a better handle on exactly what is happening.
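For the second bullet, one possibility is Python's standard `faulthandler`, which prints a traceback when the interpreter dies from a fatal signal. The sketch below enables it through the `PYTHONFAULTHANDLER` environment variable, which child processes inherit. The helper name `run_with_faulthandler` is hypothetical, `os.abort()` stands in for whatever actually kills the kernel, and whether the kernel's stderr would reach the CI log is an assumption not confirmed in this thread.

```python
import os
import subprocess
import sys


def run_with_faulthandler(code: str) -> subprocess.CompletedProcess:
    """Run `code` in a child interpreter with faulthandler enabled at startup."""
    # A non-empty PYTHONFAULTHANDLER makes Python call faulthandler.enable().
    env = dict(os.environ, PYTHONFAULTHANDLER="1")
    return subprocess.run(
        [sys.executable, "-c", code],
        env=env,
        capture_output=True,
        text=True,
    )


# Stand-in crash: abort the child process so faulthandler has something to report.
result = run_with_faulthandler("import os; os.abort()")
print("return code:", result.returncode)
print(result.stderr)
```

If the variable were set in the test's environment, a crashing kernel might leave a traceback behind instead of only "Kernel died".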

jwnimmer-tri avatar Mar 31 '25 16:03 jwnimmer-tri

@jwnimmer-tri is this related to what you said "happening pretty early during startup, possibly even during notebook boot-up before import pydrake, or during import pydrake itself" or not?

/linux-jammy-gcc-bazel-experimental-everything-release/src/bindings/pydrake/BUILD.bazel:453:22: GenerateMypyStubs bindings/pydrake/pydrake/autodiffutils.pyi failed: (Exit 1): stubgen failed: error executing GenerateMypyStubs command

I saw it in https://drake-cdash.csail.mit.edu/builds/1809879

DamrongGuoy avatar Mar 31 '25 19:03 DamrongGuoy

That's not related.

jwnimmer-tri avatar Mar 31 '25 19:03 jwnimmer-tri

3/31 https://drake-jenkins.csail.mit.edu/job/linux-jammy-gcc-bazel-continuous-debug/2631/
3/31 https://drake-jenkins.csail.mit.edu/job/linux-jammy-gcc-bazel-continuous-debug/2632/

BetsyMcPhail avatar Mar 31 '25 20:03 BetsyMcPhail

I'll take this over @DamrongGuoy. At least to the point of getting a reproducible result / experiment.

jwnimmer-tri avatar Mar 31 '25 20:03 jwnimmer-tri

Thank you for not letting me go into the wrong log file.

After reading the right log file, I still can't reproduce it. Please let me know when we know how to do it. I'll also try a few more things.

DamrongGuoy avatar Mar 31 '25 20:03 DamrongGuoy

Seems like //tutorials:py/hydroelastic_contact_nonconvex_mesh_test failed with a similar error (8/14): https://drake-jenkins.csail.mit.edu/view/Production/job/linux-noble-unprovisioned-gcc-bazel-nightly-release/59/

tyler-yankee avatar Aug 14 '25 13:08 tyler-yankee

8/21 //tutorials:py/hydroelastic_contact_nonconvex_mesh_test failed again in https://drake-jenkins.csail.mit.edu/view/Nightly%20Production/job/linux-noble-clang-bazel-nightly-everything-debug/65/

BetsyMcPhail avatar Aug 21 '25 14:08 BetsyMcPhail

Next time it happens on //tutorials:py/hydroelastic_contact_nonconvex_mesh_test, please open a new issue for that notebook. It's a different notebook from the one blamed in this issue.

The hydroelastic_contact_basics_test has not failed recently, so closing this as "not planned".

jwnimmer-tri avatar Aug 24 '25 17:08 jwnimmer-tri

> The hydroelastic_contact_basics_test has not failed recently, so closing this as "not planned".

Oops. It hasn't failed because it's disabled.

=> #23629

jwnimmer-tri avatar Oct 20 '25 22:10 jwnimmer-tri

Will monitor for flakiness issues after re-enabling.

tyler-yankee avatar Oct 21 '25 13:10 tyler-yankee