Kernel sometimes dies with notebook executor

Open Eric-Arellano opened this issue 1 year ago • 3 comments

For example, see https://github.com/Qiskit/documentation/actions/runs/12015838807/job/33494742907

task: <Task finished name='Task-35' coro=<execute_notebook() done, defined at /home/runner/work/documentation/documentation/scripts/nb-tester/qiskit_docs_notebook_tester/__init__.py:253> exception=DeadKernelError('Kernel died')>
Traceback (most recent call last):
  File "/home/runner/work/documentation/documentation/scripts/nb-tester/qiskit_docs_notebook_tester/__init__.py", line 268, in execute_notebook
    nb = await _execute_notebook(path, config, working_directory.name)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/runner/work/documentation/documentation/scripts/nb-tester/qiskit_docs_notebook_tester/__init__.py", line 346, in _execute_notebook
    await notebook_client.async_execute()
  File "/home/runner/work/documentation/documentation/.tox/py311/lib/python3.11/site-packages/nbclient/client.py", line 709, in async_execute
    await self.async_execute_cell(

Eric-Arellano, Nov 25 '24 18:11

I've seen this a few more times while refactoring the notebook tester. We can't tell from the logs which notebook is failing, because the error contains no identifying information and the notebooks all run asynchronously. We can narrow it down if jobs fail while running only a subset of notebooks.
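One way to make the failing notebook visible would be to wrap each notebook's coroutine so that any exception is re-raised with the notebook path attached. This is only a rough sketch, not how the notebook tester actually works: `NotebookFailure`, `run_labelled`, `run_all`, and `fake_execute` are hypothetical names, and `fake_execute` stands in for the real executor.

```python
import asyncio


class NotebookFailure(Exception):
    """Hypothetical wrapper that records which notebook raised."""

    def __init__(self, path, cause):
        super().__init__(f"{path}: {type(cause).__name__}: {cause}")
        self.path = path


async def run_labelled(path, execute):
    # Re-raise any error with the notebook path attached, keeping the
    # original exception available as __cause__.
    try:
        return await execute(path)
    except Exception as err:
        raise NotebookFailure(path, err) from err


async def run_all(paths, execute):
    # return_exceptions=True lets the remaining notebooks finish even
    # when one of them fails.
    results = await asyncio.gather(
        *(run_labelled(path, execute) for path in paths),
        return_exceptions=True,
    )
    return [r for r in results if isinstance(r, NotebookFailure)]


async def fake_execute(path):
    # Stand-in for the real executor (assumption): fails for one notebook.
    await asyncio.sleep(0)
    if path == "b.ipynb":
        raise RuntimeError("Kernel died")
    return path


failures = asyncio.run(run_all(["a.ipynb", "b.ipynb", "c.ipynb"], fake_execute))
for failure in failures:
    print(failure)  # b.ipynb: RuntimeError: Kernel died
```

With something like this, the log line would name the notebook even though all notebooks run concurrently.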

In the past, this kind of thing has often been related to Aer. For example, https://github.com/Qiskit/qiskit-aer/issues/2232 might be related.

frankharkins, Dec 03 '24 15:12

I managed to reproduce a similar problem locally when I added more notebooks to the script.

zmq.error.ZMQError: Too many open files

I fixed this locally by raising my open-file limit to 6000 (ulimit -n 6000). Hopefully we can set this in our action too.
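If setting the limit from the shell turns out to be awkward, the same soft limit can in principle be raised from inside a Python process with the standard-library `resource` module (Unix only). A minimal sketch, assuming the hard limit allows it; `raise_open_file_limit` is a hypothetical helper, not something the notebook tester is known to contain:

```python
import resource


def raise_open_file_limit(target=6000):
    """Raise the soft open-file limit towards `target`; return the soft limit in effect."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # An unprivileged process cannot go above the hard limit.
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)
    # Nothing to do if the current soft limit is already high enough.
    if soft == resource.RLIM_INFINITY or soft >= target:
        return soft
    resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return target


print(raise_open_file_limit(6000))
```

The new limit applies to the calling process and to any processes it starts afterwards, which would include kernels launched after the call.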

frankharkins, Dec 06 '24 17:12

@frankharkins let's keep this open until we've gone a few weeks without seeing it, to confirm that #2464 fixed the issue.

Eric-Arellano, Dec 16 '24 14:12

We haven't noticed this in a while. I think https://github.com/Qiskit/documentation/pull/3143 helped a lot. Thanks @frankharkins!

Eric-Arellano, Jul 30 '25 21:07