Kernel sometimes dies with notebook executor
For example: https://github.com/Qiskit/documentation/actions/runs/12015838807/job/33494742907
task: <Task finished name='Task-35' coro=<execute_notebook() done, defined at /home/runner/work/documentation/documentation/scripts/nb-tester/qiskit_docs_notebook_tester/__init__.py:253> exception=DeadKernelError('Kernel died')>
Traceback (most recent call last):
File "/home/runner/work/documentation/documentation/scripts/nb-tester/qiskit_docs_notebook_tester/__init__.py", line 268, in execute_notebook
nb = await _execute_notebook(path, config, working_directory.name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/runner/work/documentation/documentation/scripts/nb-tester/qiskit_docs_notebook_tester/__init__.py", line 346, in _execute_notebook
await notebook_client.async_execute()
File "/home/runner/work/documentation/documentation/.tox/py311/lib/python3.11/site-packages/nbclient/client.py", line 709, in async_execute
await self.async_execute_cell(
I've seen this a few more times while refactoring the notebook tester. We can't tell from the logs which notebook is responsible, because the error contains no identifying information and the notebooks all run asynchronously. We can narrow it down if a job fails while running only a subset of notebooks.
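One way to make the failing notebook show up in the logs would be to wrap each notebook's coroutine so that any exception is re-raised with the path attached. The following is only a rough sketch, not the tester's actual code: `NotebookFailure`, `run_labelled`, and `run_all` are hypothetical names, and `execute` stands in for whatever coroutine actually runs a notebook.

```python
import asyncio


class NotebookFailure(Exception):
    """Hypothetical wrapper that records which notebook failed."""

    def __init__(self, path, cause):
        super().__init__(f"{path}: {type(cause).__name__}: {cause}")
        self.path = path


async def run_labelled(path, execute):
    """Await execute(path); re-raise any error with the path attached."""
    try:
        return await execute(path)
    except Exception as err:
        # Chain the original error so its traceback is still available.
        raise NotebookFailure(path, err) from err


async def run_all(paths, execute):
    """Run all notebooks concurrently and return the labelled failures."""
    results = await asyncio.gather(
        *(run_labelled(path, execute) for path in paths),
        return_exceptions=True,  # keep going when one notebook fails
    )
    return [r for r in results if isinstance(r, NotebookFailure)]
```

With something like this, the message for a dead kernel would start with the notebook path rather than just "Kernel died".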
In the past, this kind of thing has often been related to Aer. For example, https://github.com/Qiskit/qiskit-aer/issues/2232 might be related.
I managed to reproduce a similar problem locally when I added more notebooks to the script.
zmq.error.ZMQError: Too many open files
I fixed this locally by raising my open-file limit to 6000 (ulimit -n 6000). Hopefully we can set this in our action too.
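If setting the limit in the shell turns out to be awkward, the script could also try raising its own soft limit at startup using Python's standard `resource` module (Unix only). This is a minimal sketch under that assumption; `raise_open_file_limit` is a hypothetical helper, and 6000 is just the value that worked locally above.

```python
import resource


def raise_open_file_limit(target=6000):
    """Raise the soft open-file limit towards `target`; return the soft limit in effect."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # An unprivileged process cannot exceed the hard limit, so cap the target.
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)
    # Nothing to do if the current soft limit is already high enough.
    if soft == resource.RLIM_INFINITY or soft >= target:
        return soft
    resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return target
```

This would need to be called before any kernels are started, so that the sockets they open count against the raised limit.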
@frankharkins let's keep this open until we've gone a few weeks without seeing it, to confirm that #2464 did fix the issue.
We haven't noticed this in a while. I think https://github.com/Qiskit/documentation/pull/3143 helped a lot. Thanks @frankharkins!