[BUG] Unit tests time out randomly on ubuntu-latest
Description of the error
Our unit test workflow seems to get stuck on the ubuntu-latest runner. We have 4 workflows on ubuntu-latest, for different versions of Python. Sometimes one out of four (or, rarely, two out of four: https://github.com/ISISNeutronMuon/MDANSE/actions/runs/13350554804) ends up marked as failed, with the error:
The job running on runner GitHub Actions X has exceeded the maximum execution time of 360 minutes.
Describe the expected result
Most of the time, if the tests all pass on the other platforms, they should also pass on Linux.
Describe the actual result
Usually, the tests all complete without problems. Sometimes, the unit tests stop running after the first file. The output is:
Analysis/test_average_structure.py ................ [ 2%]
and the next test file,
Analysis/test_dynamics.py
either never starts or never completes.
In the end the entire workflow times out.
Suggested fix
Since it is not reproducible and seems to affect only one platform, it is difficult to tell whether the error is on our side or a problem with the runner. However, when the test fails, it seems to fail at the same point in the workflow every time, so maybe the unit tests need to be changed somehow.
From a different point of view, we could also lower the time limit for the workflows so they fail sooner. We don't normally need more than 20 minutes, so a limit of 30 minutes would be more than enough.
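The lower limit could be set per job with the GitHub Actions `timeout-minutes` key, which overrides the 360-minute default. A minimal sketch (the job name is illustrative, not from our workflow files):

```yaml
jobs:
  unit-tests:  # illustrative job name
    runs-on: ubuntu-latest
    # Cancel the job after 30 minutes instead of the default 360,
    # so a stuck run fails quickly instead of blocking for 6 hours.
    timeout-minutes: 30
```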
Additional details
N/A
It seems that the tests always get stuck on the Ubuntu runner, though not always on the same test after all. Latest failure:
Run cd MDANSE/Tests/UnitTests
============================= test session starts ==============================
platform linux -- Python 3.12.9, pytest-8.3.4, pluggy-1.5.0
rootdir: /home/runner/work/MDANSE/MDANSE/MDANSE
configfile: pyproject.toml
collected 702 items
Analysis/test_average_structure.py ................ [ 2%]
Analysis/test_dynamics.py .............................................. [ 8%]
Error: The operation was canceled.
And another one, also on Ubuntu:
Run cd MDANSE/Tests/UnitTests
============================= test session starts ==============================
platform linux -- Python 3.12.9, pytest-8.3.4, pluggy-1.5.0
rootdir: /home/runner/work/MDANSE/MDANSE/MDANSE
configfile: pyproject.toml
collected 702 items
Analysis/test_average_structure.py ................ [ 2%]
Analysis/test_dynamics.py .............................................. [ 8%]
.............................................. [ 15%]
Analysis/test_infrared.py .. [ 15%]
Analysis/test_mcstas.py x [ 15%]
Analysis/test_mdmc_h5md.py .. [ 16%]
Analysis/test_meansquare.py .. [ 16%]
Analysis/test_mock_dynamics.py ...... [ 17%]
Analysis/test_molecule_names.py .. [ 17%]
Analysis/test_qvectors.py .. [ 17%]
Analysis/test_resolutions.py ....... [ 18%]
Analysis/test_scattering.py .................. [ 21%]
Error: The operation was canceled.
Suggest we try adding a global timeout to the tests, so we actually get some debug information rather than GitHub killing the runner?
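One way to sketch the global-timeout idea is the pytest-timeout plugin, configured in pyproject.toml (assuming adding it as a test dependency is acceptable; the 4-minute value is illustrative):

```toml
[tool.pytest.ini_options]
# Fail any single test that runs longer than 240 s, so a stuck test
# produces a traceback instead of the whole job hitting the
# 360-minute runner limit.
timeout = 240
```

With this in place, a hung test is killed by pytest and reported with the stack of the frame it was stuck in, which is exactly the debug information we are currently losing.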
We just had another test that got stuck: https://github.com/ISISNeutronMuon/MDANSE/actions/runs/13585226327/job/37978614629
It was a parallel job. The most relevant part of the error output is:
parameters = {'atom_selection': '{}', 'atom_transmutation': '{}', 'frames': 'all', 'grouping_level': 'atom', ...}
traj_info = ('short_traj', '/home/runner/work/MDANSE/MDANSE/MDANSE/Tests/UnitTests/Analysis/../Converted/short_trajectory_after_changes.mdt')
job_info = ('MeanSquareDisplacement', ['msd'], False)
running_mode = ('multicore', -4), output_format = 'MDAFormat'
@pytest.mark.parametrize("traj_info,job_info,running_mode,output_format", total_list)
def test_dynamics_analysis(
parameters, traj_info, job_info, running_mode, output_format
):
temp_name = tempfile.mktemp()
parameters["trajectory"] = traj_info[1]
parameters["running_mode"] = running_mode
parameters["output_files"] = (temp_name, (output_format,), "INFO")
job = IJob.create(job_info[0])
> job.run(parameters, status=True)
Analysis/test_dynamics.py:189:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/opt/hostedtoolcache/Python/3.10.16/x64/lib/python3.10/site-packages/MDANSE/Framework/Jobs/IJob.py:423: in run
IJob._runner[mode](self)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <MDANSE.Framework.Jobs.MeanSquareDisplacement.MeanSquareDisplacement object at 0x7fe9b50782e0>
def _run_multicore(self):
if hasattr(self._status, "_queue_0"):
self._status._queue_0.put("started")
inputQueue = self.inputQueue
outputQueue = self.outputQueue
log_queue = self.log_queue
log_queues = [log_queue]
handlers = [] # handlers that are not QueueHandlers
for handler in LOG.handlers:
if isinstance(handler, QueueHandler):
log_queues.append(handler.queue)
else:
handlers.append(handler)
listener = QueueListener(log_queue, *handlers, respect_handler_level=True)
listener.start()
self._processes = []
for i in range(self.numberOfSteps):
inputQueue.put(i)
for i in range(self.configuration["running_mode"]["slots"]):
self._run_multicore_check_terminate(listener)
p = multiprocessing.Process(
target=self.process_tasks_queue,
args=(inputQueue, outputQueue, log_queues),
)
self._processes.append(p)
p.daemon = False
p.start()
n_results = 0
while n_results != self.numberOfSteps:
self._run_multicore_check_terminate(listener)
if self._status is not None:
self._status.fixed_status(n_results)
try:
index, result = outputQueue.get_nowait()
except queue.Empty:
> time.sleep(0.1)
E Failed: Timeout >240.0s
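The hang is consistent with the polling loop shown above: if a worker process exits without posting its result, n_results never reaches numberOfSteps and the parent spins in get_nowait()/sleep forever. A minimal standalone sketch of that failure mode, with a possible liveness guard (collect_results and its names are illustrative, not MDANSE API):

```python
import multiprocessing
import queue
import time

def worker(input_queue, output_queue):
    # Simulates a worker that dies before posting its result,
    # e.g. killed by the OS or raising before output_queue.put().
    input_queue.get()
    return  # no output_queue.put() -> the parent never sees this task finish

def collect_results(output_queue, expected, processes, deadline=30.0):
    """Poll the output queue like _run_multicore does, but abort once all
    workers are dead (or a deadline passes) with results still missing,
    instead of spinning forever. Sketch only: a production version would
    also drain items still buffered in the queue after workers exit."""
    results = []
    start = time.monotonic()
    while len(results) != expected:
        try:
            results.append(output_queue.get_nowait())
        except queue.Empty:
            if all(not p.is_alive() for p in processes):
                raise RuntimeError(
                    f"workers exited with {expected - len(results)} "
                    "result(s) missing"
                )
            if time.monotonic() - start > deadline:
                raise TimeoutError("gave up waiting for worker results")
            time.sleep(0.1)
    return results

if __name__ == "__main__":
    input_queue = multiprocessing.Queue()
    output_queue = multiprocessing.Queue()
    input_queue.put(0)
    p = multiprocessing.Process(target=worker, args=(input_queue, output_queue))
    p.start()
    p.join()
    try:
        collect_results(output_queue, expected=1, processes=[p])
    except RuntimeError as exc:
        print(exc)  # fails fast with a diagnostic instead of hanging
```

A guard like this inside the polling loop would turn a 360-minute (or 240 s) timeout into an immediate, informative failure whenever a worker dies silently.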