[BUG] Unit tests time out randomly on ubuntu-latest
Description of the error
Our unit test workflow seems to get stuck on the ubuntu-latest runner. We have 4 workflows on ubuntu-latest, for different versions of Python. Sometimes one out of four (or, rarely, two out of four: https://github.com/ISISNeutronMuon/MDANSE/actions/runs/13350554804) ends up marked as failed, with the error:
The job running on runner GitHub Actions X has exceeded the maximum execution time of 360 minutes.
Describe the expected result
Most of the time, if the tests all pass on the other platforms, they should also pass on Linux.
Describe the actual result
Usually, the tests all complete without problems. Sometimes, the unit tests stop running after the first file. The output is:
Analysis/test_average_structure.py ................ [ 2%]
and the next test file,
Analysis/test_dynamics.py
either never starts or never completes.
In the end the entire workflow times out.
Suggested fix
Since it is not reproducible and seems to affect only one platform, it is difficult to tell whether the error is on our side or a problem with the runner. However, when the test fails, it seems to fail at the same point in the workflow every time, so maybe the unit tests need to be changed somehow.
From a different point of view, we could also lower the time limit for the workflows so they fail sooner. We don't normally need more than 20 minutes, so a limit of 30 minutes would be more than enough.
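The lower limit could be set per job with the GitHub Actions `timeout-minutes` key, which overrides the 360-minute default. A minimal sketch (the job name is illustrative, not from our workflow files):

```yaml
jobs:
  unit-tests:  # illustrative job name
    runs-on: ubuntu-latest
    # Cancel the job after 30 minutes instead of the default 360,
    # so a stuck run fails quickly instead of blocking for 6 hours.
    timeout-minutes: 30
```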
Additional details
N/A
It seems that the tests always get stuck on the Ubuntu runner, though not always on the same test after all. Latest failure:
Run cd MDANSE/Tests/UnitTests
============================= test session starts ==============================
platform linux -- Python 3.12.9, pytest-8.3.4, pluggy-1.5.0
rootdir: /home/runner/work/MDANSE/MDANSE/MDANSE
configfile: pyproject.toml
collected 702 items
Analysis/test_average_structure.py ................ [ 2%]
Analysis/test_dynamics.py .............................................. [ 8%]
Error: The operation was canceled.
And another one, also on Ubuntu:
Run cd MDANSE/Tests/UnitTests
============================= test session starts ==============================
platform linux -- Python 3.12.9, pytest-8.3.4, pluggy-1.5.0
rootdir: /home/runner/work/MDANSE/MDANSE/MDANSE
configfile: pyproject.toml
collected 702 items
Analysis/test_average_structure.py ................ [ 2%]
Analysis/test_dynamics.py .............................................. [ 8%]
.............................................. [ 15%]
Analysis/test_infrared.py .. [ 15%]
Analysis/test_mcstas.py x [ 15%]
Analysis/test_mdmc_h5md.py .. [ 16%]
Analysis/test_meansquare.py .. [ 16%]
Analysis/test_mock_dynamics.py ...... [ 17%]
Analysis/test_molecule_names.py .. [ 17%]
Analysis/test_qvectors.py .. [ 17%]
Analysis/test_resolutions.py ....... [ 18%]
Analysis/test_scattering.py .................. [ 21%]
Error: The operation was canceled.
Suggest we try adding a global timeout to the tests, so we actually get some debug information rather than GitHub killing the runner?
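One way to sketch the global-timeout idea is the pytest-timeout plugin, configured in pyproject.toml (assuming adding it as a test dependency is acceptable; the 4-minute value is illustrative):

```toml
[tool.pytest.ini_options]
# Fail any single test that runs longer than 240 s, so a stuck test
# produces a traceback instead of the whole job hitting the
# 360-minute runner limit.
timeout = 240
```

With this in place, a hung test is killed by pytest and reported with the stack of the frame it was stuck in, which is exactly the debug information we are currently losing.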
We just had another test that got stuck: https://github.com/ISISNeutronMuon/MDANSE/actions/runs/13585226327/job/37978614629
It was a parallel job. The most relevant part of the error output is:
parameters = {'atom_selection': '{}', 'atom_transmutation': '{}', 'frames': 'all', 'grouping_level': 'atom', ...}
traj_info = ('short_traj', '/home/runner/work/MDANSE/MDANSE/MDANSE/Tests/UnitTests/Analysis/../Converted/short_trajectory_after_changes.mdt')
job_info = ('MeanSquareDisplacement', ['msd'], False)
running_mode = ('multicore', -4), output_format = 'MDAFormat'
@pytest.mark.parametrize("traj_info,job_info,running_mode,output_format", total_list)
def test_dynamics_analysis(
parameters, traj_info, job_info, running_mode, output_format
):
temp_name = tempfile.mktemp()
parameters["trajectory"] = traj_info[1]
parameters["running_mode"] = running_mode
parameters["output_files"] = (temp_name, (output_format,), "INFO")
job = IJob.create(job_info[0])
> job.run(parameters, status=True)
Analysis/test_dynamics.py:189:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/opt/hostedtoolcache/Python/3.10.16/x64/lib/python3.10/site-packages/MDANSE/Framework/Jobs/IJob.py:423: in run
IJob._runner[mode](self)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <MDANSE.Framework.Jobs.MeanSquareDisplacement.MeanSquareDisplacement object at 0x7fe9b50782e0>
def _run_multicore(self):
if hasattr(self._status, "_queue_0"):
self._status._queue_0.put("started")
inputQueue = self.inputQueue
outputQueue = self.outputQueue
log_queue = self.log_queue
log_queues = [log_queue]
handlers = [] # handlers that are not QueueHandlers
for handler in LOG.handlers:
if isinstance(handler, QueueHandler):
log_queues.append(handler.queue)
else:
handlers.append(handler)
listener = QueueListener(log_queue, *handlers, respect_handler_level=True)
listener.start()
self._processes = []
for i in range(self.numberOfSteps):
inputQueue.put(i)
for i in range(self.configuration["running_mode"]["slots"]):
self._run_multicore_check_terminate(listener)
p = multiprocessing.Process(
target=self.process_tasks_queue,
args=(inputQueue, outputQueue, log_queues),
)
self._processes.append(p)
p.daemon = False
p.start()
n_results = 0
while n_results != self.numberOfSteps:
self._run_multicore_check_terminate(listener)
if self._status is not None:
self._status.fixed_status(n_results)
try:
index, result = outputQueue.get_nowait()
except queue.Empty:
> time.sleep(0.1)
E Failed: Timeout >240.0s
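The hang is consistent with the polling loop shown above: if a worker process exits without posting its result, n_results never reaches numberOfSteps and the parent spins in get_nowait()/sleep forever. A minimal standalone sketch of that failure mode, with a possible liveness guard (collect_results and its names are illustrative, not MDANSE API):

```python
import multiprocessing
import queue
import time

def worker(input_queue, output_queue):
    # Simulates a worker that dies before posting its result,
    # e.g. killed by the OS or raising before output_queue.put().
    input_queue.get()
    return  # no output_queue.put() -> the parent never sees this task finish

def collect_results(output_queue, expected, processes, deadline=30.0):
    """Poll the output queue like _run_multicore does, but abort once all
    workers are dead (or a deadline passes) with results still missing,
    instead of spinning forever. Sketch only: a production version would
    also drain items still buffered in the queue after workers exit."""
    results = []
    start = time.monotonic()
    while len(results) != expected:
        try:
            results.append(output_queue.get_nowait())
        except queue.Empty:
            if all(not p.is_alive() for p in processes):
                raise RuntimeError(
                    f"workers exited with {expected - len(results)} "
                    "result(s) missing"
                )
            if time.monotonic() - start > deadline:
                raise TimeoutError("gave up waiting for worker results")
            time.sleep(0.1)
    return results

if __name__ == "__main__":
    input_queue = multiprocessing.Queue()
    output_queue = multiprocessing.Queue()
    input_queue.put(0)
    p = multiprocessing.Process(target=worker, args=(input_queue, output_queue))
    p.start()
    p.join()
    try:
        collect_results(output_queue, expected=1, processes=[p])
    except RuntimeError as exc:
        print(exc)  # fails fast with a diagnostic instead of hanging
```

A guard like this inside the polling loop would turn a 360-minute (or 240 s) timeout into an immediate, informative failure whenever a worker dies silently.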