
failing tests in ESPResSo v4.2.1 due to timeouts

boegel opened this issue 2 years ago • 6 comments

Some tests in the ESPResSo v4.2.1 test suite are known to be flaky, and sometimes hang, for example:

  • https://github.com/espressomd/espresso/issues/4639

We ran into a similar problem when building ESPResSo v4.2.1 for EESSI pilot 2023.06, cf. #331.

boegel avatar Oct 11 '23 06:10 boegel

As mentioned in https://github.com/EESSI/software-layer/pull/331#issuecomment-1756965079, hanging tests seem more likely on aarch64/neoverse_v1, since we didn't see any when building for the other CPU targets, but that could be dumb luck...

@jngrad Does this happen to ring any bells for you? Are you seeing hanging tests more often on certain platforms?

boegel avatar Oct 11 '23 07:10 boegel

I've added a hook in #331 to ignore the failing tests in ESPResSo v4.2.1 if they occur; that's the best we can do for now (other than not running the test suite at all, which is not a good idea imho). I've also updated the list of known issues to include this tracker issue, so we can get ESPResSo deployed in EESSI 2023.06...
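For reference, a minimal sketch of what such an EasyBuild pre-test hook can look like (names and the exact condition are illustrative; the real hook lives in our eb_hooks.py):

    def pre_test_hook(self, *args, **kwargs):
        """Pre-test hook: don't fail the build if the ESPResSo v4.2.1 test suite fails."""
        if self.name == 'ESPResSo' and self.version == '4.2.1':
            # Appending '|| echo ...' to the test command forces a zero exit code,
            # so EasyBuild continues while the failures remain visible in the log.
            self.cfg['testopts'] = "|| echo 'ignoring failing tests (see this tracker issue)'"

This assumes the test command is run through a shell, so that the '||' appended via testopts takes effect.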

boegel avatar Oct 11 '23 07:10 boegel

In my experience on Fedora Koji, our test cases aren't more prone to failure on neoverse than on x86_64. However, on architectures other than ARM and x86_64, we do see a lot of variability. For example, when packaging for openSUSE, we ended up disabling every architecture but x86_64. See openSUSE:Factory/python3-espressomd and click on "Show 17 excluded/disabled results" to see the list.

jngrad avatar Oct 11 '23 09:10 jngrad

Here are all the statistical tests: mass-and-rinertia_per_particle, rotational-diffusion-aniso, integrator_npt_stats, constant_pH_stats, langevin_thermostat_stats, brownian_dynamics_stats, dpd_stats, stokesian_dynamics.

They are known to take a long time on our CI pipelines, because we run them concurrently and max out the host machine's CPU resources via MPI oversubscription, so that hyperthreaded cores are fully used. This makes their runtimes fluctuate wildly, and failures cascade, since the tests compete with one another for the same resources (e.g. if one test times out, there is a very good chance another, unrelated test will time out too). More details can be found in espressomd/espresso#3883.
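To make the contention effect concrete, here's a small self-contained Python sketch (not ESPResSo code) that oversubscribes the CPU in the same spirit; the spread between the fastest and slowest task grows sharply once there are more busy workers than cores:

    import multiprocessing as mp
    import os
    import time

    def busy(_):
        """A fixed amount of CPU-bound work; its wall time depends on contention."""
        t0 = time.perf_counter()
        x = 0
        for i in range(10_000_000):
            x += i * i
        return time.perf_counter() - t0

    if __name__ == "__main__":
        ncores = os.cpu_count()
        for nworkers in (ncores, 2 * ncores):  # nominal vs. oversubscribed
            with mp.Pool(nworkers) as pool:
                times = pool.map(busy, range(4 * nworkers))
            print(f"{nworkers} workers on {ncores} cores: "
                  f"min={min(times):.2f}s, max={max(times):.2f}s")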

Having said that, you don't seem to run these tests concurrently, so your CI pipelines should not be experiencing the issue I just described. Maybe there is a deeper issue in ESPResSo's MPI code; unfortunately, timeout information alone is not sufficient for me to investigate an MPI issue.
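If someone wants to dig deeper, one cheap way to turn a hang into actionable data is to make the test dump Python stack traces before ctest kills it, e.g. with the standard faulthandler module (the timeout value and test module below are placeholders):

    import faulthandler
    import sys

    # Watchdog: after 300 s, dump the stack of every Python thread to stderr,
    # and keep doing so every 300 s; exit=False leaves the process running,
    # so ctest's own timeout still applies.
    faulthandler.dump_traceback_later(300, repeat=True, file=sys.stderr, exit=False)

    # ... then run the flaky test as usual, e.g. (placeholder):
    # import unittest
    # unittest.main(module="dpd_stats")

If the hang is inside native MPI code, the traceback will only show the Python frame that entered it, but that already narrows things down.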

jngrad avatar Oct 11 '23 09:10 jngrad

Interestingly, this problem did not pop up for the installation of ESPResSo v4.2.1 with foss/2023a in software.eessi.io, see #455 ...

boegel avatar Jan 19 '24 07:01 boegel

> Interestingly, this problem did not pop up for the installation of ESPResSo v4.2.1 with foss/2023a in software.eessi.io, see #455 ...

Scratch that, that's incorrect. We have a hook in place to ignore failing tests on neoverse_v1, and we are still seeing timeouts (only for that CPU target; a sketch of the per-target check follows the log below). For ESPResSo/4.2.1-foss-2023a:

The following tests FAILED:
          4 - test_checkpoint__therm_lb__p3m_cpu__lj__lb_cpu_ascii (Failed)
         34 - accumulator_correlator (Timeout)
         48 - interactions_bond_angle (Timeout)
         65 - rotation_per_particle (Timeout)
         66 - rotational_inertia (Timeout)
         71 - reaction_ensemble (Timeout)
         77 - canonical_ensemble (Timeout)
        100 - integrator_npt (Failed)
        101 - integrator_npt_stats (Failed)
        111 - lb_stats (Timeout)
        116 - dpd_stats (Timeout)
        124 - collision_detection (Timeout)
        151 - thermostats_anisotropic (Timeout)
        162 - lb_interpolation (Timeout)
        164 - oif_volume_conservation (Timeout)
        167 - lb_boundary (Timeout)

boegel avatar Jan 19 '24 07:01 boegel