Unit tests for `prif_stop` and `prif_error_stop` make fragile non-portable assumptions

Open bonachea opened this issue 5 months ago • 1 comments

Currently, the approach taken to unit testing `prif_stop` and `prif_error_stop` is to unconditionally invoke `./build/run-fpm.sh` from within the fpm-built Caffeine unit test and inspect the resulting process exit code.
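For illustration only, the general pattern (spawn a child command, then check its exit code) looks roughly like the Python sketch below. This is not the actual Fortran test code: the helper name `child_exit_code` is hypothetical, and the child command is a stand-in for the real sub-job launched via the build script.

```python
import subprocess
import sys


def child_exit_code(command):
    """Run `command` as a child process and return its exit code.

    Hypothetical helper: it only illustrates the spawn-and-inspect pattern.
    """
    return subprocess.run(command).returncode


# Stand-in for the sub-job: a child process that exits with a nonzero status.
stand_in = [sys.executable, "-c", "import sys; sys.exit(3)"]
code = child_exit_code(stand_in)
print("child exit code:", code)
```

Even in this single-node form, the check depends on the launcher being present and on the exit code reaching the parent intact.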

I consider this entire approach to be very fragile for multiple reasons:

  1. Assumes Caffeine test executable is run from the source/build directory
  2. Assumes fpm (and possibly the compiler) are available on the compute node
  3. Assumes fpm is capable of launching parallel jobs at all
  4. Assumes parallel jobs can be launched at all (by any command) from the compute node
  5. Currently appears to have EVERY image launch the subjob
  6. Relies on process exit code propagation, which can be unreliable in loosely coupled distributed systems

I expect one or more of the above assumptions to be violated on some systems (completely breaking the Caffeine unit test) once we incorporate distributed conduits and non-trivial job spawners.

As such, we'll eventually need a "kill switch" to disable this practice, or better yet, a more robust approach to exit testing that doesn't rely on programmatically invoking fpm to spawn a sub-job.
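One possible shape for such a kill switch is an environment variable checked before the sub-job tests run. The Python sketch below is only an assumption about how this could look: the variable name `CAF_SKIP_EXIT_TESTS` and the function `exit_tests_enabled` are hypothetical and do not exist in Caffeine.

```python
import os


def exit_tests_enabled(environ=os.environ):
    """Return False when the (hypothetical) kill switch is set.

    CAF_SKIP_EXIT_TESTS is an invented name used for illustration only.
    """
    value = environ.get("CAF_SKIP_EXIT_TESTS", "0").strip().lower()
    return value not in ("1", "true", "yes")


if exit_tests_enabled():
    print("exit tests would run (sub-job launch goes here)")
else:
    print("exit tests skipped by kill switch")
```

A switch like this only sidesteps the problem on systems where the sub-job cannot be launched; it does not make the exit-code check itself any more reliable.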

bonachea, Sep 12 '24 03:09