Adam Moody
The ``JobLauncher`` interface for ``launch_run()`` is ambiguous in that some launchers require the list of nodes to run on (like ``aprun`` and ``mpirun``) while others take the list of nodes...
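One way to remove that ambiguity would be to pass both lists explicitly and let each launcher consume whichever it needs. A minimal Python sketch, where the ``up_nodes``/``down_nodes`` parameter names are assumptions for illustration rather than SCR's actual signature::

    # Hypothetical sketch: pass both node lists so each launcher can
    # use whichever it needs; not SCR's actual interface.
    from abc import ABC, abstractmethod
    import subprocess

    class JobLauncher(ABC):
        @abstractmethod
        def launch_run(self, args, up_nodes=None, down_nodes=None):
            """Launch a run, given nodes to use and nodes to avoid."""

    class Mpirun(JobLauncher):
        def launch_run(self, args, up_nodes=None, down_nodes=None):
            # mpirun wants an explicit list of nodes to run on
            cmd = ['mpirun', '--host', ','.join(up_nodes or [])] + args
            return subprocess.Popen(cmd)

    class Srun(JobLauncher):
        def launch_run(self, args, up_nodes=None, down_nodes=None):
            # srun can instead exclude the nodes known to be down
            cmd = ['srun']
            if down_nodes:
                cmd += ['--exclude', ','.join(down_nodes)]
            return subprocess.Popen(cmd + args)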
Our ``scr_run.py`` script currently launches the user job under the launcher process via ``subprocess.Popen``. There are a few challenges with this: 1) we buffer all ``stdout`` and ``stderr`` and...
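For comparison, the child's output can be streamed line by line as it is produced rather than buffered to the end. A minimal sketch, where ``srun ./my_app`` stands in for the real launcher command and merging ``stderr`` into ``stdout`` is a simplification::

    import subprocess
    import sys

    # Stream output as it arrives instead of buffering it all;
    # merging stderr into stdout keeps the example simple and
    # avoids deadlock from reading two pipes.
    proc = subprocess.Popen(
        ['srun', './my_app'],          # placeholder launcher command
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
    )
    for line in proc.stdout:
        sys.stdout.write(line)         # forward each line immediately
    ret = proc.wait()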
To study and improve async flush performance in SCR, this work extends ``test_api.c`` to execute various work kernels. By focusing on particular kinds of operations, e.g., CPU-intensive, memory-intensive, network-intensive, etc....
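To illustrate the kinds of kernels meant (the real ones would be C code compiled into ``test_api.c``), here is a rough Python sketch of a CPU-intensive and a memory-intensive loop::

    import time

    # Rough illustrations only; the actual kernels would run in C
    # between checkpoints inside test_api.c.

    def cpu_kernel(seconds):
        """Spin on arithmetic to keep the CPU busy."""
        end = time.time() + seconds
        x = 1
        while time.time() < end:
            x = (x * 1103515245 + 12345) % (1 << 31)
        return x

    def memory_kernel(nbytes):
        """Touch a large buffer to stress memory bandwidth."""
        buf = bytearray(nbytes)
        for i in range(0, nbytes, 4096):   # write one byte per page
            buf[i] = 1
        return buf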
When running with ``SCR_DEBUG=1``, SCR prints log messages to ``stdout``. It would be useful at times to direct those messages elsewhere, such as to ``stderr`` or perhaps to a user-provided file...
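The requested behavior might look like the sketch below, where ``SCR_DEBUG_FILE`` is a hypothetical setting invented here purely for illustration::

    import os
    import sys

    # SCR_DEBUG_FILE is hypothetical; today SCR writes only to stdout.
    def debug_stream():
        target = os.environ.get('SCR_DEBUG_FILE', 'stdout')
        if target == 'stdout':
            return sys.stdout
        if target == 'stderr':
            return sys.stderr
        return open(target, 'a')   # user-provided file path

    print('SCR: some debug message', file=debug_stream())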
One can delete files from the parallel file system either by calling ``SCR_Delete()`` or by setting ``SCR_PREFIX_SIZE=N``, in which case SCR maintains a sliding window of the ``N`` most recent...
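The sliding-window policy amounts to keeping the ``N`` newest checkpoints and deleting anything older. A generic sketch of that policy, not SCR's actual implementation::

    # Generic sliding-window deletion policy.
    def apply_window(checkpoints, n):
        """Given checkpoints ordered oldest to newest, return the
        ones to delete so only the N most recent remain."""
        if n <= 0 or len(checkpoints) <= n:
            return []
        return checkpoints[:-n]

    # e.g. with SCR_PREFIX_SIZE=2 and four completed checkpoints:
    print(apply_window(['ckpt.1', 'ckpt.2', 'ckpt.3', 'ckpt.4'], 2))
    # -> ['ckpt.1', 'ckpt.2']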
After writing a checkpoint to the parallel file system, a later job attempts to restart. SCR detects that the checkpoint exists, but it fails when trying to fetch the files....
SCR currently allows an application to restart with a different number of ranks. However, one cannot call the SCR restart API in that case (see https://scr.readthedocs.io/en/latest/users/integration.html#restart-without-scr). This is awkward for applications...
After an async flush has started, an application must make another SCR call to finalize that flush. Even after the async flush has copied all files, the output set is...
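The shape of the problem resembles a background copy that only becomes complete via a later, explicit call. A generic Python sketch of that pattern, illustrative only and not SCR's API::

    import shutil
    import threading

    # Generic shape of the issue: a background copy that is only
    # marked complete by a later, explicit finalize call.
    class AsyncFlush:
        def __init__(self, src, dst):
            self.complete = False
            self._thread = threading.Thread(
                target=shutil.copyfile, args=(src, dst))
            self._thread.start()

        def finalize(self):
            # Even if the copy finished long ago, the output set
            # stays incomplete until the application calls this.
            self._thread.join()
            self.complete = True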
Ongoing work to improve async flush performance and usability. Related issue: https://github.com/LLNL/scr/issues/531
The ``keras_utils.py`` code requires ``print()`` to be treated as a function.
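Under Python 2 that is achieved with the standard ``__future__`` import at the top of the module::

    # At the top of keras_utils.py, before any other statements,
    # so that print() is a function even under Python 2:
    from __future__ import print_function

    print('now a function call on both Python 2 and 3')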