Parallelization efforts for ldmx-sw
**Is your feature request related to a problem? Please describe.**
When submitting jobs at many of our sites, we are constrained by the number of jobs rather than by cores. These machines usually have 4-8 cores, so 100 jobs could in principle run 400-800 processes, which would significantly speed up our production.
**Describe the solution you'd like**
I was originally imagining multi-threading as a solution, but see the alternative below.
**Describe alternatives you've considered**
At the software meeting yesterday, @tomeichlersmith suggested simply having a wrapper script that runs 4-8 instances of fire. That seems like a good alternative, and it certainly doesn't require as much development time.
**Additional context**
With LDCS not working, we need to be smart about how we produce samples manually.
I would like to use GNU parallel since it is perfect for our use case.
I've opened a PR to build an image with parallel included (https://github.com/LDMX-Software/dev-build-context/pull/150); it's waiting for main to build with the dependabot updates I merged yesterday. We can also run parallel outside the image (`parallel denv` instead of `denv parallel`), which is fine for testing: launching multiple containers from the same image is slightly more costly, but only at start-up.
An example of what I'm imagining is this run script. We would basically be wrapping parallel with a fixed command: the script always runs `parallel fire config.py`, where config.py is a required argument to the script, and the rest of parallel's arguments specifying the jobs are copied in from the user. This way we don't have to reinvent the wheel; we just point folks toward parallel's excellent documentation.
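A minimal sketch of what I mean by "wrapping parallel with a fixed command" might look like the following. The script name `fire-parallel` and the helper `build_command` are assumptions for illustration; the only real behavior relied on is that GNU parallel appends each combination of `:::` inputs to the command it is given.

```python
#!/usr/bin/env python3
"""fire-parallel: hypothetical wrapper sketch (script name assumed).

It fixes the command prefix to ``fire <config>`` and forwards every
remaining argument straight to GNU parallel, which appends each
combination of ``:::`` inputs to the command it runs.
"""
import subprocess
import sys


def build_command(args):
    """Turn ['config.py', ':::', '1', '2'] into the full parallel invocation."""
    config, rest = args[0], args[1:]
    return ["parallel", "fire", config] + list(rest)


if __name__ == "__main__":
    if len(sys.argv) < 2:
        sys.exit("usage: fire-parallel config.py [parallel arguments...]")
    sys.exit(subprocess.run(build_command(sys.argv[1:])).returncode)
```

Everything after the config is handed to parallel untouched, so users get all of parallel's job-specification syntax for free.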
For example, to do three runs (with run numbers 1, 2, and 3) for two beam energies (4.0 and 8.0 GeV), I'm imagining you'd run the script with
```
fire-parallel config.py ::: 1 2 3 ::: 4.0 8.0
```
which would spawn the following six runs of fire:
```
fire config.py 1 4.0
fire config.py 2 4.0
fire config.py 3 4.0
fire config.py 1 8.0
fire config.py 2 8.0
fire config.py 3 8.0
```
so config.py would need to be written to accept command-line arguments, which I think is a reasonable requirement for batch processing anyway.
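For what that requirement could look like, here is a rough sketch of the argument-handling part of such a config, assuming fire exposes the arguments after the config file via `sys.argv` (the parameter names and the output-name convention are made up for illustration):

```python
# Hypothetical sketch: the argument-parsing portion of a config.py that
# accepts a run number and beam energy from the command line.
import argparse
import sys


def parse_run_args(argv):
    """Parse the positional arguments our imagined fire-parallel jobs pass in."""
    parser = argparse.ArgumentParser(prog="config.py")
    parser.add_argument("run_number", type=int, help="run number, e.g. 1")
    parser.add_argument("beam_energy", type=float, help="beam energy in GeV, e.g. 4.0")
    return parser.parse_args(argv)


# Under "fire config.py 2 8.0" this would be parse_run_args(sys.argv[1:]);
# a literal list stands in here so the sketch runs on its own.
args = parse_run_args(["2", "8.0"])
output_name = f"run{args.run_number}_{args.beam_energy}GeV.root"
```

In a real config these values would feed the process setup (run seed, output file name, beam parameters), but the parsing itself is all the wrapper idea requires.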
Other options I'm thinking about including:
- `--dry-run` option for printing the jobs that would be executed
- `--log-stdout` option to not redirect job logs into files and instead let parallel print them to stdout when each job completes
- `--log-dir` option for where to write logs
- `--follow` option for putting parallel into the background and using `tail` to watch its output file
- a `--` signal pattern to say that everything after this point should certainly be copied to parallel
Note: We should add parallel to the citations of future papers that use samples produced this way.
> Tange, O. (2025, April 22). GNU Parallel 20250422 ('Tariffs'). Zenodo. https://doi.org/10.5281/zenodo.15265748
or whatever version citation gets installed into the image we are using.
Sounds good, thanks Tom!