ldmx-sw

Parallelization efforts for ldmx-sw

Open tvami opened this issue 1 month ago • 2 comments

Is your feature request related to a problem? Please describe.

When submitting jobs at many of our sites, we are constrained by the number of jobs. However, these machines usually have 4-8 cores, so 100 jobs could in principle run 400-800 threads, which would significantly speed up our production.

Describe the solution you'd like

I was originally imagining multi-threading as a solution, but see the alternative below:

Describe alternatives you've considered

At the sw meeting yesterday, @tomeichlersmith suggested simply having a wrapper script that runs 4-8 instances of fire. That seems like a good alternative that certainly doesn't require as much dev time.

Additional context

With LDCS not working, we need to be smart about how to produce samples manually.

tvami • Oct 30 '25 15:10

I would like to use GNU parallel since it is perfect for our use case.

I've launched a PR to build an image with parallel included: https://github.com/LDMX-Software/dev-build-context/pull/150 (it's waiting for main to build with the dependabot updates I merged yesterday). We can also run parallel outside the image (parallel denv instead of denv parallel), which is okay for testing; launching multiple containers from the same image is slightly more costly, but only at start-up time.
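For concreteness, the two orderings would look something like this (a sketch; the {1} and {2} placeholders and the ::: input lists are standard parallel syntax, and the config.py arguments are just illustrative):

# parallel runs on the host; each job starts its own container via denv
parallel denv fire config.py {1} {2} ::: 1 2 3 ::: 4.0 8.0

# once parallel is included in the image, the same fan-out runs entirely inside it
denv parallel fire config.py {1} {2} ::: 1 2 3 ::: 4.0 8.0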

An example of what I'm imagining is this run script. I would imagine we would basically be wrapping parallel with a defined command, i.e. the script always runs parallel fire config.py, where config.py is a required argument to the script and the rest of parallel's arguments that specify the jobs are copied in from the user. This way, we don't have to re-invent the wheel; we just point folks towards parallel's excellent documentation.
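A minimal sketch of such a wrapper, assuming a script named fire-parallel that does nothing more than fix the command and hand the rest of its arguments to parallel (the name and details are illustrative, not the final script):

#!/bin/bash
# fire-parallel: run many copies of `fire <config.py>` via GNU parallel.
# The first argument is the config; everything else is forwarded to parallel
# untouched, so parallel's own syntax (::: input lists, -j, etc.) decides
# how many jobs are generated and with which arguments.
set -euo pipefail

if [ "$#" -lt 1 ]; then
  echo "usage: fire-parallel config.py [parallel arguments...]" >&2
  exit 1
fi

config="$1"
shift

exec parallel fire "$config" "$@"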

For example, to do three runs (with run numbers 1, 2, and 3) for two beam energies (4.0 and 8.0 GeV), I'm imagining you'd run the script with

fire-parallel config.py ::: 1 2 3 ::: 4.0 8.0

which would spawn the following six runs of fire

fire config.py 1 4.0
fire config.py 2 4.0
fire config.py 3 4.0
fire config.py 1 8.0
fire config.py 2 8.0
fire config.py 3 8.0

so config.py would need to be written to accept command-line arguments, which I think is a reasonable requirement for batch processing anyway.
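As a sketch of what that could look like, assuming the standard ldmxcfg pattern and argparse for the two positional arguments (the argument names and output-file naming are just illustrative):

import argparse
from LDMX.Framework import ldmxcfg

# fire passes the extra command-line arguments through to the config script
parser = argparse.ArgumentParser()
parser.add_argument('run', type=int, help='run number')
parser.add_argument('beam_energy', type=float, help='beam energy in GeV')
args = parser.parse_args()

p = ldmxcfg.Process('sim')
p.run = args.run
p.outputFiles = [f'sim_{args.beam_energy:.1f}GeV_run{args.run}.root']
# ... generator and processor configuration using args.beam_energy goes here ...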

Other options I'm thinking about including (a sketch of how some of these might map onto parallel's own flags follows the list):

  • --dry-run option for printing the jobs that would be executed
  • --log-stdout option to not redirect job logs into files and instead let parallel print them to stdout when the job completes
  • --log-dir option for where to print logs
  • --follow option for putting parallel into the background and using tail to watch its output file
  • -- separator to signal that everything after it should be passed straight through to parallel
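Most of these can probably lean directly on flags parallel already provides; a sketch of the kind of mapping I have in mind (the wrapper option names above are ours, the parallel flags are real):

# --dry-run: let parallel print the generated commands without running them
parallel --dry-run fire config.py ::: 1 2 3 ::: 4.0 8.0

# --log-dir DIR: store each job's stdout/stderr under DIR via parallel's --results
parallel --results logs/ fire config.py ::: 1 2 3 ::: 4.0 8.0

# --log-stdout: skip --results and let parallel group each job's output on stdout
parallel fire config.py ::: 1 2 3 ::: 4.0 8.0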

Note: We should add parallel to citations of future papers that use samples produced in this way.

Tange, O. (2025, April 22). GNU Parallel 20250422 ('Tariffs').
Zenodo. https://doi.org/10.5281/zenodo.15265748

or the citation for whichever version ends up installed in the image we are using.

tomeichlersmith • Oct 30 '25 15:10

Sounds good, thanks Tom!

tvami • Oct 30 '25 16:10