UncertaintyQuantification.jl

Add batch support for SLURM job arrays

Open FriesischScott opened this issue 1 year ago • 1 comments

This PR adds a new batchsize option to the SlurmInterface, with a default of 0. If it is left at 0, everything behaves as it does now. When a batch size is specified, the current setup is kept, but ceil(n / batchsize) job arrays are run sequentially before the complete results are collected and the analysis continues. This is implemented by passing a batch id as the last argument to the functions that generate and execute HPC jobs.
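
To illustrate the batching arithmetic, here is a minimal sketch of how n samples could be split into ceil(n / batchsize) index ranges. The helper name batch_ranges is hypothetical and not part of the package; the actual implementation in the PR may organise this differently.

```julia
# Hypothetical helper for illustration only; not the package's API.
function batch_ranges(n::Integer, batchsize::Integer)
    # batchsize == 0 mirrors the default: one job array covering all samples
    batchsize == 0 && return [1:n]
    nbatches = cld(n, batchsize)  # integer form of ceil(n / batchsize)
    # The last batch is truncated so it never runs past sample n
    return [((b - 1) * batchsize + 1):min(b * batchsize, n) for b in 1:nbatches]
end

ranges = batch_ranges(10, 4)
for (batch, r) in enumerate(ranges)
    println("batch ", batch, ": samples ", first(r), " to ", last(r))
end
```

With n = 10 and a batch size of 4 this gives three batches, the last one holding only the two remaining samples.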

FriesischScott avatar Aug 12 '24 12:08 FriesischScott

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 95.12%. Comparing base (28b0867) to head (ec650d4). Report is 3 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #184      +/-   ##
==========================================
+ Coverage   94.74%   95.12%   +0.38%     
==========================================
  Files          35       35              
  Lines        1465     1478      +13     
==========================================
+ Hits         1388     1406      +18     
+ Misses         77       72       -5     


codecov[bot] avatar Aug 12 '24 13:08 codecov[bot]

I'm testing it on our cluster now to see if we can run 10000 samples in batches of 300 🤞.

FriesischScott avatar Aug 26 '24 16:08 FriesischScott

I generalized the HPC functions a bit more. An interface for any future HPC scheduler only needs to override setup_hpc_jobs and run_hpc_jobs, and that's it.
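
As a rough sketch of what this extension pattern could look like: the function names setup_hpc_jobs and run_hpc_jobs come from the PR, but the abstract type, the ToyScheduler struct, the argument lists, and the run_all driver below are assumptions made for illustration. The real signatures in the package may differ.

```julia
# Illustrative only: types, arguments, and the driver are assumed, not the package's API.
abstract type AbstractHPCInterface end

struct ToyScheduler <: AbstractHPCInterface end

# Prepare the jobs for one batch; here we only build a description string.
# The batch id is passed as the last argument, as described in the PR.
function setup_hpc_jobs(::ToyScheduler, n::Integer, batchsize::Integer, batch::Integer)
    lo = (batch - 1) * batchsize + 1
    hi = min(batch * batchsize, n)
    return "array=$(lo)-$(hi)"
end

# Submit one batch; a real scheduler would call its submission command and wait.
function run_hpc_jobs(::ToyScheduler, jobs::AbstractString, batch::Integer)
    return "batch $(batch): submitted $(jobs)"
end

# Generic driver that relies only on the two methods above (assumes batchsize > 0).
function run_all(s::AbstractHPCInterface, n::Integer, batchsize::Integer)
    nbatches = cld(n, batchsize)
    return [run_hpc_jobs(s, setup_hpc_jobs(s, n, batchsize, b), b) for b in 1:nbatches]
end

submissions = run_all(ToyScheduler(), 10_000, 300)
println(length(submissions), " batches; last: ", submissions[end])
```

Under these assumptions, a new scheduler only adds its own struct and two methods, while the batching loop stays shared.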

FriesischScott avatar Aug 26 '24 16:08 FriesischScott

A very useful feature if your scheduler limits the number of job submissions.

I added a couple of warnings about having /test/test_utilities in path, which I believe could cause failures when testing locally.

I also added an option to test the package on an HPC system using actual Slurm:

julia --project
using Pkg
Pkg.test(;test_args=["HPC", "YOUR_ACCOUNT", "YOUR_PARTITION"])

if you have the package cloned locally. Or if you have it installed from the registry, I believe the following works:

using Pkg
Pkg.test("UncertaintyQuantification"; test_args=["HPC", "YOUR_ACCOUNT", "YOUR_PARTITION"])

I also moved the work directory of the generated simulation to someplace local, as I was getting weird behaviour with temp directories.

AnderGray avatar Sep 24 '24 14:09 AnderGray