Integration with BatchJobs
We've had various discussions about how to provide better support for long-running jobs. It seems to me that by making use of BatchJobs, and particularly its `waitForJobs` function, we should be able to get something that works with relatively little code.
Currently we have, in `run_dsc`, the code:

```r
runScenarios(dsc, scenariosubset, seedsubset)
runMethods(dsc, scenariosubset, methodsubset, seedsubset)
runOutputParsers(dsc)
runScores(dsc, scenariosubset, methodsubset)
```
The simplest approach I can see would involve submitting a set of jobs for each of these functions, and using `waitForJobs` to wait between each set:
```
runScenarios()      # submitted via BatchJobs
waitForJobs()
runMethods()        # via BatchJobs, using a second registry of jobs this time
waitForJobs()
runOutputParsers()  # via BatchJobs, a third registry
waitForJobs()
runScores()         # via BatchJobs, a fourth registry
waitForJobs()
```
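To make this concrete, here is a minimal sketch of what the first two barriers might look like. `makeRegistry`, `batchMap`, `submitJobs`, and `waitForJobs` are real BatchJobs functions, but `one_scenario` and `one_method` are hypothetical worker functions standing in for whatever `run_dsc` would actually dispatch, and the argument plumbing is illustrative rather than a working implementation:

```r
library(BatchJobs)

# Stage 1: one job per scenario, collected in a first registry.
# one_scenario() is a hypothetical worker that runs a single scenario
# over the given seeds.
reg1 <- makeRegistry(id = "scenarios")
batchMap(reg1, one_scenario, scenariosubset,
         more.args = list(dsc = dsc, seeds = seedsubset))
submitJobs(reg1)
waitForJobs(reg1)  # global barrier: block until every scenario job finishes

# Stage 2: a second registry for the methods, created only after the
# barrier above, since the methods consume the scenarios' output.
reg2 <- makeRegistry(id = "methods")
batchMap(reg2, one_method, methodsubset,
         more.args = list(dsc = dsc, scenarios = scenariosubset,
                          seeds = seedsubset))
submitJobs(reg2)
waitForJobs(reg2)

# ...and likewise a third registry for the output parsers and a fourth
# for the scores, with a waitForJobs() barrier after each.
```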
@ramanshah is there a reason you can see that this would not work?
I've thought about this strategy. Here are my reasons for hesitation:
- Some clusters have an enormous amount of latency between job submission and job execution. I've done a lot of my past research on clusters where the wait between job submission and the beginning of job execution tends to run in the 1-4 day range. Quadrupling such latency would be painful.
- I know that in some fields (e.g. quantitative finance; Rick's experience seems to agree with this) a really slow "brute force" methodology involving some pedantic Monte Carlo simulation is often the benchmark for the cleverer methodology. This might also be true for our genomic work, but I'm not sure. A single extremely slow method in a dsc would hold up all of the faster methods.
- As you have mentioned in other issues, it is likely that input parsers and the use of multiple pre/post-processing steps could make the maximum number of global barriers (`waitForJobs()` calls) even larger.
If you feel these aren't important, we can definitely do it this way. Your suggestion is probably the simplest implementation.
I think 1 is going to depend on the cluster environment, but it isn't a problem I have come across in practice with the clusters we are using.
For 2, this scenario is indeed not out of the question, but it is easily dealt with: first run your dsc with all the fast methods, then add the slow method and run that, as sketched below.
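A hypothetical sketch of this workaround, assuming `run_dsc` exposes the same `methodsubset` argument as the internal calls quoted above (its actual signature may differ, and the names here are illustrative):

```r
# First pass: run the dsc with only the fast methods, so none of them
# is held up at a barrier by the slow one.
run_dsc(dsc, methodsubset = fast_methods)

# Second pass: run the slow method separately, on its own schedule.
run_dsc(dsc, methodsubset = "slow.brute.force")
```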
For 3, I agree, but I actually suspect that in most use cases the methods will be the rate-determining step, not the `waitForJobs()` calls on parsers etc.
I think the issue is urgent enough, and this approach simple enough, that we would be best off implementing it first, and seeing what our next bottleneck turns out to be.
Sounds good.
Probably addresses #23 as well.