flux-docs
Add examples of bootstrapping under sbatch
As @Larofeticus pointed out, most (all?) of our examples involve working with Flux in an interactive manner. In particular, they all use salloc to grab a set of nodes and then invoke Flux commands interactively. It would be instructive to have an example where we create a script that bootstraps Flux (and invokes a Flux initial program) and we submit that script with sbatch to show how the whole workflow would work in batch mode.
Perhaps we can repurpose our batch job examples for the workflow example section: https://flux-framework.readthedocs.io/en/latest/batch.html
Thanks @dongahn! That's almost exactly what was needed. As a starter, we should give the batch script in that example a name and include the sbatch ./scriptname.sh line explicitly. That way, if someone searches for sbatch, the example will pop up. @Larofeticus searched for sbatch, but the search unfortunately didn't match on the #SBATCH pragma.
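To make that concrete, here is a minimal sketch of what the named script plus an explicit submit line could look like. The node count, time limit, script names, and the use of --mpi=none are placeholder assumptions for illustration, not taken from the existing docs:

```sh
#!/bin/bash
#SBATCH --nodes=2            # placeholder sizing; adjust for your site
#SBATCH --time=00:30:00

# Bootstrap one Flux broker per node across the Slurm allocation and
# hand the resulting Flux instance an initial program to run.
# ./my_workflow.sh is a hypothetical name for the user's workflow script.
srun -N2 -n2 --mpi=none flux start ./my_workflow.sh
```

And the submit line, spelled out so that a search for sbatch matches:

```sh
sbatch ./bootstrap_flux.sh
```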
@Larofeticus and @SteVwonder:
We discussed this at one of our coffee hours. We can easily add this example, but we would also be very happy to work with the user (@Larofeticus) if an initial PR is proposed based on the existing examples at https://flux-framework.readthedocs.io/en/latest/batch.html
My opinion is that users know better than developers what they want to see on a page.
Let me know what you think.
I'm happy to help the process here.
I suppose the first specific thing I've found is that --mpibind=none is not a valid srun flag on Cori.
The generalized form of that is a gentle reminder to avoid including site-specific features in the documentation. Balsam had this problem to a much greater degree: it was tightly coupled to Cobalt and to specific Argonne machine configurations.
> I'm happy to help the process here.

Great!

> I suppose the first specific thing I've found is that --mpibind=none is not a valid srun flag on Cori.

Ah... yes, that is a Livermore-specific flag!
Maybe to help get your feet wet: if you propose a small PR with your fix for a single topic (e.g., a site-specificity note for --mpibind=none), we will review it and help get it merged.
One piece of information I don't have is: "What is the consequence of removing that flag when using a system that does have mpibind?" Does the example still work as intended?
If the site uses other ways to bind Flux brokers to a subset of cores/GPUs, then no, the examples won't work as intended: Flux will only schedule that subset of resources.
Does NERSC have an srun option to ensure binding is not happening? Or does srun not bind at all by default? If the latter, Flux should work out of the box without --mpibind=none.
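For what it's worth, stock Slurm does offer a generic --cpu-bind=none option that disables CPU binding for a job step. Whether it is the right substitute for --mpibind=none on Cori is an assumption to verify with NERSC, but the invocation might look like:

```sh
# --cpu-bind=none is a standard Slurm srun option that disables CPU
# binding for the step; whether it is needed (or sufficient) depends
# on the site's default binding policy. ./my_workflow.sh is the same
# hypothetical workflow script as above.
srun -N2 -n2 --cpu-bind=none flux start ./my_workflow.sh
```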
Suggestion: as part of the docs, include a "sanity check" command that a new user can run to verify that Flux has discovered all expected resources (e.g., run flux resource list; if it appears that Flux has not discovered all expected resources, the native launcher on your system may have restricted the resources available to the Flux broker processes. Check for site-specific options such as --mpibind and be sure to disable them).
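A sketch of how that sanity check might read in the docs (commands only; no output is shown because it would vary by system):

```sh
# From within the Flux instance, list the resources the brokers discovered:
flux resource list

# Compare the node/core counts against the Slurm allocation. If they are
# lower than expected, a site-specific binding option (e.g. --mpibind on
# LLNL systems) may have restricted what the brokers can see.
```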