signac-flow Parallelism within / between groups

Feature description

tl;dr: We need a way to control parallelism within and between groups. Parallel operation within a group would be "intra-group" and parallel operation between groups would be "inter-group." This behavior would be controlled by the --parallel flag.

Copied from Slack, adapted for brevity:

@bdice: I have two operations, equilibrate and sample, in a group called simulate. Currently only the equilibrate jobs are eligible to run. I want to run equilibrate and then sample, parallelized across jobs (not parallel over both operations at the same time). That is, all jobs run equilibrate and then run simulate when that's done. On submission, it is requesting enough resources to run equilibrate and sample simultaneously, instead of sequentially (while still being parallel over jobs). How do I parallelize across groups but operate sequentially within groups?

@b-butler: The current implementation by design parallelizes across both jobs and operations since given only one flag there was not a way to specify only parallelizing part of the submission. We could implement more mutually exclusive flags or make --parallel have a default value.

Proposed solution

Proposal for API from @csadorf: --parallel=none: No parallel execution, default when no option is provided. --parallel=inter: Parallel execution across, but not within groups; default when only --parallel is provided. --parallel=intra: Parallel execution within groups, but not across. --parallel=all: Parallel execution within and across groups.

Mar 11 '20 15:03 bdice

I would like to update my proposal:

--parallel=none: No parallel execution, default when no option is provided.
--parallel=inter-groups: Parallel execution between, but not within groups; default when only --parallel is provided.
--parallel=intra-groups: Parallel execution within groups, but not between.
--parallel=full: Parallel execution between and within groups.

Mar 11 '20 17:03 csadorf

hi, I would like to work on this issue?

Mar 11 '20 17:03 ac-optimus

I have talked with @ac-optimus on Slack about this issue and there are some good first steps that could be taken. Some suggestions for implementation:

The parallelism has to be handled by both the "submit" logic (which handles parallelism between groups) and the "run" logic (which handles parallelism within a group).
Resources requests should use max or sum appropriately. If you're running a group's operations in parallel, that group needs to request the sum of its resources. Likewise, a group run in serial needs the max of its resources. If you're running multiple groups in parallel, the total request is the sum of all groups' resources. Likewise, multiple groups run in serial need to request the max over all groups' resources. That is, --parallel=none should request something like max(max(op for op in group) for group in groups) for each resource (GPUs, number of processors, etc.). In the same way, --parallel=inter-groups would be sum(max(...)), --parallel=intra-groups would be max(sum(...)), --parallel=full would be sum(sum(...)).
Copy the behavior of #209 in how it implements the options as an IntEnum.
Test this out by overriding the default environment with a SLURM environment and generating a script. It may be easier to check the output after we resolve #252.

Mar 11 '20 18:03 bdice

@glotzerlab/signac-committers Other suggestions for implementation are welcome. @ac-optimus Since this is fairly complicated, I would like to see a small proposal for the work to be done (which parts of the code would be edited) before beginning a pull request.

Mar 11 '20 18:03 bdice

Calculating resources for this is non-trivial. How does this relate to #265 ?

Mar 11 '20 18:03 csadorf

@glotzerlab/signac-committers Other suggestions for implementation are welcome. @ac-optimus Since this is fairly complicated, I would like to see a small proposal for the work to be done (which parts of the code would be edited) before beginning a pull request.

okay sure, will update you on this very soon!

Mar 11 '20 18:03 ac-optimus

I think that this logic for directives aggregation should take two steps. We need to aggregate directives within a group according to serial or parallel, and after that we need to aggregate again with respect to the inter-group parallelization (or lack thereof).

With respect to #265 @csadorf, maybe this issue is another reason to centralize this directives logic to reduce code surface area and the fact that the two aggregations are identical though different in scope.

I could see the implementation of this more complete aggregation logic being the first step in addressing #265 by separating the logic from the FlowGroup and templates (the templates would still need the total directives, but there is no point in duplicating the aggregation logic).

Mar 11 '20 18:03 b-butler

Agreed, refactoring the directives logic must be the first step.

Mar 11 '20 18:03 csadorf

And honestly, it should not be that hard. We just need to allow for customization by environments.

Mar 11 '20 18:03 csadorf

So would the ability to aggregate directives be found in the ComputeEnvironment class?

Mar 11 '20 18:03 b-butler

Or at least it should be associated with a default entity.

Mar 11 '20 18:03 csadorf

Yes, refactoring directives seems like a good place to start.

(To clarify, the use of the word "aggregation" here in the context of determining resource requests over a set of flow groups is not related to the "aggregation" feature discussed in #52.)

Mar 11 '20 18:03 bdice

signac-flow signac-flow copied to clipboard

Parallelism within / between groups

Feature description

Proposed solution

signac-flow
signac-flow copied to clipboard