signac-flow
Parallelism within / between groups
Feature description
tl;dr: We need a way to control parallelism within and between groups. Parallel operation within a group would be "intra-group" and parallel operation between groups would be "inter-group." This behavior would be controlled by the `--parallel` flag.
Copied from Slack, adapted for brevity:
@bdice: I have two operations, `equilibrate` and `sample`, in a group called `simulate`. Currently only the `equilibrate` jobs are eligible to run. I want to run `equilibrate` and then `sample`, parallelized across jobs (not parallel over both operations at the same time). That is, all jobs run `equilibrate` and then run `sample` when that's done. On submission, it is requesting enough resources to run `equilibrate` and `sample` simultaneously, instead of sequentially (while still being parallel over jobs). How do I parallelize across groups but operate sequentially within groups?
@b-butler: The current implementation by design parallelizes across both jobs and operations since given only one flag there was not a way to specify only parallelizing part of the submission. We could implement more mutually exclusive flags or make `--parallel` have a default value.
Proposed solution
Proposal for API from @csadorf:
- `--parallel=none`: No parallel execution, default when no option is provided.
- `--parallel=inter`: Parallel execution across, but not within groups; default when only `--parallel` is provided.
- `--parallel=intra`: Parallel execution within groups, but not across.
- `--parallel=all`: Parallel execution within and across groups.
I would like to update my proposal:
- `--parallel=none`: No parallel execution, default when no option is provided.
- `--parallel=inter-groups`: Parallel execution between, but not within groups; default when only `--parallel` is provided.
- `--parallel=intra-groups`: Parallel execution within groups, but not between.
- `--parallel=full`: Parallel execution between and within groups.
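For illustration, the flag semantics in the updated proposal (a default when the flag is absent, a different default when it is given without a value) could be expressed with `argparse` roughly as follows. This is only a sketch: the parser, the `prog` string, and the `PARALLEL_MODES` name are made up for the example and are not signac-flow's actual command-line code.

```python
import argparse

# Mode names taken from the updated proposal; the constant name is hypothetical.
PARALLEL_MODES = ("none", "inter-groups", "intra-groups", "full")


def build_parser():
    """Build a stand-in parser that only knows the proposed --parallel flag."""
    parser = argparse.ArgumentParser(prog="project.py submit")
    parser.add_argument(
        "--parallel",
        nargs="?",             # the value after the flag is optional
        choices=PARALLEL_MODES,
        default="none",        # used when the flag is omitted entirely
        const="inter-groups",  # used when only a bare --parallel is given
        help="Control parallel execution within and between groups.",
    )
    return parser


parser = build_parser()
print(parser.parse_args([]).parallel)
print(parser.parse_args(["--parallel"]).parallel)
print(parser.parse_args(["--parallel=full"]).parallel)
```

With `nargs="?"`, `default` covers the "no option provided" case and `const` covers the bare `--parallel` case, which matches the two defaults described in the proposal.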
Hi, I would like to work on this issue.
I have talked with @ac-optimus on Slack about this issue and there are some good first steps that could be taken. Some suggestions for implementation:
- The parallelism has to be handled by both the "submit" logic (which handles parallelism between groups) and the "run" logic (which handles parallelism within a group).
- Resource requests should use `max` or `sum` appropriately. If you're running a group's operations in parallel, that group needs to request the `sum` of its resources. Likewise, a group run in serial needs the `max` of its resources. If you're running multiple groups in parallel, the total request is the `sum` of all groups' resources. Likewise, multiple groups run in serial need to request the `max` over all groups' resources. That is, `--parallel=none` should request something like `max(max(op for op in group) for group in groups)` for each resource (GPUs, number of processors, etc.). In the same way, `--parallel=inter-groups` would be `sum(max(...))`, `--parallel=intra-groups` would be `max(sum(...))`, and `--parallel=full` would be `sum(sum(...))`.
- Copy the behavior of #209 in how it implements the options as an `IntEnum`.
- Test this out by overriding the default environment with a SLURM environment and generating a script. It may be easier to check the output after we resolve #252.
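The `max`/`sum` rules in the list above can be written down as a small sketch for a single resource. Everything here is hypothetical: the `ParallelMode` enum and `total_request` function are illustrative names, not existing signac-flow code, and the enum is not claimed to match what #209 does.

```python
from enum import IntEnum


class ParallelMode(IntEnum):
    """Hypothetical option values, named after the proposed flag values."""

    NONE = 0
    INTER_GROUPS = 1
    INTRA_GROUPS = 2
    FULL = 3


def total_request(groups, mode):
    """Return the total request for one resource (e.g. number of processors).

    `groups` is a list of groups; each group is a list of per-operation
    requests for that resource.
    """
    # Within a group: sum if its operations run in parallel, otherwise max.
    within = sum if mode in (ParallelMode.INTRA_GROUPS, ParallelMode.FULL) else max
    # Between groups: sum if groups run in parallel, otherwise max.
    between = sum if mode in (ParallelMode.INTER_GROUPS, ParallelMode.FULL) else max
    return between(within(ops) for ops in groups)


# Two groups with two operations each, e.g. processor counts per operation.
groups = [[4, 2], [1, 8]]
for mode in ParallelMode:
    print(mode.name, total_request(groups, mode))
# NONE 8
# INTER_GROUPS 12
# INTRA_GROUPS 9
# FULL 15
```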
@glotzerlab/signac-committers Other suggestions for implementation are welcome. @ac-optimus Since this is fairly complicated, I would like to see a small proposal for the work to be done (which parts of the code would be edited) before beginning a pull request.
Calculating resources for this is non-trivial. How does this relate to #265?
okay sure, will update you on this very soon!
I think that this logic for directives aggregation should take two steps. We need to aggregate directives within a group according to serial or parallel, and after that we need to aggregate again with respect to the `inter-group` parallelization (or lack thereof).
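A rough sketch of that two-step idea, assuming directives are plain dictionaries of numeric resource counts: the same helper is applied once per group and then once over the per-group results. The function names and the `np`/`ngpu` keys are chosen for illustration only, and this is not how `FlowGroup` currently handles directives.

```python
def combine(directives, parallel):
    """Combine a list of directive dicts into a single dict.

    Each key is reduced with sum when the items run in parallel and with
    max when they run in serial. Missing keys count as 0.
    """
    reduce_ = sum if parallel else max
    keys = set().union(*directives)
    return {key: reduce_(d.get(key, 0) for d in directives) for key in sorted(keys)}


def aggregate(groups, intra_parallel, inter_parallel):
    """Aggregate per-operation directives over a list of groups in two steps."""
    # Step 1: aggregate the operations within each group.
    per_group = [combine(ops, intra_parallel) for ops in groups]
    # Step 2: aggregate the per-group results across groups.
    return combine(per_group, inter_parallel)


groups = [
    [{"np": 4, "ngpu": 1}, {"np": 2}],  # first group: two operations
    [{"np": 8}],                        # second group: one operation
]
# Serial within groups, parallel between groups.
print(aggregate(groups, intra_parallel=False, inter_parallel=True))
# {'ngpu': 1, 'np': 12}
```

Because both steps call the same `combine` helper, the only difference between them is the scope they are applied to.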
With respect to #265 @csadorf, maybe this issue is another reason to centralize the directives logic: doing so would reduce code surface area, and the two aggregations are identical in logic though different in scope.
I could see the implementation of this more complete aggregation logic being the first step in addressing #265 by separating the logic from the `FlowGroup` and templates (the templates would still need the total directives, but there is no point in duplicating the aggregation logic).
Agreed, refactoring the directives logic must be the first step.
And honestly, it should not be that hard. We just need to allow for customization by environments.
So would the ability to aggregate directives be found in the `ComputeEnvironment` class?
Or at least it should be associated with a default entity.
Yes, refactoring directives seems like a good place to start.
(To clarify, the use of the word "aggregation" here in the context of determining resource requests over a set of flow groups is not related to the "aggregation" feature discussed in #52.)