signac-flow

Stampede2 bundled submissions

Open DomFijan opened this issue 4 years ago • 11 comments

Description

Submitting bundles on Stampede2 produces an incorrect submission script. This might be a template issue.

To reproduce

python3 project.py submit -n 3 -b 3 -w 0.2 --parallel --force --pretend

Error output

On Stampede2, bundling is done through ibrun, and when bundles are submitted an offset must be provided with -o. The old version of signac-flow (0.8) produced the correct command:

# equilibrate(662e116bc54baa4fbc7a00991e45c6fa)
ibrun -n 16 -o 0 task_affinity singularity exec software.simg python3 project.py exec equilibrate 662e116bc54baa4fbc7a00991e45c6fa &
# equilibrate(92dec1dd2f231126cd7aa1c57d7b324c)
ibrun -n 16 -o 16 task_affinity singularity exec software.simg python3 project.py exec equilibrate 92dec1dd2f231126cd7aa1c57d7b324c &
# equilibrate(159dee0e06c1defdd968cb473a786585)
ibrun -n 16 -o 32 task_affinity singularity exec software.simg python3 project.py exec equilibrate 159dee0e06c1defdd968cb473a786585 &

while the new version produces (pulled from master on 01/22/2021):

# equilibrate(52d99c004c293079cbc490ce1857f271)
_FLOW_STAMPEDE_OFFSET_=0 /opt/apps/intel18/python3/3.7.0/bin/python3 /scratch/07885/dfijan/TRIANGLES0_25/project.py run -o equilibrate -j 52d99c004c293079cbc490ce1857f271 &
# Eligible to run:
# ibrun -n 16 -o 0 task_affinity  singularity exec software.simg python3 /scratch/07885/dfijan/TRIANGLES0_25/project.py exec equilibrate 52d99c004c293079cbc490ce1857f271
# equilibrate(2a5a7766066be122f7c47d02df18981f)
_FLOW_STAMPEDE_OFFSET_=16 /opt/apps/intel18/python3/3.7.0/bin/python3 /scratch/07885/dfijan/TRIANGLES0_25/project.py run -o equilibrate -j 2a5a7766066be122f7c47d02df18981f &
# Eligible to run:
# ibrun -n 16 -o 0 task_affinity  singularity exec software.simg python3 /scratch/07885/dfijan/TRIANGLES0_25/project.py exec equilibrate 2a5a7766066be122f7c47d02df18981f
# equilibrate(d550f0dc75f8ba59a732cff4f9aa391d)
_FLOW_STAMPEDE_OFFSET_=32 /opt/apps/intel18/python3/3.7.0/bin/python3 /scratch/07885/dfijan/TRIANGLES0_25/project.py run -o equilibrate -j d550f0dc75f8ba59a732cff4f9aa391d &
# Eligible to run:
# ibrun -n 16 -o 0 task_affinity  singularity exec software.simg python3 /scratch/07885/dfijan/TRIANGLES0_25/project.py exec equilibrate d550f0dc75f8ba59a732cff4f9aa391d

The -o argument of ibrun is always 0 in the newest version, while it should be 0, 16, 32.
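The expected offsets follow from the bundle layout: each operation takes 16 ranks, so the i-th operation in the bundle should start at rank 16 * i. A minimal sketch of that arithmetic is below. The helper `bundle_commands` is hypothetical and is not signac-flow's template code; the command shape is copied from the 0.8 output above.

```python
def bundle_commands(job_ids, ranks_per_op, operation="equilibrate"):
    """Build one backgrounded ibrun line per job, each with its own rank offset.

    Hypothetical helper for illustration only; not part of signac-flow.
    """
    lines = []
    for index, job_id in enumerate(job_ids):
        # Each operation occupies ranks_per_op ranks, so the next one starts
        # right after the ranks used by all previous operations.
        offset = index * ranks_per_op
        lines.append(
            f"ibrun -n {ranks_per_op} -o {offset} task_affinity "
            f"singularity exec software.simg python3 project.py exec {operation} {job_id} &"
        )
    return lines


jobs = [
    "52d99c004c293079cbc490ce1857f271",
    "2a5a7766066be122f7c47d02df18981f",
    "d550f0dc75f8ba59a732cff4f9aa391d",
]
commands = bundle_commands(jobs, ranks_per_op=16)
for line in commands:
    print(line)
```

With three jobs and 16 ranks each, the 48 ranks exactly fill one 48-core node, which is why overlapping offsets would make the operations compete for the same cores.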

System configuration

  • Operating System [e.g. macOS]: Linux-3.10.0-957.5.1.el7.x86_64-x86_64-with-centos-7.6.1810-Core
  • Version of Python [e.g. 3.7]: 3.7.0 (default, Feb 6 2019, 21:24:19) [GCC Intel(R) C++ gcc 6.3 mode]
  • Version of signac [e.g. 1.0]: 1.5.1
  • Version of signac-flow: 0.11.0

Both signac and signac-flow were pulled from master on 01/22/2021 and installed with pip.

DomFijan avatar Jan 22 '21 20:01 DomFijan

It's possible that this is just the # Eligible to run: block ignoring the environment variable _FLOW_STAMPEDE_OFFSET_. The script may appear to be wrong but may actually do the right thing when executed. @DomFijan Have you checked to see if it works correctly when submitted?

bdice avatar Jan 22 '21 21:01 bdice

When I do the actual submit without --pretend I get the following error:

ERROR: Encountered error during program execution: '_call_submit() missing 1 required positional argument: 'pretend''

When I switch back to flow 0.8 submit works normally... I doubt this is connected though.

DomFijan avatar Jan 22 '21 21:01 DomFijan

@DomFijan I think I fixed the submission error in #438. Can you try that PR and verify? The offsets may still appear incorrectly. I am looking into that next.

bdice avatar Jan 23 '21 22:01 bdice

(Quoting comment from #438, which I should have written on this issue: https://github.com/glotzerlab/signac-flow/pull/438#issuecomment-766533674)

@DomFijan I resolved the issue with submission and I'm going to merge this PR immediately to prevent the problems from occurring for other testers. I expect that the issue with _FLOW_STAMPEDE_OFFSET_ still exists but I think it's only an issue in the printed output for what "should" run and is not an issue that will affect real submissions' execution. I don't know how to verify that behavior -- if you have a workflow that can test it, it would be good to check before releasing (the printed output of what "should" run was probably incorrect in previous releases as well). We can discuss that issue further in #437 now that the general submission process should be working again.

Tagging @b-butler @vyasr since I think you discussed & fixed this for Stampede2 in #250, #298.

bdice avatar Jan 25 '21 04:01 bdice

The submit is now fixed and I've submitted as you suggested. I extracted the submitted script via scontrol write batch_script and I get exactly the same output as with --pretend:

# equilibrate(52d99c004c293079cbc490ce1857f271)
_FLOW_STAMPEDE_OFFSET_=0 /opt/apps/intel18/python3/3.7.0/bin/python3 /scratch/07885/dfijan/TRIANGLES0_25/project.py run -o equilibrate -j 52d99c004c293079cbc490ce1857f271
# Eligible to run:
# ibrun -n 16 -o 0 task_affinity  singularity exec software.simg python3 /scratch/07885/dfijan/TRIANGLES0_25/project.py exec equilibrate 52d99c004c293079cbc490ce1857f271

# equilibrate(2a5a7766066be122f7c47d02df18981f)
_FLOW_STAMPEDE_OFFSET_=16 /opt/apps/intel18/python3/3.7.0/bin/python3 /scratch/07885/dfijan/TRIANGLES0_25/project.py run -o equilibrate -j 2a5a7766066be122f7c47d02df18981f
# Eligible to run:
# ibrun -n 16 -o 0 task_affinity  singularity exec software.simg python3 /scratch/07885/dfijan/TRIANGLES0_25/project.py exec equilibrate 2a5a7766066be122f7c47d02df18981f

# equilibrate(d550f0dc75f8ba59a732cff4f9aa391d)
_FLOW_STAMPEDE_OFFSET_=32 /opt/apps/intel18/python3/3.7.0/bin/python3 /scratch/07885/dfijan/TRIANGLES0_25/project.py run -o equilibrate -j d550f0dc75f8ba59a732cff4f9aa391d
# Eligible to run:
# ibrun -n 16 -o 0 task_affinity  singularity exec software.simg python3 /scratch/07885/dfijan/TRIANGLES0_25/project.py exec equilibrate d550f0dc75f8ba59a732cff4f9aa391d

The job starts normally but is very, very slow. I re-checked this with a simulation I had run previously: I reran it with the newest flow version and it shows the exact same slowdown.

Flow version in the container is 0.11.

DomFijan avatar Jan 25 '21 13:01 DomFijan

Upon reading #298 and #250, the produced script is in accordance with the fixes implemented there. However, I find it very unintuitive that the line that gets executed looks nothing like what one would expect. Perhaps adding a line that says "the above is equivalent to the below commented line" would help? Something like:

# equilibrate(d550f0dc75f8ba59a732cff4f9aa391d)
_FLOW_STAMPEDE_OFFSET_=32 /opt/apps/intel18/python3/3.7.0/bin/python3 /scratch/07885/dfijan/TRIANGLES0_25/project.py run -o equilibrate -j d550f0dc75f8ba59a732cff4f9aa391d
# THE ABOVE IS EQUIVALENT TO BELOW
# Eligible to run:
# ibrun -n 16 -o 0 task_affinity  singularity exec software.simg python3 /scratch/07885/dfijan/TRIANGLES0_25/project.py exec equilibrate d550f0dc75f8ba59a732cff4f9aa391d

DomFijan avatar Jan 25 '21 13:01 DomFijan

@bdice yes, the offsets don't get correctly incremented in pretend mode but they should when actually submitting. I don't remember the exact reason for this, but it's basically because of the different way that environment classes are used during run vs submit. The Stampede2Environment maintains an internal offset counter, which is always tacked onto the value of the environment variable and incremented within a given run loop. Since submit calls run and run forks (when using MPI), the environment variable allows communication of the current offset between the parent and child processes.

During a pretend submission, the internal offset calculation won't actually see the new environment variable. I think (this is the detail I don't 100% remember but could look into if needed) that this happens because the environment variable is only loaded when the class itself is created (or equivalently, when the module is loaded). That makes sense, since these classes are not designed to be instantiated. During pretend submission, however, no forking happens, so the class is only created once and never sees the environment variable. IIRC fixing this issue would require substantially more convoluted logic than is worth implementing.
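The read-once behavior described above can be reproduced in a few lines. This is only a sketch under the assumption that the offset is read from the environment when the class body executes; `make_environment_class` and `SketchEnvironment` are hypothetical stand-ins and not the actual Stampede2Environment implementation.

```python
import os

ENV_VAR = "_FLOW_STAMPEDE_OFFSET_"


def make_environment_class():
    """Define the class afresh, standing in for a module import in a new process."""

    class SketchEnvironment:
        # Evaluated exactly once, when the class body runs. Later changes to
        # os.environ are not picked up by this class attribute.
        base_offset = int(os.environ.get(ENV_VAR, "0"))

        @classmethod
        def ibrun_prefix(cls, nranks):
            return f"ibrun -n {nranks} -o {cls.base_offset}"

    return SketchEnvironment


# "Pretend" case: the class already exists before any offset is exported.
os.environ.pop(ENV_VAR, None)
parent_cls = make_environment_class()

for exported in (0, 16, 32):
    os.environ[ENV_VAR] = str(exported)
    # The long-lived class keeps reporting the offset it saw at creation time.
    print("same process :", parent_cls.ibrun_prefix(16))
    # A freshly created class (as in a newly started child process) sees the
    # exported value.
    print("new process  :", make_environment_class().ibrun_prefix(16))

os.environ.pop(ENV_VAR, None)
```

In this sketch the long-lived class always prints -o 0, which matches the commented "Eligible to run" lines, while a class created after the variable is exported picks up 0, 16, and 32.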

@DomFijan If I understand you correctly, you find it confusing that the command run at submission time is different from the operation you submitted, whereas the actual operation you want to submit is shown below the "Eligible to run" header. Assuming I understand you correctly, your suggestion is unfortunately not really accurate. However, I'll try to explain what's happening and maybe you can suggest alternative ways to improve our current output.

This distinction is not just a shortcoming in our current code, but a consequence of a core feature of our execution model. Say you have a sequence of operations A->B->C that are all part of a group G, and initially only A is eligible. If you submit a group G, the submission operation will be something like

python project.py run -o G -j ...
# Eligible to run:
# python project.py run -o A -j ...
# Operations with unmet preconditions:
# python project.py run -o B -j ...
# python project.py run -o C -j ...

A priori, there is no way to know whether running the first operation (A) will actually lead to B becoming eligible, so we simply print our best guess for what happens. As a result, the comment you suggest adding wouldn't really be accurate, because the two aren't equivalent. The "Eligible to run" section is the set of things that we know will run, but the "Operations with unmet preconditions" section (and the "Operations with all postconditions met", which could have postconditions invalidated by running other operations), represent operations that might run, but there's no way to know for sure at submit time because they will be reevaluated after the "Eligible to run" operations are completed. Does that make sense?
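To make the A->B->C example concrete, here is a toy model of the distinction between what can be listed at submit time and what actually runs. It is a simplified sketch, not signac-flow's scheduler: `classify`, `run_group`, and the set-based project state are all hypothetical.

```python
def classify(operations, state):
    """Split operations into those eligible now and those with unmet preconditions."""
    eligible = [name for name, pre, _ in operations if pre(state)]
    unmet = [name for name, pre, _ in operations if not pre(state)]
    return eligible, unmet


def run_group(operations, state):
    """Run operations in order, re-checking each precondition just before it runs."""
    executed = []
    for name, pre, effect in operations:
        if pre(state):
            effect(state)
            executed.append(name)
    return executed


def make_ops(a_succeeds):
    """Build the chain A -> B -> C; A only produces its output if a_succeeds."""
    return [
        ("A", lambda s: True, lambda s: s.add("a_done") if a_succeeds else None),
        ("B", lambda s: "a_done" in s, lambda s: s.add("b_done")),
        ("C", lambda s: "b_done" in s, lambda s: s.add("c_done")),
    ]


# What can be printed at submit time: only A is known to be eligible.
print(classify(make_ops(True), set()))

# What actually runs depends on the outcome of earlier operations.
print(run_group(make_ops(True), set()))   # A produces its output
print(run_group(make_ops(False), set()))  # A does not produce its output
```

The submit-time listing is identical in both scenarios, yet the executed sequence differs, which is why the comments under the submitted command can only be a best guess.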

vyasr avatar Jan 25 '21 17:01 vyasr

@bdice I agree with what Vyas just posted. Having now looked at this, I don't think it is worth the effort necessary to fix the pretend output to have more accurate offsets. It is somewhat confusing but, given our execution model, somewhat unavoidable in this case.

@DomFijan also, the line you propose adding to the submission script is not quite accurate. The submitted command is not equivalent to the lines below it. As Vyas said, what runs depends on preconditions and postconditions and, in cases like this, on environment variables, so the offsets are generated correctly at execution time even though the pretend output shows otherwise.

b-butler avatar Jan 26 '21 14:01 b-butler

Thanks for explaining @vyasr ! That does indeed make sense. I got tunnel visioned on the particular script I use. I forgot to take the grander context of flow groups into account. You are right. The thing that might be useful to have for stampede2 submissions in particular is to somehow let the user know that -o 0 in "Eligible to run" and subsequent sections is nothing to worry about when --pretend is used.

I will try and do some additional benchmarking on slow-down of execution of the code on the nodes when submitted with newest version vs. 0.8 version later today.

DomFijan avatar Jan 26 '21 14:01 DomFijan

I have tested the stampede2 submission with following versions:

  1. signac 1.2 and flow 0.8 which I shall refer to as "old"
  2. signac 1.6 and newest version of flow pulled from the github repo master which I'll refer to as "new"

Stampede2 has 48 cores per node, and nranks = 16 in the directive. I tested with the following base command:

python3 project.py submit -n 3 -b 3 -w 0.5

Jobs are executed in a container via singularity, which has flow 0.11 installed. I always tested the same 3 jobs, starting from the same starting point each time. In both the new and old versions, submission without --parallel is not allowed by signac-flow but can be forced through --force. I performed submissions with and without --parallel and --force. The results are as follows (in minutes):

ver | --force | --parallel
old | 8:10    | 3:49
new | 8:05    | 312 (this is 5 hours)

When bundling is forced without --parallel, the execution time is very similar. When using --parallel (no forcing is needed), the old version starts the runs with the correct Stampede2 (ibrun) offsets and finishes >2 times faster. The new version, however, slows down considerably. This might indicate that something is wrong with the offsets when bundling in parallel. @vyasr @b-butler Is there anything I might be doing wrong here?
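For a rough sense of scale, the timings can be converted to seconds and compared. This sketch assumes that entries such as 8:10 are minutes:seconds and that the bare 312 is whole minutes; the post only says "in minutes", so that reading is an assumption.

```python
def to_seconds(text):
    """Parse 'mm:ss' or a bare minute count into seconds (assumed format)."""
    if ":" in text:
        minutes, seconds = text.split(":")
        return int(minutes) * 60 + int(seconds)
    return int(text) * 60


# Timings copied from the results above.
timings = {
    ("old", "--force"): to_seconds("8:10"),
    ("old", "--parallel"): to_seconds("3:49"),
    ("new", "--force"): to_seconds("8:05"),
    ("new", "--parallel"): to_seconds("312"),
}

# Speedup of --parallel over forced serial bundling in the old version.
old_speedup = timings[("old", "--force")] / timings[("old", "--parallel")]

# Slowdown of the new --parallel run relative to the old --parallel run.
new_slowdown = timings[("new", "--parallel")] / timings[("old", "--parallel")]

print(f"old: --parallel is {old_speedup:.1f}x faster than --force")
print(f"new --parallel is {new_slowdown:.0f}x slower than old --parallel")
```

Under those assumptions the old --parallel speedup is a little over 2x, consistent with the ">2 times faster" observation, while the new --parallel run is slower by well over an order of magnitude.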

Another confusing issue is that when submitting jobs with --parallel, the following warning is issued by both the new and old versions:

WARNING:flow.util.template_filters:Bundled submission without MPI on Stampede2 is using launcher; the --parallel option is therefore ignored.

DomFijan avatar Jan 26 '21 17:01 DomFijan

@b-butler could you have a look at this? I just realized that we've let this slide, and I don't have time to look into it right now. Offhand I'd guess that this could be related to #270, but I haven't looked at his project operations at all to substantiate this.

vyasr avatar Feb 26 '21 20:02 vyasr