
long jobs launched w/o srun confuse maestro

Open doutriaux1 opened this issue 4 years ago • 7 comments

I have a study that starts a bunch of Python jobs. The maestro run part went fine. Then I issued a maestro status and it told me something probably went wrong. Eventually I realized some of my jobs were actually running, just slowly.

Also, the study has 2 sets of jobs that it can start independently at the beginning. But somehow the second set is not even generating the command-line scripts, apparently waiting for the first set of jobs to finish.

Screen output

[2020-06-04 07:06:57: WARNING] WARNING Logging Level -- Enabled
[2020-06-04 07:06:57: CRITICAL] CRITICAL Logging Level -- Enabled
[2020-06-04 07:06:57: INFO] Loading specification -- path = process_hohlraum_post.yaml
[2020-06-04 07:06:57: INFO] Directory does not exist. Creating directories to /p/lustre1/cdoutrix/ALE/Hohlraum/generate_hohlraum_20200604-070657/logs
[2020-06-04 07:06:57: INFO] Loading custom parameter generator from '/usr/WS1/aml_cs/ALE/LAGER/data-generation/Hohlraum/maestro_custom_generator.py'
P: LASPOWERMULT [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
P: MINIMALALE [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
P: RESOLUTION [0.15, 0.15, 0.25, 0.25, 0.5, 0.5, 0.15, 0.15, 0.25, 0.25, 0.5, 0.5]
P: DOMAIN [18, 18, 72, 72, 288, 288, 18, 18, 72, 72, 288, 288]
P: RHO_FOAM2 [2e-06, 0.35, 2e-06, 0.35, 2e-06, 0.35, 2e-06, 0.35, 2e-06, 0.35, 2e-06, 0.35]
P: PROC [72, 72, 288, 288, 1152, 1152, 72, 72, 288, 288, 1152, 1152]
P: NODES [2, 2, 8, 8, 32, 32, 2, 2, 8, 8, 32, 32]
P: PROCS_XENA [18, 18, 36, 36, 72, 72, 18, 18, 36, 36, 72, 72]
P: NODES_XENA [1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 2, 2]
[2020-06-04 07:06:58: INFO] Adding step 'nodes' to study 'generate_hohlraum'...
[2020-06-04 07:06:58: INFO] Adding step 'relax' to study 'generate_hohlraum'...
[2020-06-04 07:06:58: INFO] relax is dependent on nodes. Creating edge (nodes, relax)...
[2020-06-04 07:06:58: INFO] Adding step 'zones' to study 'generate_hohlraum'...
[2020-06-04 07:06:58: INFO] zones is dependent on relax. Creating edge (relax, zones)...
[2020-06-04 07:06:58: INFO] Adding step 'movies' to study 'generate_hohlraum'...
[2020-06-04 07:06:58: INFO] Adding step 'kosh' to study 'generate_hohlraum'...
[2020-06-04 07:06:58: INFO] kosh is dependent on nodes. Creating edge (nodes, kosh)...
[2020-06-04 07:06:58: INFO] kosh is dependent on relax. Creating edge (relax, kosh)...
[2020-06-04 07:06:58: INFO] kosh is dependent on zones. Creating edge (zones, kosh)...
[2020-06-04 07:06:58: INFO] kosh is dependent on movies. Creating edge (movies, kosh)...
[2020-06-04 07:06:58: INFO] Adding step 'directory_permissions' to study 'generate_hohlraum'...
[2020-06-04 07:06:58: INFO] directory_permissions is dependent on zones. Creating edge (zones, directory_permissions)...
[2020-06-04 07:06:58: INFO] directory_permissions is dependent on movies. Creating edge (movies, directory_permissions)...
[2020-06-04 07:06:58: INFO]
------------------------------------------
Submission attempts =       1
Submission restart limit =  1
Submission throttle limit = 0
Use temporary directory =   False
Hash workspaces =           False
Dry run enabled =           False
Output path =               /p/lustre1/cdoutrix/ALE/Hohlraum/generate_hohlraum_20200604-070657
------------------------------------------
Would you like to launch the study? [yn] y
Study launched successfully.
(kosh) [cdoutrix@rztopaz188:Hohlraum]$ maestro status /p/lustre1/cdoutrix/ALE/Hohlraum/generate_hohlraum_20200604-070657
[2020-06-04 07:07:52: INFO] INFO Logging Level -- Enabled
[2020-06-04 07:07:52: WARNING] WARNING Logging Level -- Enabled
[2020-06-04 07:07:52: CRITICAL] CRITICAL Logging Level -- Enabled
Status check failed. If the issue persists, please verify thatyou are passing in a path to a study.

ps command that shows jobs are actually running

(kosh) [cdoutrix@rztopaz188:Hohlraum]$ ps -aux | grep cdou
root      8039  0.0  0.0 154492  5304 ?        Ss   06:46   0:00 sshd: cdoutrix [priv]
cdoutrix  8066  0.0  0.0 156576  3120 ?        S    06:46   0:00 sshd: cdoutrix@pts/29
cdoutrix  8067  0.0  0.0 119808  6376 pts/29   Ss   06:46   0:00 -bash
cdoutrix 16004  0.0  0.0   9576  1144 pts/29   S    07:02   0:00 /bin/sh -c nohup conductor -t 60 -d 2 /p/lustre1/cdoutrix/ALE/Hohlraum/generate_hohlraum_20200604-070213 > /p/lustre1/cdoutrix/ALE/Hohlraum/generate_hohlraum_20200604-070213/generate_hohlraum.txt 2>&1
cdoutrix 16005  0.1  0.0  69980 27160 pts/29   S    07:02   0:00 /g/g19/cdoutrix/miniconda3/envs/kosh/bin/python /g/g19/cdoutrix/miniconda3/envs/kosh/bin/conductor -t 60 -d 2 /p/lustre1/cdoutrix/ALE/Hohlraum/generate_hohlraum_20200604-070213
cdoutrix 17855  0.0  0.0   9576  1144 pts/29   S    07:07   0:00 /bin/sh -c nohup conductor -t 60 -d 2 /p/lustre1/cdoutrix/ALE/Hohlraum/generate_hohlraum_20200604-070657 > /p/lustre1/cdoutrix/ALE/Hohlraum/generate_hohlraum_20200604-070657/generate_hohlraum.txt 2>&1
cdoutrix 17856  0.4  0.0  69984 30944 pts/29   S    07:07   0:00 /g/g19/cdoutrix/miniconda3/envs/kosh/bin/python /g/g19/cdoutrix/miniconda3/envs/kosh/bin/conductor -t 60 -d 2 /p/lustre1/cdoutrix/ALE/Hohlraum/generate_hohlraum_20200604-070657
cdoutrix 18847  0.0  0.0   9584  1504 pts/29   S    07:08   0:00 /bin/bash /p/lustre1/cdoutrix/ALE/Hohlraum/generate_hohlraum_20200604-070213/movies/domain_72.laserPowerMult_1.0.minimalAle_0.NODES_8.NODES_XENA_1.PROC_288.PROCS_XENA_36.meshResolution_0.25.foamDensity_2e-06/movies_domain_72.laserPowerMult_1.0.minimalAle_0.NODES_8.NODES_XENA_1.PROC_288.PROCS_XENA_36.meshResolution_0.25.foamDensity_2e-06.slurm.sh
cdoutrix 18884  0.1  0.0  52044  8456 pts/29   S    07:08   0:00 /usr/gapps/visit/bin/../current/linux-x86_64/bin/python /usr/gapps/visit/bin/frontendlauncher.py /usr/gapps/visit/bin/visit -nowin -no-launch-x -cli -s /usr/workspace/aml_cs/ALE/LAGER/data-generation/Hohlraum/view_variable_visit_at_time_or_cycle.py --root /p/lustre1/cdoutrix/ALE/Hohlraum -r HWH_domain_72.laserPowerMult_1.0.minimalAle_0.NODES_8.NODES_XENA_1.PROC_288.PROCS_XENA_36.meshResolution_0.25.foamDensity_2e-06 --silo_dir=HWH_domain_72.laserPowerMult_1.0.minimalAle_0.NODES_8.NODES_XENA_1.PROC_288.PROCS_XENA_36.meshResolution_0.25.foamDensity_2e-06-288/VIZ --world_coordinates=(0.45, 0.62, 0.17, 0.315) --variables Material Pressure Temperature --state=-3 --fixed_boundaries --annotations
cdoutrix 18894  0.5  0.0 384472 32472 pts/29   Sl   07:08   0:00 /usr/gapps/visit/3.1.2/linux-x86_64/bin/cli -nowin -no-launch-x -s /usr/workspace/aml_cs/ALE/LAGER/data-generation/Hohlraum/view_variable_visit_at_time_or_cycle.py --root /p/lustre1/cdoutrix/ALE/Hohlraum -r HWH_domain_72.laserPowerMult_1.0.minimalAle_0.NODES_8.NODES_XENA_1.PROC_288.PROCS_XENA_36.meshResolution_0.25.foamDensity_2e-06 --silo_dir=HWH_domain_72.laserPowerMult_1.0.minimalAle_0.NODES_8.NODES_XENA_1.PROC_288.PROCS_XENA_36.meshResolution_0.25.foamDensity_2e-06-288/VIZ --world_coordinates=(0.45, 0.62, 0.17, 0.315) --variables Material Pressure Temperature --state=-3 --fixed_boundaries --annotations
cdoutrix 18895  0.1  0.0  52044  8472 pts/29   S    07:08   0:00 /usr/gapps/visit/bin/../current/linux-x86_64/bin/python /usr/gapps/visit/bin/frontendlauncher.py /usr/gapps/visit/bin/visit -v 3.1 -viewer -noint -nowin -no-launch-x --root /p/lustre1/cdoutrix/ALE/Hohlraum -r HWH_domain_72.laserPowerMult_1.0.minimalAle_0.NODES_8.NODES_XENA_1.PROC_288.PROCS_XENA_36.meshResolution_0.25.foamDensity_2e-06 --silo_dir=HWH_domain_72.laserPowerMult_1.0.minimalAle_0.NODES_8.NODES_XENA_1.PROC_288.PROCS_XENA_36.meshResolution_0.25.foamDensity_2e-06-288/VIZ --world_coordinates=(0.45, 0.62, 0.17, 0.315) --variables Material Pressure Temperature --state=-3 --fixed_boundaries --annotations -host rztopaz188.llnl.gov -port 5600 -key e8b214d3ea04131074e6
cdoutrix 18905  3.1  0.0 963288 94452 pts/29   Sl   07:08   0:01 /usr/gapps/visit/3.1.2/linux-x86_64/bin/viewer -nowin -noint -no-launch-x --root /p/lustre1/cdoutrix/ALE/Hohlraum -r HWH_domain_72.laserPowerMult_1.0.minimalAle_0.NODES_8.NODES_XENA_1.PROC_288.PROCS_XENA_36.meshResolution_0.25.foamDensity_2e-06 --silo_dir=HWH_domain_72.laserPowerMult_1.0.minimalAle_0.NODES_8.NODES_XENA_1.PROC_288.PROCS_XENA_36.meshResolution_0.25.foamDensity_2e-06-288/VIZ --world_coordinates=(0.45, 0.62, 0.17, 0.315) --variables Material Pressure Temperature --state=-3 --fixed_boundaries --annotations -host 127.0.0.1 -port 5600 -key e8b214d3ea04131074e6
cdoutrix 18925  0.1  0.0  54172  8556 pts/29   S    07:08   0:00 /usr/gapps/visit/bin/../current/linux-x86_64/bin/python /usr/gapps/visit/bin/frontendlauncher.py /usr/gapps/visit/bin/visit -v 3.1 -mdserver -no-launch-x --root /p/lustre1/cdoutrix/ALE/Hohlraum -r HWH_domain_72.laserPowerMult_1.0.minimalAle_0.NODES_8.NODES_XENA_1.PROC_288.PROCS_XENA_36.meshResolution_0.25.foamDensity_2e-06 --silo_dir=HWH_domain_72.laserPowerMult_1.0.minimalAle_0.NODES_8.NODES_XENA_1.PROC_288.PROCS_XENA_36.meshResolution_0.25.foamDensity_2e-06-288/VIZ --world_coordinates=(0.45, 0.62, 0.17, 0.315) --variables Material Pressure Temperature --state=-3 --fixed_boundaries --annotations -host rztopaz188.llnl.gov -port 5600 -key e8b214d3ea04131074e6
cdoutrix 18935  1.4  0.0 683016 67852 pts/29   S    07:08   0:00 /usr/gapps/visit/3.1.2/linux-x86_64/bin/mdserver -no-launch-x --root /p/lustre1/cdoutrix/ALE/Hohlraum -r HWH_domain_72.laserPowerMult_1.0.minimalAle_0.NODES_8.NODES_XENA_1.PROC_288.PROCS_XENA_36.meshResolution_0.25.foamDensity_2e-06 --silo_dir=HWH_domain_72.laserPowerMult_1.0.minimalAle_0.NODES_8.NODES_XENA_1.PROC_288.PROCS_XENA_36.meshResolution_0.25.foamDensity_2e-06-288/VIZ --world_coordinates=(0.45, 0.62, 0.17, 0.315) --variables Material Pressure Temperature --state=-3 --fixed_boundaries --annotations -host 127.0.0.1 -port 5600 -key e8b214d3ea04131074e6
cdoutrix 18937  0.1  0.0  52048  8484 pts/29   S    07:08   0:00 /usr/gapps/visit/bin/../current/linux-x86_64/bin/python /usr/gapps/visit/bin/frontendlauncher.py /usr/gapps/visit/bin/visit -v 3.1 -engine -dir /usr/gapps/visit -idle-timeout 480 -no-launch-x --root /p/lustre1/cdoutrix/ALE/Hohlraum -r HWH_domain_72.laserPowerMult_1.0.minimalAle_0.NODES_8.NODES_XENA_1.PROC_288.PROCS_XENA_36.meshResolution_0.25.foamDensity_2e-06 --silo_dir=HWH_domain_72.laserPowerMult_1.0.minimalAle_0.NODES_8.NODES_XENA_1.PROC_288.PROCS_XENA_36.meshResolution_0.25.foamDensity_2e-06-288/VIZ --world_coordinates=(0.45, 0.62, 0.17, 0.315) --variables Material Pressure Temperature --state=-3 --fixed_boundaries --annotations -host rztopaz188.llnl.gov -port 5600 -key e91f7d88fb410c1595f3
cdoutrix 18947  9.7  0.0 669652 116872 pts/29  Sl   07:08   0:04 /usr/gapps/visit/3.1.2/linux-x86_64/bin/engine_ser -dir /usr/gapps/visit -idle-timeout 480 -no-launch-x --root /p/lustre1/cdoutrix/ALE/Hohlraum -r HWH_domain_72.laserPowerMult_1.0.minimalAle_0.NODES_8.NODES_XENA_1.PROC_288.PROCS_XENA_36.meshResolution_0.25.foamDensity_2e-06 --silo_dir=HWH_domain_72.laserPowerMult_1.0.minimalAle_0.NODES_8.NODES_XENA_1.PROC_288.PROCS_XENA_36.meshResolution_0.25.foamDensity_2e-06-288/VIZ --world_coordinates=(0.45, 0.62, 0.17, 0.315) --variables Material Pressure Temperature --state=-3 --fixed_boundaries --annotations -host 127.0.0.1 -port 5600 -key e91f7d88fb410c1595f3
cdoutrix 19076  0.0  0.0   9584  1504 pts/29   S    07:09   0:00 /bin/bash /p/lustre1/cdoutrix/ALE/Hohlraum/generate_hohlraum_20200604-070657/movies/domain_18.laserPowerMult_1.0.minimalAle_0.NODES_2.NODES_XENA_1.PROC_72.PROCS_XENA_18.meshResolution_0.15.foamDensity_0.35/movies_domain_18.laserPowerMult_1.0.minimalAle_0.NODES_2.NODES_XENA_1.PROC_72.PROCS_XENA_18.meshResolution_0.15.foamDensity_0.35.slurm.sh
cdoutrix 19079  0.1  0.0  52044  8456 pts/29   S    07:09   0:00 /usr/gapps/visit/bin/../current/linux-x86_64/bin/python /usr/gapps/visit/bin/frontendlauncher.py /usr/gapps/visit/bin/visit -nowin -no-launch-x -cli -s /usr/workspace/aml_cs/ALE/LAGER/data-generation/Hohlraum/view_variable_visit_at_time_or_cycle.py --root /p/lustre1/cdoutrix/ALE/Hohlraum -r HWH_domain_18.laserPowerMult_1.0.minimalAle_0.NODES_2.NODES_XENA_1.PROC_72.PROCS_XENA_18.meshResolution_0.15.foamDensity_0.35 --silo_dir=HWH_domain_18.laserPowerMult_1.0.minimalAle_0.NODES_2.NODES_XENA_1.PROC_72.PROCS_XENA_18.meshResolution_0.15.foamDensity_0.35-72/VIZ --world_coordinates=(0.45, 0.62, 0.17, 0.315) --variables Material Pressure Temperature --state=-3 --fixed_boundaries --annotations
cdoutrix 19089  1.6  0.0 384472 32476 pts/29   Sl   07:09   0:00 /usr/gapps/visit/3.1.2/linux-x86_64/bin/cli -nowin -no-launch-x -s /usr/workspace/aml_cs/ALE/LAGER/data-generation/Hohlraum/view_variable_visit_at_time_or_cycle.py --root /p/lustre1/cdoutrix/ALE/Hohlraum -r HWH_domain_18.laserPowerMult_1.0.minimalAle_0.NODES_2.NODES_XENA_1.PROC_72.PROCS_XENA_18.meshResolution_0.15.foamDensity_0.35 --silo_dir=HWH_domain_18.laserPowerMult_1.0.minimalAle_0.NODES_2.NODES_XENA_1.PROC_72.PROCS_XENA_18.meshResolution_0.15.foamDensity_0.35-72/VIZ --world_coordinates=(0.45, 0.62, 0.17, 0.315) --variables Material Pressure Temperature --state=-3 --fixed_boundaries --annotations
cdoutrix 19090  0.1  0.0  52044  8472 pts/29   S    07:09   0:00 /usr/gapps/visit/bin/../current/linux-x86_64/bin/python /usr/gapps/visit/bin/frontendlauncher.py /usr/gapps/visit/bin/visit -v 3.1 -viewer -noint -nowin -no-launch-x --root /p/lustre1/cdoutrix/ALE/Hohlraum -r HWH_domain_18.laserPowerMult_1.0.minimalAle_0.NODES_2.NODES_XENA_1.PROC_72.PROCS_XENA_18.meshResolution_0.15.foamDensity_0.35 --silo_dir=HWH_domain_18.laserPowerMult_1.0.minimalAle_0.NODES_2.NODES_XENA_1.PROC_72.PROCS_XENA_18.meshResolution_0.15.foamDensity_0.35-72/VIZ --world_coordinates=(0.45, 0.62, 0.17, 0.315) --variables Material Pressure Temperature --state=-3 --fixed_boundaries --annotations -host rztopaz188.llnl.gov -port 5600 -key 631187ca1553b6774f25
cdoutrix 19102 13.7  0.0 968912 100192 pts/29  Rl   07:09   0:04 /usr/gapps/visit/3.1.2/linux-x86_64/bin/viewer -nowin -noint -no-launch-x --root /p/lustre1/cdoutrix/ALE/Hohlraum -r HWH_domain_18.laserPowerMult_1.0.minimalAle_0.NODES_2.NODES_XENA_1.PROC_72.PROCS_XENA_18.meshResolution_0.15.foamDensity_0.35 --silo_dir=HWH_domain_18.laserPowerMult_1.0.minimalAle_0.NODES_2.NODES_XENA_1.PROC_72.PROCS_XENA_18.meshResolution_0.15.foamDensity_0.35-72/VIZ --world_coordinates=(0.45, 0.62, 0.17, 0.315) --variables Material Pressure Temperature --state=-3 --fixed_boundaries --annotations -host 127.0.0.1 -port 5600 -key 631187ca1553b6774f25
cdoutrix 19148  0.2  0.0  54172  8556 pts/29   S    07:09   0:00 /usr/gapps/visit/bin/../current/linux-x86_64/bin/python /usr/gapps/visit/bin/frontendlauncher.py /usr/gapps/visit/bin/visit -v 3.1 -mdserver -no-launch-x --root /p/lustre1/cdoutrix/ALE/Hohlraum -r HWH_domain_18.laserPowerMult_1.0.minimalAle_0.NODES_2.NODES_XENA_1.PROC_72.PROCS_XENA_18.meshResolution_0.15.foamDensity_0.35 --silo_dir=HWH_domain_18.laserPowerMult_1.0.minimalAle_0.NODES_2.NODES_XENA_1.PROC_72.PROCS_XENA_18.meshResolution_0.15.foamDensity_0.35-72/VIZ --world_coordinates=(0.45, 0.62, 0.17, 0.315) --variables Material Pressure Temperature --state=-3 --fixed_boundaries --annotations -host rztopaz188.llnl.gov -port 5600 -key 631187ca1553b6774f25
cdoutrix 19158  5.1  0.0 694068 78824 pts/29   S    07:09   0:01 /usr/gapps/visit/3.1.2/linux-x86_64/bin/mdserver -no-launch-x --root /p/lustre1/cdoutrix/ALE/Hohlraum -r HWH_domain_18.laserPowerMult_1.0.minimalAle_0.NODES_2.NODES_XENA_1.PROC_72.PROCS_XENA_18.meshResolution_0.15.foamDensity_0.35 --silo_dir=HWH_domain_18.laserPowerMult_1.0.minimalAle_0.NODES_2.NODES_XENA_1.PROC_72.PROCS_XENA_18.meshResolution_0.15.foamDensity_0.35-72/VIZ --world_coordinates=(0.45, 0.62, 0.17, 0.315) --variables Material Pressure Temperature --state=-3 --fixed_boundaries --annotations -host 127.0.0.1 -port 5600 -key 631187ca1553b6774f25
cdoutrix 19161  0.2  0.0  52048  8484 pts/29   S    07:09   0:00 /usr/gapps/visit/bin/../current/linux-x86_64/bin/python /usr/gapps/visit/bin/frontendlauncher.py /usr/gapps/visit/bin/visit -v 3.1 -engine -dir /usr/gapps/visit -idle-timeout 480 -no-launch-x --root /p/lustre1/cdoutrix/ALE/Hohlraum -r HWH_domain_18.laserPowerMult_1.0.minimalAle_0.NODES_2.NODES_XENA_1.PROC_72.PROCS_XENA_18.meshResolution_0.15.foamDensity_0.35 --silo_dir=HWH_domain_18.laserPowerMult_1.0.minimalAle_0.NODES_2.NODES_XENA_1.PROC_72.PROCS_XENA_18.meshResolution_0.15.foamDensity_0.35-72/VIZ --world_coordinates=(0.45, 0.62, 0.17, 0.315) --variables Material Pressure Temperature --state=-3 --fixed_boundaries --annotations -host rztopaz188.llnl.gov -port 5600 -key f9ef0dd80b119c65a5c3
cdoutrix 19171 64.2  0.0 667348 114656 pts/29  Sl   07:09   0:19 /usr/gapps/visit/3.1.2/linux-x86_64/bin/engine_ser -dir /usr/gapps/visit -idle-timeout 480 -no-launch-x --root /p/lustre1/cdoutrix/ALE/Hohlraum -r HWH_domain_18.laserPowerMult_1.0.minimalAle_0.NODES_2.NODES_XENA_1.PROC_72.PROCS_XENA_18.meshResolution_0.15.foamDensity_0.35 --silo_dir=HWH_domain_18.laserPowerMult_1.0.minimalAle_0.NODES_2.NODES_XENA_1.PROC_72.PROCS_XENA_18.meshResolution_0.15.foamDensity_0.35-72/VIZ --world_coordinates=(0.45, 0.62, 0.17, 0.315) --variables Material Pressure Temperature --state=-3 --fixed_boundaries --annotations -host 127.0.0.1 -port 5600 -key f9ef0dd80b119c65a5c3
cdoutrix 19227  0.0  0.0  53844  1804 pts/29   R+   07:09   0:00 ps -aux
cdoutrix 19228  0.0  0.0   9108   944 pts/29   R+   07:09   0:00 grep --color=au

yaml file

description:
  name: generate_hohlraum
  description: Runs Hohlraum Simulation

env:
  labels:
      RUN_PATH: /p/lustre1/cdoutrix/ALE/Hohlraum
      OUTPUT_PATH: /p/lustre1/cdoutrix/ALE/Hohlraum
      SCRIPTS_DIR: /usr/workspace/aml_cs/ALE/LAGER/data-generation/Hohlraum
      MASH_CONFIG_FILE: mash_kull_config.ascii
      ALEEVERY: 1
      MOVIE_VARIABLES: Material Pressure Temperature
      PROCESSORS_PER_NODE: 36
      PROCESSORS_PER_DOMAIN: 4
batch:
  type:  slurm
  queue: pbatch
  host:  rztopaz
  bank:  wbronze

generator.parameters:
  RHO_FOAM2:
    values: [ 2.e-6, 0.350, ]
    label: foamDensity_%%
  LASPOWERMULT:
    values: [ 1.0, ]
    label: laserPowerMult_%%
  RESOLUTION:
    values: [ .15, .25, .5, ]
    label: meshResolution_%%
  DOMAIN:
    values: [ 18, 72, 288, ]
    label: domain_%%
  MINIMALALE:
    values: [ 0, 1 ]
    label: minimalAle_%%
  LINKED:
    RESOLUTION: [ "DOMAIN",]

study:
      #- name: mesh
      #  description: generates Mesh.
      #  run:
      #    cmd: |
      #      cd $(SCRIPTS_DIR)/Mesh1
      #      [ ! -f HWH_resolution_$(printf "%.6f\n" $(RESOLUTION) ).sat ] && /usr/apps/pmesh/bin/draco.new draco_mesh.py -- meshResolution=$(RESOLUTION) || echo "Found mesh file, nothing to do"
      #    depends: []
#
      #- name: xena
      #  description: run xena
      #  run:
      #    cmd: |
      #      cd $(SCRIPTS_DIR)
      #      [ ! -f Mesh1/HWH_$(DOMAIN).silo ] && $(LAUNCHER) /usr/apps/pmesh/bin/xena.new -i Mesh1/HWH_resolution_$(RESOLUTION)*.sat -overlink Mesh1/HWH_$(DOMAIN) -numDomains $(DOMAIN) || echo "Found silo file, nothing to do"
      #    depends: [mesh]
      #    procs: $(PROCS_XENA)
      #    nodes: $(NODES_XENA) 
      #    walltime: "00:30:00"
#
      #- name: simulation
      #  description: the actual simulation
      #  run:
      #    cmd: |
      #      cd $(SCRIPTS_DIR)
      #      echo $(PROC).$(NODES).$(DOMAIN).$(RESOLUTION).$(RHO_FOAM2).$(LASPOWERMULT).$(MINIMALALE).$(PROCS_XENA).$(NODES_XENA)
      #      $(LAUNCHER) /usr/tce/bin/python /usr/apps/kull/bin/kull.dev deck_karen.py restartFrom=-1 runDirBaseName=$(RUN_PATH)/HWH_$( basename $(WORKSPACE) ) nDomains=$(DOMAIN) meshResolution=$(RESOLUTION) mashConfigFile=$(SCRIPTS_DIR)/$(MASH_CONFIG_FILE) rho_foam2=$(RHO_FOAM2) lasPowerMult=$(LASPOWERMULT) aleevery=$(ALEEVERY) minimalALE=$(MINIMALALE) || :
      #    depends: [xena]
      #    procs: $(PROC)
      #    nodes: $(NODES)
      #    walltime: "24:00:00"

      - name: nodes
        description: Convert extracts to HDF5 format
        # We need the echo line to print all params used in the simulation to match directory names
        run:
          cmd: |
           echo $(PROC).$(NODES).$(DOMAIN).$(RESOLUTION).$(RHO_FOAM2).$(LASPOWERMULT).$(MINIMALALE).$(PROCS_XENA).$(NODES_XENA)
           $(LAUNCHER) python $(SCRIPTS_DIR)/binary_2_hdf5.py --root $(RUN_PATH) -r HWH_$( basename $(WORKSPACE) ) -m "{'node': None}"
          depends: []
          procs:  $(PROCESSORS_PER_NODE) 
          nodes: 1
          walltime: "4:00:00"

      - name: relax
        description: Convert extracts to HDF5 format
        # We need the echo line to print all params used in the simulation to match directory names
        run:
          cmd: |
           echo $(PROC).$(NODES).$(DOMAIN).$(RESOLUTION).$(RHO_FOAM2).$(LASPOWERMULT).$(MINIMALALE).$(PROCS_XENA).$(NODES_XENA)
           $(LAUNCHER) python $(SCRIPTS_DIR)/binary_2_hdf5.py --root $(RUN_PATH) -r HWH_$( basename $(WORKSPACE) ) -m "{'scalarRlxData': None, 'dimRlxData':None}"
          depends: [nodes]
          procs:  $(PROCESSORS_PER_NODE) 
          nodes: 1
          walltime: "4:00:00"

      - name: zones
        description: Convert extracts to HDF5 format
        # We need the echo line to print all params used in the simulation to match directory names
        run:
          cmd: |
           echo $(PROC).$(NODES).$(DOMAIN).$(RESOLUTION).$(RHO_FOAM2).$(LASPOWERMULT).$(MINIMALALE).$(PROCS_XENA).$(NODES_XENA)
           $(LAUNCHER) python $(SCRIPTS_DIR)/binary_2_hdf5.py --root $(RUN_PATH) -r HWH_$( basename $(WORKSPACE) ) -m "{'zone':None}"
          depends: [relax]
          procs:  $(PROCESSORS_PER_NODE) 
          nodes: 1
          walltime: "4:00:00"

      - name: movies
        description: Generates some movies based on silo files
        run:
          cmd: |
            echo $(PROC).$(NODES).$(DOMAIN).$(RESOLUTION).$(RHO_FOAM2).$(LASPOWERMULT).$(MINIMALALE).$(PROCS_XENA).$(NODES_XENA)
            export RUNDIR=HWH_$( basename $(WORKSPACE))
            cd $(RUN_PATH)/$RUNDIR/$RUNDIR-$(PROC)/VIZ
            ls *.silo > db.visit
            /usr/gapps/visit/bin/visit -nowin -no-launch-x -cli -s $(SCRIPTS_DIR)/view_variable_visit_at_time_or_cycle.py --root $(RUN_PATH) -r $RUNDIR --silo_dir=$RUNDIR-$(PROC)/VIZ --world_coordinates='(0.45, 0.62, 0.17, 0.315)' --variables "$(MOVIE_VARIABLES)" --state=-3 --fixed_boundaries --annotations
            /usr/gapps/visit/bin/visit -nowin -no-launch-x -cli -s $(SCRIPTS_DIR)/view_variable_visit_at_time_or_cycle.py --root $(RUN_PATH) -r $RUNDIR --silo_dir=$RUNDIR-$(PROC)/VIZ --world_coordinates='(-0.25, 0.85, 0, 0.55)' --variables "$(MOVIE_VARIABLES)" --state=-3 --fixed_boundaries --annotations
            /usr/gapps/visit/bin/visit -nowin -no-launch-x -cli -s $(SCRIPTS_DIR)/view_variable_visit_at_time_or_cycle.py --root $(RUN_PATH) -r $RUNDIR --silo_dir=$RUNDIR-$(PROC)/VIZ --world_coordinates='(0.45, 0.62, 0.17, 0.315)' --variables "$(MOVIE_VARIABLES)" --state=-3 --fixed_boundaries --mesh --annotations
            /usr/gapps/visit/bin/visit -nowin -no-launch-x -cli -s $(SCRIPTS_DIR)/view_variable_visit_at_time_or_cycle.py --root $(RUN_PATH) -r $RUNDIR --silo_dir=$RUNDIR-$(PROC)/VIZ --world_coordinates='(-0.25, 0.85, 0, 0.55)' --variables "$(MOVIE_VARIABLES)" --state=-3 --fixed_boundaries --mesh --annotations
          depends: []

      - name: kosh
        description: add simulation to kosh
        run:
          cmd: |
            echo $(PROC).$(NODES).$(DOMAIN).$(RESOLUTION).$(RHO_FOAM2).$(LASPOWERMULT).$(MINIMALALE).$(PROCS_XENA).$(NODES_XENA)
            export RUNDIR=HWH_$( basename $(WORKSPACE))
            python $(SCRIPTS_DIR)/add_to_kosh.py --store=/usr/workspace/aml_cs/kosh/kosh_store.sql --root $(RUN_PATH) -n $RUNDIR
          depends: [nodes, relax, zones, movies]
      
      - name: directory_permissions
        description: fix directory permissions
        run:
          cmd: |
            echo $(PROC).$(NODES).$(DOMAIN).$(RESOLUTION).$(RHO_FOAM2).$(LASPOWERMULT).$(MINIMALALE).$(PROCS_XENA).$(NODES_XENA)
            export RUNDIR=HWH_$( basename $(WORKSPACE))
            find $(RUN_PATH)/$RUNDIR  -type f -exec chmod g+r  {} +
            find $(RUN_PATH)/$RUNDIR  -type d -exec chmod g+x  {} +
            chgrp -R aml_cs $(RUN_PATH)/$RUNDIR
          depends: [zones, movies]
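For readers unfamiliar with the LINKED block in the spec above: DOMAIN does not get crossed with the other parameters; it is pulled out of the cross product and paired with RESOLUTION by index. A minimal stdlib-only sketch of that semantics (itertools.product stands in here for sklearn's ParameterGrid, which the custom generator actually uses):

```python
from itertools import product

# Unlinked parameters are crossed; DOMAIN is linked to RESOLUTION,
# so domain[i] always travels with resolution[i].
resolution = [0.15, 0.25, 0.5]
domain = [18, 72, 288]          # linked to resolution by index
rho_foam2 = [2e-06, 0.35]
minimal_ale = [0, 1]

combos = []
for ale, res, rho in product(minimal_ale, resolution, rho_foam2):
    combos.append({
        "MINIMALALE": ale,
        "RESOLUTION": res,
        "DOMAIN": domain[resolution.index(res)],  # linked lookup
        "RHO_FOAM2": rho,
    })

print(len(combos))  # 12 combinations, matching the 12-entry P: lists in the log
```

Without the link, DOMAIN would multiply the combination count by 3 instead of riding along with RESOLUTION.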

generator py file

import sys
from maestrowf.datastructures.core import ParameterGenerator
from sklearn.model_selection import ParameterGrid
import yaml
try:
  from yaml import CLoader as Loader, CDumper as Dumper
except ImportError:
  from yaml import Loader, Dumper


def compute_nodes(number_procs, proc_per_node):
    if number_procs % proc_per_node == 0:
      add_node = 0
    else:
      add_node = 1
    return number_procs // proc_per_node + add_node

def compute_xena_nodes_and_procs(number_domains, proc_per_node):
    # Hard code to 4 domains per processor and use every processor on the node
    domains_per_proc = 4
    domains_per_node = proc_per_node*domains_per_proc
    add_node = 0
    if number_domains%domains_per_node != 0 :
      add_node = 1
    nodes = number_domains // domains_per_node + add_node
    # Use every processor on the nodes, up to the number of domains
    procs = min( number_domains, proc_per_node*nodes )
    return ( nodes, procs )
  

def get_custom_generator(env, **kwargs):
  """
Create a custom populated ParameterGenerator.

This function recreates the exact same parameter set as the sample LULESH
specifications. The point of this file is to present an example of how to
generate custom parameters.

:returns: A ParameterGenerator populated with parameters.
"""
  import sys
  p_gen = ParameterGenerator()
  yml = yaml.load(open(sys.argv[-1]).read(), Loader=Loader)
  p_in = {}
  labels = {}

  linked_ps = {}
  linked = {}

  for k, val in yml["generator.parameters"].items():
    if "values" in val:
        if isinstance(val["values"], (list,tuple)):
            p_in[k] = list(val["values"])
        else:
            p_in[k] = [val["values"],]
        labels[k] = val["label"]
    elif k == "LINKED":
        linked = val

  for plink in linked:
    for link in linked[plink]:
        if not plink in linked_ps:
            linked_ps[plink] = {}
        linked_ps[plink][link] = p_in.pop(link)

  grid = ParameterGrid(p_in)
  p = {}
  for g in grid:
    for k in g:
        if k not in p:
            p[k] = [g[k], ]
            if k in linked_ps:
                for link in linked_ps[k]:
                    p[link] = [linked_ps[k][link][p_in[k].index(g[k])],]
        else:
            p[k].append(g[k])
            if k in linked_ps:
                for link in linked_ps[k]:
                    p[link].append(linked_ps[k][link][p_in[k].index(g[k])])

  # now we have some magic to do:
  # first use global env to figure number of procs per node
  proc_per_node = yml['env']['labels']['PROCESSORS_PER_NODE']
  proc_per_domain = yml['env']['labels']['PROCESSORS_PER_DOMAIN']

  p["PROC"] = []
  p["NODES"] = []
  p['PROCS_XENA'] = []
  p['NODES_XENA'] = []

  for i, d in enumerate(p["DOMAIN"]):
      p["PROC"].append(d*proc_per_domain)
      p["NODES"].append(compute_nodes(p["PROC"][-1], proc_per_node))
      # Xena can create sub domains for every processor.  Create 4 domains
      # per processor to reduce the node count needed.
      xena_nodes_and_procs = compute_xena_nodes_and_procs(d,proc_per_node)
      p["NODES_XENA"].append(xena_nodes_and_procs[0])
      p["PROCS_XENA"].append(xena_nodes_and_procs[1])

  labels["PROC"] = "PROC_%%"
  labels["NODES"] = "NODES_%%"
  labels["NODES_XENA"] = "NODES_XENA_%%"
  labels["PROCS_XENA"] = "PROCS_XENA_%%"
  for k, val in p.items():
    print("P:",k,val)
    p_gen.add_parameter(k, val, labels[k])
  return p_gen
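The derived PROC/NODES values printed in the log can be checked by hand. Below is a self-contained restatement of the two helper functions above (same behavior; a ceiling-division idiom replaces the explicit remainder check), exercised with this study's settings (PROCESSORS_PER_NODE=36, PROCESSORS_PER_DOMAIN=4):

```python
def compute_nodes(number_procs, proc_per_node):
    # Ceiling division: any remainder costs one extra node.
    return -(-number_procs // proc_per_node)

def compute_xena_nodes_and_procs(number_domains, proc_per_node):
    # 4 domains per processor, using every processor on each node.
    domains_per_node = proc_per_node * 4
    nodes = -(-number_domains // domains_per_node)
    procs = min(number_domains, proc_per_node * nodes)
    return nodes, procs

PROC_PER_NODE, PROC_PER_DOMAIN = 36, 4
for d in (18, 72, 288):
    proc = d * PROC_PER_DOMAIN
    nodes = compute_nodes(proc, PROC_PER_NODE)
    xena_nodes, xena_procs = compute_xena_nodes_and_procs(d, PROC_PER_NODE)
    print(d, proc, nodes, xena_procs, xena_nodes)
# 18   -> PROC 72,   NODES 2,  PROCS_XENA 18, NODES_XENA 1
# 72   -> PROC 288,  NODES 8,  PROCS_XENA 36, NODES_XENA 1
# 288  -> PROC 1152, NODES 32, PROCS_XENA 72, NODES_XENA 2
```

These agree with the P: PROC/NODES/PROCS_XENA/NODES_XENA lists in the screen output above.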

doutriaux1 avatar Jun 04 '20 14:06 doutriaux1

I believe what you're seeing is the result of having a local step in parallel with the scheduled steps. Since the local step is long-running, the other scripts aren't generated until the first local step is done -- so everything is stuck behind that local step (script generation and everything). That's why you see the processes appear, but not the full set, and why you don't see the other scripts.

This behavior will be changing in the near future. The local execution will be parallelized and behave more like a scheduled step where processes will be started and the workflow will be allowed to continue. It does look like the conductor itself is still running, so Maestro hasn't crashed.

That said, the status that gets dumped gets written after the first set of steps -- which means the local execution of a slow step holds that up too. I think better user feedback here might be to dump the status first so that it doesn't give the impression that the job failed.
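The difference between today's blocking local execution and the planned scheduled-style behavior can be illustrated with plain subprocess calls. This is only an analogy, not Maestro's actual conductor code:

```python
import subprocess
import sys

# A stand-in for a long-running local step.
SLOW_STEP = [sys.executable, "-c", "import time; time.sleep(1)"]

# Blocking, like current local steps: nothing else (status dump,
# script generation for other steps) happens until this returns.
subprocess.run(SLOW_STEP, check=True)

# Non-blocking, like a scheduled step: launch, then carry on with
# the rest of the workflow while the job runs in the background.
proc = subprocess.Popen(SLOW_STEP)
print("still running:", proc.poll() is None)
proc.wait()
```

With the non-blocking pattern, the status file could be written immediately after launch, which would avoid the misleading "Status check failed" impression the reporter hit.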

@doutriaux1 -- What do you think?

FrankD412 avatar Jun 04 '20 16:06 FrankD412

@FrankD412 I agree the message is confusing, and I ended up restarting maestro many times before realizing I was filling my node with long-running jobs. So giving a better message to the user would definitely help. I'm willing to beta test if you want.

doutriaux1 avatar Jun 04 '20 20:06 doutriaux1

@doutriaux1 -- Sounds good. I'm working on a prototype and will let you know when you can give it a shot.

FrankD412 avatar Jun 05 '20 16:06 FrankD412

@FrankD412 as an FYI, I seem to be getting similar behavior when using:

  type: local_parallel

But I'm not 100% sure this is an officially supported feature.

doutriaux1 avatar Jun 08 '20 18:06 doutriaux1

@doutriaux1 -- that's not currently a supported adapter in the current version. Are you referring to my fork?

FrankD412 avatar Jun 08 '20 18:06 FrankD412

@FrankD412 I've seen it in another study and was trying it out because I thought it was already back in the official repo. That explains why it confuses the status.

doutriaux1 avatar Jun 08 '20 18:06 doutriaux1

@doutriaux1 -- Oh got it. Yeah, that's a different prototype for running a conductor locally in an allocation. I definitely aliased the name in my fork. Sorry about the confusion.

FrankD412 avatar Jun 08 '20 18:06 FrankD412