maestrowf
ssh to launch fails
I have a simple yaml file that I use to launch a study. The details of the study aren't that important, and I don't think they relate to this problem.
Here's the problem:
I'm on rztopaz, and I can launch the study by simply running maestro run -y path/to/study.yaml. That works fine.
What I really need, though, is to launch the study on rzansel from rztopaz. So, I tried running something very similar:
ssh rzansel 'cd /where/I/should/be; <activate environment>; maestro run -y path/to/study.yaml'
When I do this, it says the study launched successfully, but it clearly dies while trying to set things up. Here's the error I get in the log file:
[2021-11-05 08:24:18,424: INFO] Checking DAG status at 2021-11-05 08:24:18.424257
[2021-11-05 08:24:18,427: ERROR] Error code '127' seen. Unexpected behavior encountered.
[2021-11-05 08:24:18,427: ERROR] Unknown Error (Code = JobStatusCode.ERROR)
[2021-11-05 08:24:18,428: ERROR] Job status check failed -- Aborting.
[2021-11-05 08:24:18,428: ERROR] ('Job status check failed -- Aborting.',)
Traceback (most recent call last):
  File "/usr/WS2/maguire7/virtual_env/bvolatile/lib/python3.7/site-packages/maestrowf/conductor.py", line 382, in main
    completion_status = conductor.monitor_study()
  File "/usr/WS2/maguire7/virtual_env/bvolatile/lib/python3.7/site-packages/maestrowf/conductor.py", line 352, in monitor_study
    completion_status = dag.execute_ready_steps()
  File "/usr/WS2/maguire7/virtual_env/bvolatile/lib/python3.7/site-packages/maestrowf/datastructures/core/executiongraph.py", line 690, in execute_ready_steps
    raise RuntimeError(msg)
RuntimeError: Job status check failed -- Aborting.
[2021-11-05 08:24:18,429: INFO] Study exiting, cleaning up...
[2021-11-05 08:24:18,429: INFO] Squeaky clean!
If I run the following command from rzansel, everything works fine:
cd /where/I/should/be; <activate environment>; maestro run -y path/to/study.yaml
This makes me think that the issue comes from the ssh call. Unfortunately, the ssh is required for my particular study. Any ideas here?
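A possible lead, offered as a guess rather than a diagnosis: in POSIX shells an exit status of 127 means a command could not be found, so the status check may be invoking something that is not on PATH in the ssh session. The sketch below reports which commands a shell can resolve. The helper name check_cmds is hypothetical, and which commands to test (maestro itself, the scheduler's submit and status tools) is an assumption that depends on the machine.

```shell
# Report whether each named command can be resolved in the current shell.
# Returns 127 (the shell's "command not found" status) if any is missing.
check_cmds() {
    rc=0
    for cmd in "$@"; do
        if command -v "$cmd" >/dev/null 2>&1; then
            echo "found: $cmd"
        else
            echo "missing: $cmd"
            rc=127
        fi
    done
    return "$rc"
}

# Assumed usage: paste the function into both an interactive login on rzansel
# and the quoted ssh command string, then compare what each one reports, e.g.
#   check_cmds maestro <scheduler status command>
```

A command that shows up as found interactively but missing over ssh would point to PATH setup that only happens in interactive or login shells.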
I think this might have to do with interactive vs. non-interactive environments. Adding some of the environment variables that are set in an interactive session but missing under ssh gets me further. It still fails eventually, but the study is actually launched.
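For anyone chasing the same thing, one way to find which variables differ is to dump the environment from a working session and from the ssh session, then list the names that appear only in the working one. This is a minimal sketch: missing_vars and the file names are hypothetical and not part of maestro, and variables with multi-line values may show up as spurious names.

```shell
# Print the names of variables present in the first env dump ($1, the working
# session) but absent from the second ($2, the failing session).
# Both files are expected to hold plain `env` output, one NAME=value per line.
missing_vars() {
    awk -F= '
        NR == FNR { seen[$1] = 1; next }   # first file read: names in the failing session
        !($1 in seen) { print $1 }         # second file read: names only in the working session
    ' "$2" "$1"
}

# Assumed usage (file names are illustrative):
#   env > working.env                    # run from an interactive login on rzansel
#   ssh rzansel env > failing.env        # run from rztopaz
#   missing_vars working.env failing.env
```

The names it prints are candidates to export explicitly in the ssh command string before calling maestro run.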
Hi @aowen87 -- thanks for the bug report; I was thinking it might be something with the interactive environment. Do you happen to have an error for the one you got launched but that ended up failing?
I'm pretty sure that second error was actually my fault. I've gotten a bit distracted and haven't had a chance to look at this again, but I'll post more info later if I'm wrong about this.
@aowen87 -- No worries and no rush; I just wanted to make sure that you got all the support you needed. :-)
Thanks! I appreciate it!