maestrowf

ssh to launch fails

Open aowen87 opened this issue 3 years ago • 5 comments

I have a simple YAML file that I use to launch a study. The details of the study aren't that important, and I don't think they relate to this problem.

Here's the problem:

I'm on rztopaz, and I can launch the study by simply running maestro run -y path/to/study.yaml. That works fine.

What I really need, though, is to launch the study on rzansel from rztopaz. So, I tried running something very similar: ssh rzansel 'cd /where/I/should/be; <activate environment>; maestro run -y path/to/study.yaml'

When I do this, it says the study launched successfully, but it clearly dies while trying to set things up. Here's the error I get in the log file:

[2021-11-05 08:24:18,424: INFO] Checking DAG status at 2021-11-05 08:24:18.424257
[2021-11-05 08:24:18,427: ERROR] Error code '127' seen. Unexpected behavior encountered.
[2021-11-05 08:24:18,427: ERROR] Unknown Error (Code = JobStatusCode.ERROR)
[2021-11-05 08:24:18,428: ERROR] Job status check failed -- Aborting.
[2021-11-05 08:24:18,428: ERROR] ('Job status check failed -- Aborting.',)
Traceback (most recent call last):
  File "/usr/WS2/maguire7/virtual_env/bvolatile/lib/python3.7/site-packages/maestrowf/conductor.py", line 382, in main
    completion_status = conductor.monitor_study()
  File "/usr/WS2/maguire7/virtual_env/bvolatile/lib/python3.7/site-packages/maestrowf/conductor.py", line 352, in monitor_study
    completion_status = dag.execute_ready_steps()
  File "/usr/WS2/maguire7/virtual_env/bvolatile/lib/python3.7/site-packages/maestrowf/datastructures/core/executiongraph.py", line 690, in execute_ready_steps
    raise RuntimeError(msg)
RuntimeError: Job status check failed -- Aborting.
[2021-11-05 08:24:18,429: INFO] Study exiting, cleaning up...
[2021-11-05 08:24:18,429: INFO] Squeaky clean!

If I run the following command from rzansel, everything works fine: cd /where/I/should/be; <activate environment>; maestro run -y path/to/study.yaml

This makes me think that the issue comes from the ssh call. Unfortunately, the ssh is required for my particular study. Any ideas here?
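For context on the log above: 127 is the exit status a POSIX shell conventionally returns when it cannot find the command it was asked to run. The sketch below only demonstrates that convention from Python; the command name in it is made up for illustration and is not something Maestro calls.

```python
import subprocess


def exit_code_of(command: str) -> int:
    """Run a command line through the system shell and return its exit status."""
    result = subprocess.run(
        command,
        shell=True,  # on POSIX this runs: /bin/sh -c "<command>"
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode


# Hypothetical name, chosen only because no such program should exist.
print(exit_code_of("no_such_scheduler_command_xyz"))
# A command that does exist and succeeds, for comparison.
print(exit_code_of("true"))
```

If the job status check shells out to a scheduler command that is not on PATH in the ssh session, a 127 like the one in the log is what I would expect to see, though the log alone does not say which command was missing.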

aowen87, Nov 05 '21 15:11

I think this might have to do with interactive vs. non-interactive environments. Adding some of the environment variables that an interactive session sets, but the ssh session lacks, gets me further. It still fails eventually, but the study is actually launched.
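A minimal sketch of one way to compare the two environments, assuming the failure comes from commands missing on PATH in the non-interactive session. The command names in the list are an assumption (LSF-style tools); they are not taken from the log, so substitute whatever the scheduler on the target machine actually provides.

```python
import os
import shutil


def missing_commands(commands, path=None):
    """Return the commands that cannot be resolved on the given PATH string.

    If path is None, the PATH of the current process is used.
    """
    search_path = os.environ.get("PATH", "") if path is None else path
    return [cmd for cmd in commands if shutil.which(cmd, path=search_path) is None]


# Illustrative list only: assumed scheduler commands, not confirmed by the log.
candidates = ["bjobs", "bsub", "bkill"]

print("PATH =", os.environ.get("PATH", ""))
print("not found:", missing_commands(candidates))
```

Saving this under a name of your choosing and running it twice, once from an interactive shell on rzansel and once through the same ssh command used to launch the study, should show whether PATH differs between the two and which commands drop out.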

aowen87, Nov 05 '21 18:11

Hi @aowen87 -- thanks for the bug report; I was thinking it might be something with the interactive environment. Do you happen to have the error from the run that launched but ended up failing?

FrankD412, Nov 05 '21 21:11

I'm pretty sure that second error was actually my fault. I've gotten a bit distracted and haven't had a chance to look at this again, but I'll post more info later if I'm wrong about this.

aowen87, Nov 09 '21 21:11

@aowen87 -- No worries and no rush; I just wanted to make sure that you got all the support you needed. :-)

FrankD412, Nov 09 '21 21:11

Thanks! I appreciate it!

aowen87, Nov 09 '21 21:11