mrjob
Oops, ssh subprocess exited with return code 255, restarting
I am using SSH keys with Amazon EMR and repeatedly get the error "Oops, ssh subprocess exited with return code 255, restarting":
RUNNING for 1:10:08
Oops, ssh subprocess exited with return code 255, restarting...
Opening ssh tunnel to resource manager...
Connect to resource manager at: http://localhost:40685/cluster
When using the verbose mode "-v" I get the following output:
Waiting 30.0 seconds...
Oops, ssh subprocess exited with return code 255, restarting...
Opening ssh tunnel to resource manager...
Created empty ssh known-hosts file: /var/folders/qr/nd8l4bk528772jshgnxgwsym0000gn/T/XXX.20180504.110507.270074/fake_ssh_known_hosts
> ssh -o VerifyHostKeyDNS=no -o StrictHostKeyChecking=no -o ExitOnForwardFailure=yes -o UserKnownHostsFile=/var/folders/qr/nd8l4bk528772jshgnxgwsym0000gn/T/XXX.20180504.110507.270074/fake_ssh_known_hosts -L 40563:172.31.35.149:9026 -N -n -q -i /XXX.pem [email protected]
Connect to resource manager at: http://localhost:40563/cluster
RUNNING for 0:02:39
Fetching progress from resource manager at http://localhost:40563/cluster
failed: <urlopen error [Errno 61] Connection refused>
Fetching progress from resource manager over SSH
> ssh -i /XXX.pem -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o VerifyHostKeyDNS=no [email protected] curl http://172.31.35.149:9026/cluster
failed: ssh: connect to host ec2-34-211-67-146.us-west-2.compute.amazonaws.com port 22: Operation timed out
Is there a way to disable the output or further debug the problem?
Can you tell me more about your setup, so I can try to duplicate the problem you're having? It could be as simple as mrjob not having the right port number for the resource manager.
Let's see. I'm using mrjob 0.6.2 and starting an MR job on EMR in Oregon (us-west-2) based on python2.7. As for ports, I'm not aware that my computer (macOS) or the EMR servers block any port. The only port mentioned is port 40563 (see above), but I couldn't find a reference to a problem with it.
We had some problems with the "current" image_version (5.x) and went back to 3.0.4 - just in case the ssh/log problems are version-specific.
Furthermore, I created our IAM role ~12 hours ago, when IAM was having timeout problems (i.e., the user/role might have problems).
Nevertheless, small jobs work without problems except for the strange logs.
Weird thing: I just logged manually into the master node and added it to the known hosts and the mrjob log changed:
...
RUNNING for 3:15:14
Oops, ssh subprocess exited with return code 255, restarting...
Opening ssh tunnel to resource manager...
Connect to resource manager at: http://localhost:40191/cluster
RUNNING for 3:17:01
95.0% complete
RUNNING for 3:18:43
...
Wouldn't that mean the "new" clusters always have this log problem until one adds them to the known hosts? And is there a way to enable this via a mrjob setting? I'm currently not working with a "stand-by" cluster but always start one anew on EMR.
mrjob runs ssh with -o UserKnownHostsFile=/path/to/fake_known_hosts_file, where fake_known_hosts_file is an initially empty file that mrjob controls.
It sounds like your SSH binary is either ignoring the UserKnownHostsFile option, or doesn't like the path for some reason. Possibly an issue like this? https://superuser.com/questions/1112122/windows-openssh-ignoring-userknownhostsfile-option
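One way to check this by hand might be to rerun the tunnel command from the verbose log above, but with -v in place of -q, so that ssh prints why it exits with code 255 instead of staying quiet. The sketch below only rebuilds that command line as a string to paste into a terminal; it is not mrjob code. The function names are made up for illustration, the script assumes Python 3 on the local machine, and the user@host value is a placeholder you would have to fill in yourself.

```python
import shlex


def build_debug_tunnel_cmd(key_file, user_host, local_port,
                           remote_host, remote_port, known_hosts_file):
    """Return the tunnel command from the log as an argument list.

    Hypothetical helper (not part of mrjob). It mirrors the options shown
    in the verbose log, except that -q (quiet) is replaced by -v (verbose)
    so that ssh reports the reason for a failure on stderr.
    """
    return [
        "ssh",
        "-v",
        "-o", "VerifyHostKeyDNS=no",
        "-o", "StrictHostKeyChecking=no",
        "-o", "ExitOnForwardFailure=yes",
        "-o", "UserKnownHostsFile=%s" % known_hosts_file,
        "-L", "%d:%s:%d" % (local_port, remote_host, remote_port),
        "-N", "-n",
        "-i", key_file,
        user_host,
    ]


def as_shell_line(cmd):
    """Quote each argument so the command can be pasted into a shell."""
    return " ".join(shlex.quote(arg) for arg in cmd)


# Example values: ports and private IP are taken from the log above;
# the user@host string and known-hosts path are placeholders (assumptions).
example = build_debug_tunnel_cmd(
    key_file="/XXX.pem",
    user_host="USER@MASTER_PUBLIC_DNS",
    local_port=40563,
    remote_host="172.31.35.149",
    remote_port=9026,
    known_hosts_file="/tmp/fake_ssh_known_hosts",
)
print(as_shell_line(example))
```

If the verbose output stops at the connection attempt, that points to a network problem; if it gets as far as host key or key file messages, the known-hosts handling or the key is the more likely suspect.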
Oh, it looks like you're on a Mac, so it wouldn't be a file path issue. Maybe more of a network issue? Looks like you're getting a connection timeout. Not sure why manually SSHing in would fix a network issue though.
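To separate a network problem from an SSH problem, it may help to test whether port 22 on the master node is reachable at all from your machine, since the log above ends with "port 22: Operation timed out". The following is a minimal sketch using only the Python 3 standard library; can_connect is a made-up helper name, and the hostname in the usage comment is the one that appears in the log.

```python
import socket


def can_connect(host, port, timeout=10.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper for debugging. A False result covers both
    "connection refused" and "timed out", which are OSError subclasses.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Usage (not run here, since it needs a live cluster):
# can_connect("ec2-34-211-67-146.us-west-2.compute.amazonaws.com", 22)
```

If this returns False while the cluster is running, one thing worth checking is whether the security group of the master node allows inbound traffic on port 22 from your IP address.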
@davidmarin @joergrech I am facing the same issue and don't know how to resolve it. Could you please tell me the steps to fix it?