Oops, ssh subprocess exited with return code 255, restarting

Open joergrech opened this issue 6 years ago • 6 comments

I am using ssh keys with Amazon EMR and get the error "Oops, ssh subprocess exited with return code 255, restarting" many times.

  RUNNING for 1:10:08
  Oops, ssh subprocess exited with return code 255, restarting...
  Opening ssh tunnel to resource manager...
  Connect to resource manager at: http://localhost:40685/cluster

When using the verbose mode "-v" I get the following output:

Waiting 30.0 seconds...
  Oops, ssh subprocess exited with return code 255, restarting...
  Opening ssh tunnel to resource manager...
Created empty ssh known-hosts file: /var/folders/qr/nd8l4bk528772jshgnxgwsym0000gn/T/XXX.20180504.110507.270074/fake_ssh_known_hosts
> ssh -o VerifyHostKeyDNS=no -o StrictHostKeyChecking=no -o ExitOnForwardFailure=yes -o UserKnownHostsFile=/var/folders/qr/nd8l4bk528772jshgnxgwsym0000gn/T/XXX.20180504.110507.270074/fake_ssh_known_hosts -L 40563:172.31.35.149:9026 -N -n -q -i /XXX.pem [email protected]
  Connect to resource manager at: http://localhost:40563/cluster
  RUNNING for 0:02:39
  Fetching progress from resource manager at http://localhost:40563/cluster
    failed: <urlopen error [Errno 61] Connection refused>
  Fetching progress from resource manager over SSH
  > ssh -i /XXX.pem -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o VerifyHostKeyDNS=no [email protected] curl http://172.31.35.149:9026/cluster
    failed: ssh: connect to host ec2-34-211-67-146.us-west-2.compute.amazonaws.com port 22: Operation timed out

Is there a way to disable the output or further debug the problem?

joergrech avatar May 04 '18 13:05 joergrech

Can you tell me more about your setup, so I can try to duplicate the problem you're having? It could be as simple as mrjob not having the right port number for the resource manager.

coyotemarin avatar May 04 '18 16:05 coyotemarin

Let's see. I'm using mrjob 0.6.2 and start an MR job on EMR in Oregon (us-west-2) based on python2.7. As for ports, I'm not aware that my computer (macOS) or the EMR servers block a port. The only port mentioned is port 40563 (see above), but I couldn't find a reference to a problem with it.

We had some problems with the "current" image_version (5.x) and went back to 3.0.4 - just in case the ssh/log problems are version-specific.

Furthermore, I created our IAM Role ~12 hours ago, when IAM was having timeout problems (i.e., the User/Role might have problems).

Nevertheless, small jobs work without problems except for the strange logs.

joergrech avatar May 04 '18 17:05 joergrech

Weird thing: I just logged manually into the master node and added it to the known hosts and the mrjob log changed:

  ...
  RUNNING for 3:15:14
  Oops, ssh subprocess exited with return code 255, restarting...
  Opening ssh tunnel to resource manager...
  Connect to resource manager at: http://localhost:40191/cluster
  RUNNING for 3:17:01
    95.0% complete
  RUNNING for 3:18:43
  ...

Wouldn't that mean the "new" clusters always have this log problem until one adds them to the known hosts? And is there a way to enable this via a mrjob setting? I'm currently not working with a "stand-by" cluster but always start one anew on EMR.

joergrech avatar May 12 '18 16:05 joergrech

mrjob runs ssh with -o UserKnownHostsFile=/path/to/fake_known_hosts_file, where fake_known_hosts_file is an initially empty file that mrjob controls.

It sounds like your SSH binary is either ignoring the UserKnownHostsFile option, or doesn't like the path for some reason. Possibly an issue like this? https://superuser.com/questions/1112122/windows-openssh-ignoring-userknownhostsfile-option
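
One way to dig further, as a rough sketch rather than anything mrjob provides: rebuild the tunnel command from the verbose log above, but with ssh's -v in place of -q, so ssh prints why it exits with 255 instead of staying quiet. The function name and its parameters are made up for illustration; the ssh options are copied from the logged command, and the placeholder values in the example call must be replaced with the ones from your own log.

```python
import subprocess


def build_debug_tunnel_cmd(key_file, user_at_host, local_port,
                           remote_host, remote_port, known_hosts_file):
    """Return the logged tunnel command as an argument list, with -v replacing -q.

    Hypothetical helper: it only mirrors the options seen in the log above.
    """
    return [
        'ssh',
        '-o', 'VerifyHostKeyDNS=no',
        '-o', 'StrictHostKeyChecking=no',
        '-o', 'ExitOnForwardFailure=yes',
        '-o', 'UserKnownHostsFile={}'.format(known_hosts_file),
        '-L', '{}:{}:{}'.format(local_port, remote_host, remote_port),
        '-N', '-n',
        '-v',  # verbose: ssh reports each connection step on stderr
        '-i', key_file,
        user_at_host,
    ]


if __name__ == '__main__':
    # Placeholder values; substitute the ones from your own mrjob log.
    cmd = build_debug_tunnel_cmd(
        key_file='/XXX.pem',
        user_at_host='USER@MASTER_PUBLIC_DNS',
        local_port=40563,
        remote_host='172.31.35.149',
        remote_port=9026,
        known_hosts_file='/tmp/fake_ssh_known_hosts',
    )
    # Blocks while the tunnel is up; ssh's diagnostics go to the terminal.
    print('ssh exited with', subprocess.call(cmd))
```

If the verbose output stops at the connection step, the known-hosts file is probably not the culprit.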

coyotemarin avatar Sep 08 '18 00:09 coyotemarin

Oh, it looks like you're on a mac, so it wouldn't be a file path issue. Maybe more of a network issue? Looks like you're getting a connection timeout. Not sure why manually SSHing in would fix a network issue though.
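
To tell a network problem apart from an ssh configuration problem, a quick check is whether port 22 on the master node accepts a TCP connection at all. This is only a sketch: `can_connect` is a hypothetical helper, not part of mrjob, and the hostname is the one from the log above.

```python
import socket


def can_connect(host, port=22, timeout=5.0):
    """Return True if a TCP connection to host:port opens within timeout seconds."""
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        # Covers refused connections, timeouts, and DNS lookup failures.
        return False
    sock.close()
    return True


if __name__ == '__main__':
    host = 'ec2-34-211-67-146.us-west-2.compute.amazonaws.com'
    print(host, 'port 22 reachable:', can_connect(host))
```

If this prints False while the cluster is running, the timeout most likely comes from something between your machine and the master node (for example a firewall or security-group rule) rather than from the known-hosts handling.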

coyotemarin avatar Sep 08 '18 00:09 coyotemarin

@davidmarin @joergrech I am facing the same issue and don't know how to resolve it. Could you please tell me the steps to fix it?

aaqibjavith avatar Oct 14 '18 06:10 aaqibjavith