mrjob
ssh tunnels without job runners
When people use the same job flow for several jobs, they like to be able to just leave the same SSH tunnel open. Currently, ssh tunnels are tied to runners, so once a job finishes, the SSH tunnel goes away.
We should probably have a function in mrjob.ssh that can create an SSH tunnel to a given job flow. It should probably return a context manager (an object with __enter__ and __exit__ methods), so you can do:
with ssh_tunnel_to(job_flow_id):
    ...
I'm not sure of the best way to pass parameters to this function. It needs to know:
- EMR connection settings (it could probably just take an EmrConnection object)
- path to the .pem file
- ssh binary
- a range of ports that we can listen on locally
- whether the SSH tunnel is open
All but the first two arguments can be defaulted. It might make sense to have one method in EMRJobRunner that can create an SSH tunnel with no arguments, and another function in mrjob.ssh that takes these arguments.
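A rough sketch of what the standalone function could look like, to make the proposal concrete. Everything here is an assumption, not existing mrjob API: the name tunnel_args, the parameter names, the hadoop login user, and the default remote port 9100 are all hypothetical. To stay self-contained it takes the master node's hostname directly rather than a job flow ID, so the lookup through the EMR connection is left out.

```python
import subprocess
from contextlib import contextmanager


def tunnel_args(host, key_path, local_port, remote_port=9100,
                ssh_bin=('ssh',)):
    """Build the ssh command line for a local port forward.

    remote_port=9100 and the 'hadoop' user are assumptions for
    illustration; adjust them for the cluster in question.
    """
    return list(ssh_bin) + [
        '-o', 'ExitOnForwardFailure=yes',  # exit if the port can't be bound
        '-i', key_path,                    # path to the .pem file
        '-L', '%d:localhost:%d' % (local_port, remote_port),
        '-N',                              # forward only, no remote command
        'hadoop@' + host,
    ]


@contextmanager
def ssh_tunnel_to(host, key_path, local_port, **kwargs):
    """Open an SSH tunnel and close it when the with block exits."""
    proc = subprocess.Popen(
        tunnel_args(host, key_path, local_port, **kwargs),
        stdin=subprocess.DEVNULL)
    try:
        yield proc
    finally:
        # Tear the tunnel down even if the body raised.
        proc.terminate()
        proc.wait()
```

Keeping the argument list in its own function means the EMRJobRunner method could reuse it and fill in the values from its own options, while the mrjob.ssh function stays independent of any runner.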
Maybe as part of the mrjob ssh subcommand? (see #1113)
mrjob at least attempts to use the same port number on any given cluster by using the cluster ID as a seed for the random number generator (see #67).
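For illustration, seeding by cluster ID could look roughly like the sketch below. This is not necessarily how mrjob implements it: the function name candidate_ports, the example cluster ID, and the default port range are assumptions.

```python
import random


def candidate_ports(cluster_id, ports=range(40001, 40841)):
    """Return the ports in an order that depends only on the cluster ID.

    A private Random instance seeded with the ID leaves the global
    random state untouched. The default range is an assumed example.
    """
    rng = random.Random(cluster_id)
    shuffled = list(ports)
    rng.shuffle(shuffled)
    return shuffled


# The same cluster ID always yields the same order, so the first free
# port tends to be the same one each time.
first_try = candidate_ports('j-EXAMPLEID')[0]
```

A caller would then try each port in order until ssh manages to bind one, so the port only changes when an earlier choice is already in use.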