
Add "connect to job" functionality, use that for CondorSpawner

Open olifre opened this issue 5 years ago • 8 comments

This makes it possible to start a connect_to_job background task on the hub when a job starts, which establishes connectivity to the actual single-user server. An example of such a command is condor_ssh_to_job for HTCondor batch systems.

Additionally, the background tasks are monitored:

  • for successful startup: the background task is given some time to establish connectivity;
  • in poll() during the job's runtime.

If a background task fails, the job is terminated.
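
Roughly, the supervision works like this (a minimal asyncio sketch of the idea only; the PR wires this into the spawner's start()/poll() lifecycle with its own trait names, timeouts and error handling):

```python
import asyncio

async def start_connect_to_job(cmd: str, startup_grace: float = 10.0):
    """Start the connectivity command and give it some time to come up.

    Sketch only: names and timeouts here are illustrative, not the PR's code.
    """
    proc = await asyncio.create_subprocess_shell(cmd)
    try:
        # If the command exits within the grace period, startup has failed.
        await asyncio.wait_for(proc.wait(), timeout=startup_grace)
    except asyncio.TimeoutError:
        # Still running after the grace period: treat connectivity as established.
        return proc
    raise RuntimeError(f"connect_to_job exited early with code {proc.returncode}")


def connect_to_job_failed(proc: asyncio.subprocess.Process) -> bool:
    """Checked during poll(): a finished background task means connectivity
    was lost, so the spawner should terminate the job."""
    return proc.returncode is not None
```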

For the CondorSpawner, this leverages condor_ssh_to_job to forward the port of the single-user server to the hub, removing the need for direct connectivity from the hub to the execute nodes. Notably, it makes it possible to use worker node setups in which the workers do not allow direct inbound connectivity (e.g. NATted setups): only outbound connectivity from worker to hub is required, and the other direction is delegated to the batch system.
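
For illustration, the hub-side configuration could look roughly like this. The connect_to_job_cmd trait name follows this PR's description; the exact condor_ssh_to_job flags and the {jobid}/{port}/{rport} placeholders are illustrative, not necessarily what the PR ships:

```python
# jupyterhub_config.py (sketch; trait name, flags and placeholders as noted above)
c.JupyterHub.spawner_class = "batchspawner.CondorSpawner"

# Delegate connectivity to the batch system: tunnel the single-user server's
# port on the execute node back to a port on the hub via condor_ssh_to_job,
# so no direct inbound connectivity to the worker is required.
c.CondorSpawner.connect_to_job_cmd = (
    'condor_ssh_to_job -auto-retry {jobid} '
    '-ssh "ssh -N -L {port}:localhost:{rport}"'
)
```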

olifre avatar Dec 18 '20 09:12 olifre

Sounds like it would partially solve https://github.com/jupyterhub/batchspawner/issues/169, right? This, plus some client command to create new servers, would make it possible to spawn workers in arbitrary locations.

Hoeze avatar Jan 31 '21 19:01 Hoeze

Sounds like it would partially solve #169, right?

It's surely an ingredient, but the focus here is different: it makes it possible to connect to workers spawned by JupyterHub via a batch system command. In effect, if you spawn batch system workers in arbitrary locations and the batch system picks them up, you can also run notebooks there.

That's also one of the ways we'd like to use it: start an HTCondor startd, which registers with the batch system, and then let JupyterHub make use of it. This then works with arbitrary locations. The main ingredients here, though, are:

  • The batch system allowing workers to be added dynamically.
  • The batch system taking care of ensuring connectivity to that arbitrary location.

The necessary change in JupyterHub's batchspawner is that establishing connectivity can be delegated to an additional command (in the case of HTCondor, condor_ssh_to_job is used to delegate that to the batch system). This is what this PR implements.

So in short, yes, it can be seen as an ingredient also for the issue you linked :wink:.

olifre avatar Jan 31 '21 19:01 olifre

I pushed two more commits here which ensure no port collision happens for the forwarded port. This is achieved by choosing a random port on the Hub and forwarding the notebook's port to it, instead of using the same port on both ends.
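
The idea is the usual bind-to-port-0 trick (JupyterHub itself ships a helper along these lines); the snippet below only illustrates the mechanism and is not the code added in this PR:

```python
import socket

def random_free_hub_port() -> int:
    """Let the kernel pick an unused TCP port on the hub side.

    The connect command then forwards this hub-local port to the notebook's
    port on the execute node, so two jobs cannot collide on the hub.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind(("", 0))
        return sock.getsockname()[1]
```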

Again, this is implemented in a generalizable manner, i.e. it is optional, and could also be used for other SSH tunneling approaches — however, I only added the condor_ssh_to_job connect command for the CondorSpawner since this is what I can test with.

olifre avatar Mar 09 '21 21:03 olifre

Rebased to current master to get the new CI working.

olifre avatar Mar 09 '21 21:03 olifre

I pushed three more commits with a small improvement, and two more changes:

  • It is now possible to disable the connect_to_job feature for the CondorSpawner again, to allow reverting to the old behaviour if needed for any reason.
  • The string {host} is now replaced with self.ip in the connect_to_job_cmd. This may be useful if an implementation for other batch systems is added, or if somebody uses an additional service to proxy for these batch systems, as sketched below.
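
In configuration terms that means roughly the following; the disabling mechanism, trait names and placeholders are paraphrased from the description above, not checked against the diff, and SomeBatchSpawner is purely hypothetical:

```python
# Sketch only; actual trait names and defaults are defined by the PR.
# Revert to the old behaviour (direct connectivity from hub to execute node):
c.CondorSpawner.connect_to_job_cmd = ""

# Hypothetical non-Condor setup that tunnels via plain ssh;
# {host} is substituted with the spawner's self.ip.
c.SomeBatchSpawner.connect_to_job_cmd = (
    "ssh -N -L {port}:localhost:{rport} {host}"
)
```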

I see that this branch now has conflicts after https://github.com/jupyterhub/batchspawner/commit/1decdf259a25c8e33e90de7a5817da11a92c0326 which caused a lot of (useful) reformatting all over the project. I am not (yet) resolving these, but am certainly willing to do so once somebody steps up to start a review (to minimize effort, I'd prefer to resolve conflicts only once).

olifre avatar May 13 '21 20:05 olifre

I've taken the opportunity that 1.5 years have passed to:

  • Rebase this onto current master. Merging automatically was impossible due to the large reformatting, so I applied all changes one-by-one once more.
  • Adapt to changes in master, e.g. dropped support for JupyterHub <=0.7.
  • Add some fixes and disable the tests for connect_to_job, since I did not have a good idea of how to emulate background commands and port forwarding in the test suite.

After this, the PR applies cleanly on master again. Furthermore, CI runs fine apart from Python 3.5 tests, which appear to be broken by an unrelated issue.

Hope this helps; we are still using this in production quite successfully :+1:.

olifre avatar Jul 11 '22 21:07 olifre