Allow dictionary-based customization of exec_prefix for each primary function in SlurmSpawner (and other spawners)

Open abuettner93 opened this issue 7 months ago • 1 comments

I came across this issue while running batchspawner in a production environment. At scale, after spawning sessions, batchspawner uses the exec_prefix together with the batch_query_cmd to check job status, so every status check instantiates a sudo session on the login node. When the number of users on the server gets high (200+), the node gets overloaded with sudo sessions: it can bog down on CPU, or start over-logging once the maximum number of sudo sessions is hit.

The thing is, SLURM doesn't (always) require that the user who started a job be the one to check it. Where any user can check any other user's job status, the batch_query_cmd has no need for the exec_prefix.

I ended up forking the current release and overriding the query_job_status function inside the SlurmSpawner class, removing the use of the exec_prefix so that a sudo session is no longer opened every 30 seconds for each user. This is not an ideal way to solve the problem, though. I have given a better proposal below, as well as the (hacky) alternative solution that I implemented while prod was slowly burning to the ground.

Proposed change

Add an option to the JupyterHub configuration that specifies the exec_prefix for each of the three main functions (submit_batch_script, query_job_status, cancel_batch_job). Admins could then configure which prefix is used where, or set it to "" to turn off its use entirely.

This could be a dictionary with three keys, such as: {"submit_exec_prefix": "sudo -E -u {username}", "query_exec_prefix": "", "cancel_exec_prefix": "sudo -E -u {username}"}
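
A minimal sketch of how such a dictionary might be consumed, assuming plain `str.format` templating. The helper `build_command`, the fallback constant, and the simplified `squeue`/`scancel` command templates are hypothetical and only illustrate the lookup; they are not batchspawner's existing API.

```python
# Hypothetical fallback when a key is missing from the dictionary.
DEFAULT_EXEC_PREFIX = "sudo -E -u {username}"


def build_command(operation, command_template, exec_prefixes, **subvars):
    """Join the prefix configured for `operation` with the batch command.

    `operation` is one of "submit", "query" or "cancel" (assumed naming,
    matching the "<operation>_exec_prefix" keys proposed above).
    """
    key = f"{operation}_exec_prefix"
    prefix_template = exec_prefixes.get(key, DEFAULT_EXEC_PREFIX)
    prefix = prefix_template.format(**subvars)
    command = command_template.format(**subvars)
    # An empty prefix means the command runs directly, without sudo.
    return " ".join(part for part in (prefix, command) if part)


exec_prefixes = {
    "submit_exec_prefix": "sudo -E -u {username}",
    "query_exec_prefix": "",
    "cancel_exec_prefix": "sudo -E -u {username}",
}

# Simplified command templates, for illustration only.
query_cmd = build_command(
    "query", "squeue -h -j {job_id}", exec_prefixes,
    username="alice", job_id="1234",
)
cancel_cmd = build_command(
    "cancel", "scancel {job_id}", exec_prefixes,
    username="alice", job_id="1234",
)
print(query_cmd)
print(cancel_cmd)
```

With this configuration the status query would run without sudo, while cancel (and submit) would still run as the job owner.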

Alternative options

Alternatively, override the inherited definition of the query_job_status function with a definition inside the SlurmSpawner class and remove the use of exec_prefix from it (or add an option that allows the prefix to be turned off). This was my spur-of-the-moment solution earlier today.
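
The override pattern can be sketched roughly as follows. `StandInSpawner` is a hypothetical stand-in written for this example, not batchspawner's real base class, and the command template is simplified; the point is only that the subclass builds the query command without the prefix.

```python
class StandInSpawner:
    """Hypothetical stand-in for a batch spawner (not the real class)."""

    exec_prefix = "sudo -E -u {username}"
    batch_query_cmd = "squeue -h -j {job_id}"  # simplified for illustration

    def __init__(self, username, job_id):
        self.username = username
        self.job_id = job_id

    def query_command(self):
        # Default behaviour: every status query is wrapped in the prefix.
        subvars = {"username": self.username, "job_id": self.job_id}
        return " ".join(
            (
                self.exec_prefix.format(**subvars),
                self.batch_query_cmd.format(**subvars),
            )
        )


class NoPrefixQuerySpawner(StandInSpawner):
    """Overrides only the query step; other commands keep the prefix."""

    def query_command(self):
        # Skip exec_prefix: assumes the scheduler lets any user query the job.
        return self.batch_query_cmd.format(job_id=self.job_id)


default_query = StandInSpawner("alice", "1234").query_command()
patched_query = NoPrefixQuerySpawner("alice", "1234").query_command()
print(default_query)
print(patched_query)
```

The drawback of this approach is that the override has to be kept in sync with upstream changes to the overridden method.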

Who would use this feature?

I believe this feature would be useful for those who need to customize which commands use the exec_prefix. I use SlurmSpawner specifically, but I'm sure this feature would be useful in the other resource manager spawners as well.

(Optional): Suggest a solution

I feel like my proposed change is the most practical solution, at least from what I can see.

Thanks for reading :)

abuettner93 · Dec 05 '23 02:12

Thank you for opening your first issue in this project! Engagement like this is essential for open source projects! :hugs:
If you haven't done so already, check out Jupyter's Code of Conduct. Also, please try to follow the issue template as it helps other community members to contribute more effectively. You can meet the other Jovyans by joining our Discourse forum. There is also an intro thread there where you can stop by and say Hi! :wave:
Welcome to the Jupyter community! :tada:

welcome[bot] · Dec 05 '23 02:12