
Running a singularity image container with SLURM and batchspawner

geninv opened this issue 2 years ago • 2 comments

Bug description

We are trying to launch a singularity image container with SLURM. JupyterHub is installed in a virtual machine and launches the singularity image containing JupyterLab in a job. The SLURM job is launched correctly, but it encounters an error before the process is created inside the SLURM job.
From what we can read in the logs, it seems that batchspawner expects a Python script to launch, but the command line it builds uses the singularity binary instead.

Something to note is that batchspawner worked with singularity in version 0.8.2 but not in version 1.1.0. We think this is because the batchspawner wrapper expects a Python script. Do you think it could work if we wrapped the call to the singularity binary in a Python script? Or is there some other way to make them work together?
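To sketch the wrapper idea from the question above (hypothetical and untested; the wrapper name is made up, though the singularity path is taken from the logs further down): `batchspawner-singleuser` hands the command to `runpy.run_path()`, which only accepts Python source, so a thin Python launcher that replaces itself with the singularity process would at least give it something compilable:

```python
#!/usr/bin/env python3
"""Hypothetical wrapper (e.g. singularity-wrapper.py) so that
batchspawner-singleuser receives a Python file it can compile,
which then replaces itself with the real singularity process."""
import os
import sys

# Site-specific path to the singularity binary (from the SLURM logs).
SINGULARITY = "/softs/rh7/singularity/3.5.3/bin/singularity"


def build_argv(args):
    # Forward every argument given to the wrapper on to singularity.
    return [SINGULARITY] + list(args)


if __name__ == "__main__" and os.path.exists(SINGULARITY):
    # Replace the wrapper process with singularity itself, so signals
    # and exit codes propagate back to the SLURM job unchanged.
    os.execv(SINGULARITY, build_argv(sys.argv[1:]))
```

Whether this is enough depends on how batchspawner splits and re-quotes the rest of the command line, so treat it as a starting point rather than a fix.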

Expected behaviour

The job is launched and we get access to the JupyterLab instance inside the singularity image.

Actual behaviour

The job encounters an error. We get a Python error in the SLURM logs:

Traceback (most recent call last):
  File "/softs/rh7/conda-envs/pangeo_latest/bin/batchspawner-singleuser", line 6, in <module>
    main()
  File "/softs/rh7/conda-envs/pangeo_202202/lib/python3.9/site-packages/batchspawner/singleuser.py", line 23, in main
    run_path(cmd_path, run_name="__main__")
  File "/softs/rh7/conda-envs/pangeo_202202/lib/python3.9/runpy.py", line 269, in run_path
    code, fname = _get_code_from_file(run_name, path_name)
  File "/softs/rh7/conda-envs/pangeo_202202/lib/python3.9/runpy.py", line 244, in _get_code_from_file
    code = compile(f.read(), fname, 'exec')
ValueError: source code string cannot contain null bytes
srun: error: node539: task 0: Exited with exit code 1
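The traceback is consistent with what batchspawner 1.x does: `singleuser.py` passes the command to `runpy.run_path()`, which compiles the target file as Python source, and an ELF binary such as singularity contains null bytes that `compile()` rejects. A minimal standalone reproduction (not from the report) of the same `ValueError`:

```python
# Minimal reproduction of the error above: runpy.run_path() compiles
# its target as Python source, and any file containing null bytes
# (e.g. an ELF binary like singularity) makes compile() fail.
import os
import runpy
import tempfile

# Write a fake "binary" starting with the ELF magic plus null bytes.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"\x7fELF\x00\x00not really a script")
    path = f.name

try:
    runpy.run_path(path, run_name="__main__")
except ValueError as err:
    message = str(err)
    print(message)  # source code string cannot contain null bytes
finally:
    os.unlink(path)
```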

How to reproduce

Request a job running a singularity image using batchspawner.

Configuration
# jupyterhub_config.py
c.JupyterHub.authenticator_class = 'ldapauthenticator.LDAPAuthenticator'
c.JupyterHub.bind_url = 'http://127.0.0.1:8000'
c.JupyterHub.cleanup_servers = False

c.JupyterHub.db_url = 'sqlite:////etc/jupyterhub/jupyterhub.sqlite'
c.JupyterHub.hub_ip = '0.0.0.0'
c.JupyterHub.hub_port = <hub_port>
import batchspawner
c.JupyterHub.spawner_class = 'wrapspawner.ProfilesCmdSpawner'
c.Spawner.http_timeout = 120
c.BatchSpawnerBase.req_nprocs = '1'
c.BatchSpawnerBase.req_runtime = '12:00:00'
c.BatchSpawnerBase.req_memory = '4000mb'
c.BatchSpawnerBase.req_prologue = '''
source ~/.bashrc
export JUPYTER_PATH=$JUPYTER_PATH:/softs/rh7/jupyter_kernels/
export PS1='hub-[\\u@\\h \\W]\\$'
module load latex
echo "INFO | Using default notebook env : pangeo_latest"
module load conda
conda activate /softs/rh7/conda-envs/pangeo_latest
unset PKG_CONFIG_PATH
unset PYTHONPATH
'''
c.BatchSpawnerBase.req_queue = 'qdev'
c.BatchSpawnerBase.exec_prefix = 'sudo -E -u {username} env PATH=$PATH'
c.SlurmSpawner.batch_script = '''#!/bin/sh
#SBATCH --output={{homedir}}/jupyterhub_slurmspawner_%j.log
#SBATCH --job-name=spawner-jupyterhub
#SBATCH --chdir={{homedir}}
#SBATCH --export=ALL
#SBATCH --get-user-env=L
{% if partition  %}#SBATCH --partition={{partition}}
{% endif %}{% if runtime    %}#SBATCH --time={{runtime}}
{% endif %}{% if memory     %}#SBATCH --mem={{memory}}
{% endif %}{% if gres       %}#SBATCH --gres={{gres}}
{% endif %}{% if nprocs     %}#SBATCH --cpus-per-task={{nprocs}}
{% endif %}{% if reservation%}#SBATCH --reservation={{reservation}}
{% endif %}{% if options    %}#SBATCH {{options}}{% endif %}
trap 'echo SIGTERM received' TERM
{{prologue}}
{% if srun %}srun {% endif %}{{cmd}}
echo "jupyterhub-singleuser ended gracefully"
{{epilogue}}
'''
c.ProfilesSpawner.profiles = [
   ('Standard (visu) - 1 core, 5 GB, 1 week -- Default', 'qnotebook1c5g', 'batchspawner.SlurmSpawner',
      dict(req_nprocs='1', req_queue='qnotebook', req_runtime='168:00:00', req_memory='5GB')),
   ('Standard (visu) - 4 cores, 20 GB, 1 week', 'qnotebook4c20g', 'batchspawner.SlurmSpawner',
      dict(req_nprocs='4', req_queue='qnotebook', req_runtime='168:00:00', req_memory='20GB')),
   ('Qdev - 1 cores, 4 GB, 12 hours', 'qdev1c4g', 'batchspawner.SlurmSpawner',
      dict(req_nprocs='1', req_queue='qdev', req_memory='4GB')),
   ('Qdev - 4 cores, 15 GB, 12 hours', 'qdev4c15g', 'batchspawner.SlurmSpawner',
      dict(req_nprocs='4', req_queue='qdev', req_memory='15GB')),
   ('Qdev full node - 16 cores,  60GB', 'qdevfull', 'batchspawner.SlurmSpawner',
      dict(req_nprocs='16', req_queue='qdev', req_memory='60GB')),
   ('Batch - 1 cores, 5 GB, 12 hours', 'batch1c5g12h', 'batchspawner.SlurmSpawner',
      dict(req_nprocs='1', req_queue='batch', req_runtime='12:00:00', req_memory='5GB')),
   ('Batch - 1 cores, 5 GB, 72 hours', 'batch1c5g12h', 'batchspawner.SlurmSpawner',
      dict(req_nprocs='1', req_queue='batch', req_runtime='72:00:00', req_memory='5GB')),
   ('Batch - 4 cores, 20 GB, 12 hours', 'batch4c20g12h', 'batchspawner.SlurmSpawner',
      dict(req_nprocs='4', req_queue='batch', req_runtime='12:00:00', req_memory='20GB')),
   ('Batch full node - 24 cores, 120 GB, 12 hours', 'batchfull12h', 'batchspawner.SlurmSpawner',
      dict(req_nprocs='24', req_queue='batch', req_runtime='12:00:00', req_memory='120GB')),
   ('Batch 2019 full node - 40 cores, 184 GB, 12 hours', 'batch2019full12h', 'batchspawner.SlurmSpawner',
      dict(req_nprocs='40', req_queue='batch', req_runtime='12:00:00', req_memory='184GB')),
   ('GPGPU - 1 gpgpu T4 -- Default to use for GPU, 8 cores, 92 GB, 4 hours', 'gpu4h', 'batchspawner.SlurmSpawner',
      dict(req_nprocs='8', req_queue='qgpgpudev', req_runtime='04:00:00', req_memory='92GB'))
   ]
SINGULARITY_BIND_OPTS = "$HOME:$HOME,/work/scratch/$USER:/scratch,/softs:/softs,/work:/work,/datalake:/datalake,/etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem,/etc/pki/tls/certs/ca-bundle.crt:/etc/pki/tls/certs/ca-bundle.crt"
c.ProfilesCmdSpawner.env_list = [
    ('Default lab environnement - without VRE (all groups)',
     'jupyter-labhub'),
    ('VRECNES (vrecnes group only)',
     f'{SINGULARITY_BIN} run --nv --add-caps CAP_NET_BIND_SERVICE --bind {SINGULARITY_BIND_OPTS} /softs/projets/datalabs/images/souche/vrecnes-stable.simg --notebook-dir=$HOME'),
    ('VREOT (vreot group only)',
     f'{SINGULARITY_BIN} run --nv --add-caps CAP_NET_BIND_SERVICE --bind {SINGULARITY_BIND_OPTS} /softs/projets/datalabs/images/thematique/OT/vreot-stable.simg --notebook-dir=$HOME'),
    ('VREOT (All kernels, vreot group only)',
     f'{SINGULARITY_BIN} run --nv --add-caps CAP_NET_BIND_SERVICE --bind {SINGULARITY_BIND_OPTS} /softs/projets/datalabs/images/thematique/OT/vreot-all_kernels.simg --notebook-dir=$HOME'),
    ('VREAI4GEO (ai4geo group only)',
     f'{SINGULARITY_BIN} run --nv --add-caps CAP_NET_BIND_SERVICE --bind {SINGULARITY_BIND_OPTS} /softs/projets/ai4geo/singularity/vreai4geo-stable.simg --notebook-dir=$HOME'),
    ('VRECESWOT (swotce_exp group only)',
     f'{SINGULARITY_BIN} run --nv --add-caps CAP_NET_BIND_SERVICE --bind {SINGULARITY_BIND_OPTS} /softs/projets/swotce/singularity/exp/vreceswot-stable.simg --notebook-dir=$HOME'),
]
c.JupyterHub.pid_file = '/etc/jupyterhub/pid'
c.JupyterHub.services = [
    {
        "name": "service-token",
        "admin": True,
        "api_token": "<api_token>",
    },
]
c.Spawner.cmd = ['${JUPYTERHUB_SINGLEUSER_CMD:-jupyter-labhub}']
c.Spawner.default_url = '/lab'
c.Spawner.ip = '0.0.0.0'
c.Spawner.poll_interval = 120
Logs
# Log Jupyterhub
Apr  6 09:46:51 tu-juphub-q01 jupyterhub: [I 2022-04-06 09:46:51.113 JupyterHub log:189] 200 GET /hub/home (@XX.XX.XX.XX) 83.27ms
Apr  6 09:46:52 tu-juphub-q01 jupyterhub: [I 2022-04-06 09:46:52.705 JupyterHub log:189] 200 GET /hub/spawn/XX (@XX.XX.XX.XX) 8.73ms
Apr  6 09:46:57 tu-juphub-q01 jupyterhub: [I 2022-04-06 09:46:57.575 JupyterHub roles:477] Adding role server to token: <APIToken('822d...', user='XX', client_id='jupyterhub')>
Apr  6 09:46:57 tu-juphub-q01 jupyterhub: [I 2022-04-06 09:46:57.592 JupyterHub provider:607] Creating oauth client jupyterhub-user-XX
Apr  6 09:46:57 tu-juphub-q01 jupyterhub: [I 2022-04-06 09:46:57.644 JupyterHub batchspawner:262] Spawner submitting job using sudo -E -u XX env PATH=$PATH sbatch --parsable
Apr  6 09:46:57 tu-juphub-q01 jupyterhub: [I 2022-04-06 09:46:57.644 JupyterHub batchspawner:263] Spawner submitted script:
Apr  6 09:46:57 tu-juphub-q01 jupyterhub: #!/bin/sh
Apr  6 09:46:57 tu-juphub-q01 jupyterhub: #SBATCH --output=/home/XX/jupyterhub_slurmspawner_%j.log
Apr  6 09:46:57 tu-juphub-q01 jupyterhub: #SBATCH --job-name=spawner-jupyterhub
Apr  6 09:46:57 tu-juphub-q01 jupyterhub: #SBATCH --chdir=/home/XX
Apr  6 09:46:57 tu-juphub-q01 jupyterhub: #SBATCH --export=ALL
Apr  6 09:46:57 tu-juphub-q01 jupyterhub: #SBATCH --get-user-env=L
Apr  6 09:46:57 tu-juphub-q01 jupyterhub: #SBATCH --time=168:00:00
Apr  6 09:46:57 tu-juphub-q01 jupyterhub: #SBATCH --mem=5GB
Apr  6 09:46:57 tu-juphub-q01 jupyterhub: #SBATCH --cpus-per-task=1
Apr  6 09:46:57 tu-juphub-q01 jupyterhub: trap 'echo SIGTERM received' TERM
Apr  6 09:46:57 tu-juphub-q01 jupyterhub: source ~/.bashrc
Apr  6 09:46:57 tu-juphub-q01 jupyterhub: # Default hub environments
Apr  6 09:46:57 tu-juphub-q01 jupyterhub: # export JUPYTER_PATH=$JUPYTER_PATH:/work/logiciels/rh7/Python/jupyter_data/share/jupyter/
Apr  6 09:46:57 tu-juphub-q01 jupyterhub: export JUPYTER_PATH=$JUPYTER_PATH:/softs/rh7/jupyter_kernels/
Apr  6 09:46:57 tu-juphub-q01 jupyterhub: # PS1 is exported below to allow LaTeX to be used in a
Apr  6 09:46:57 tu-juphub-q01 jupyterhub: # non-interactive job
Apr  6 09:46:57 tu-juphub-q01 jupyterhub: export PS1='hub-[\u@\h \W]\$'
Apr  6 09:46:57 tu-juphub-q01 jupyterhub: module load latex
Apr  6 09:46:57 tu-juphub-q01 jupyterhub: # echo "INFO | Using default notebook env: pangeo_202106"
Apr  6 09:46:57 tu-juphub-q01 jupyterhub: # module load conda
Apr  6 09:46:57 tu-juphub-q01 jupyterhub: #conda activate /softs/rh7/conda-envs/pangeo_202106
Apr  6 09:46:57 tu-juphub-q01 jupyterhub: echo "INFO | Using default notebook env : pangeo_latest"
Apr  6 09:46:57 tu-juphub-q01 jupyterhub: module load conda
Apr  6 09:46:57 tu-juphub-q01 jupyterhub: conda activate /softs/rh7/conda-envs/pangeo_latest
Apr  6 09:46:57 tu-juphub-q01 jupyterhub: unset PKG_CONFIG_PATH
Apr  6 09:46:57 tu-juphub-q01 jupyterhub: unset PYTHONPATH
Apr  6 09:46:57 tu-juphub-q01 jupyterhub: srun batchspawner-singleuser ${JUPYTERHUB_SINGLEUSER_CMD:-/softs/rh7/singularity/3.5.3/bin/singularity run --nv --add-caps CAP_NET_BIND_SERVICE --bind $HOME:$HOME,/work/scratch/$USER:/scratch,/softs:/softs,/work:/work,/datalake:/datalake,/etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem,/etc/pki/tls/certs/ca-bundle.crt:/etc/pki/tls/certs/ca-bundle.crt /softs/projets/datalabs/images/souche/vrecnes-stable.simg --notebook-dir=$HOME}
Apr  6 09:46:57 tu-juphub-q01 jupyterhub: echo "jupyterhub-singleuser ended gracefully"
Apr  6 09:46:57 tu-juphub-q01 jupyterhub: [I 2022-04-06 09:46:57.799 JupyterHub batchspawner:266] Job submitted. cmd: sudo -E -u XX env PATH=$PATH sbatch --parsable output: 975
Apr  6 09:46:58 tu-juphub-q01 jupyterhub: [W 2022-04-06 09:46:58.570 JupyterHub base:187] Rolling back dirty objects IdentitySet([<Server(0.0.0.0:0)>])
Apr  6 09:46:58 tu-juphub-q01 jupyterhub: [I 2022-04-06 09:46:58.590 JupyterHub log:189] 302 POST /hub/spawn/XX -> /hub/spawn-pending/XX (@XX.XX.XX.XX) 1011.48ms
Apr  6 09:46:58 tu-juphub-q01 jupyterhub: [I 2022-04-06 09:46:58.653 JupyterHub pages:400] XX is pending spawn
Apr  6 09:46:58 tu-juphub-q01 jupyterhub: [I 2022-04-06 09:46:58.659 JupyterHub log:189] 200 GET /hub/spawn-pending/XX (@XX.XX.XX.XX) 13.75ms
Apr  6 09:47:07 tu-juphub-q01 jupyterhub: [W 2022-04-06 09:47:07.570 JupyterHub base:1043] User XX is slow to start (timeout=10)
Apr  6 09:47:15 tu-juphub-q01 jupyterhub: [I 2022-04-06 09:47:15.125 JupyterHub log:189] 200 POST /hub/api/batchspawner (@XX.XX.XX.XX) 17.82ms
Apr  6 09:47:15 tu-juphub-q01 jupyterhub: [I 2022-04-06 09:47:15.573 JupyterHub batchspawner:419] Notebook server job 975 started at node539:52266
Apr  6 09:49:04 tu-juphub-q01 jupyterhub: [W 2022-04-06 09:49:04.585 JupyterHub user:811] XX server never showed up at http://node539:52266/user/XX/ after 120 seconds. Giving up.
Apr  6 09:49:04 tu-juphub-q01 jupyterhub: Common causes of this timeout, and debugging tips:
Apr  6 09:49:04 tu-juphub-q01 jupyterhub: 1. The server didn't finish starting,
Apr  6 09:49:04 tu-juphub-q01 jupyterhub: or it crashed due to a configuration issue.
Apr  6 09:49:04 tu-juphub-q01 jupyterhub: Check the single-user server's logs for hints at what needs fixing.
Apr  6 09:49:04 tu-juphub-q01 jupyterhub: 2. The server started, but is not accessible at the specified URL.
Apr  6 09:49:04 tu-juphub-q01 jupyterhub: This may be a configuration issue specific to your chosen Spawner.
Apr  6 09:49:04 tu-juphub-q01 jupyterhub: Check the single-user server logs and resource to make sure the URL
Apr  6 09:49:04 tu-juphub-q01 jupyterhub: is correct and accessible from the Hub.
Apr  6 09:49:04 tu-juphub-q01 jupyterhub: 3. (unlikely) Everything is working, but the server took too long to respond.
Apr  6 09:49:04 tu-juphub-q01 jupyterhub: To fix: increase `Spawner.http_timeout` configuration
Apr  6 09:49:04 tu-juphub-q01 jupyterhub: to a number of seconds that is enough for servers to become responsive.
Apr  6 09:49:04 tu-juphub-q01 jupyterhub: [E 2022-04-06 09:49:04.791 JupyterHub gen:623] Exception in Future <Task finished name='Task-27' coro=<BaseHandler.spawn_single_user.<locals>.finish_user_spawn() done, defined at /srv/conda-2022-04-06-09-20-34/lib/python3.8/site-packages/jupyterhub/handlers/base.py:934> exception=TimeoutError("Server at http://node539:52266/user/XX/ didn't respond in 120 seconds")> after timeout
Apr  6 09:49:04 tu-juphub-q01 jupyterhub: Traceback (most recent call last):
Apr  6 09:49:04 tu-juphub-q01 jupyterhub: File "/srv/conda-2022-04-06-09-20-34/lib/python3.8/site-packages/tornado/gen.py", line 618, in error_callback
Apr  6 09:49:04 tu-juphub-q01 jupyterhub: future.result()
Apr  6 09:49:04 tu-juphub-q01 jupyterhub: File "/srv/conda-2022-04-06-09-20-34/lib/python3.8/site-packages/jupyterhub/handlers/base.py", line 941, in finish_user_spawn
Apr  6 09:49:04 tu-juphub-q01 jupyterhub: await spawn_future
Apr  6 09:49:04 tu-juphub-q01 jupyterhub: File "/srv/conda-2022-04-06-09-20-34/lib/python3.8/site-packages/jupyterhub/user.py", line 792, in spawn
Apr  6 09:49:04 tu-juphub-q01 jupyterhub: await self._wait_up(spawner)
Apr  6 09:49:04 tu-juphub-q01 jupyterhub: File "/srv/conda-2022-04-06-09-20-34/lib/python3.8/site-packages/jupyterhub/user.py", line 836, in _wait_up
Apr  6 09:49:04 tu-juphub-q01 jupyterhub: raise e
Apr  6 09:49:04 tu-juphub-q01 jupyterhub: File "/srv/conda-2022-04-06-09-20-34/lib/python3.8/site-packages/jupyterhub/user.py", line 806, in _wait_up
Apr  6 09:49:04 tu-juphub-q01 jupyterhub: resp = await server.wait_up(
Apr  6 09:49:04 tu-juphub-q01 jupyterhub: File "/srv/conda-2022-04-06-09-20-34/lib/python3.8/site-packages/jupyterhub/utils.py", line 241, in wait_for_http_server
Apr  6 09:49:04 tu-juphub-q01 jupyterhub: re = await exponential_backoff(
Apr  6 09:49:04 tu-juphub-q01 jupyterhub: File "/srv/conda-2022-04-06-09-20-34/lib/python3.8/site-packages/jupyterhub/utils.py", line 189, in exponential_backoff
Apr  6 09:49:04 tu-juphub-q01 jupyterhub: raise asyncio.TimeoutError(fail_message)
Apr  6 09:49:04 tu-juphub-q01 jupyterhub: asyncio.exceptions.TimeoutError: Server at http://node539:52266/user/XX/ didn't respond in 120 seconds
Apr  6 09:49:04 tu-juphub-q01 jupyterhub: [I 2022-04-06 09:49:04.796 JupyterHub log:189] 200 GET /hub/api/users/geninv/server/progress [email protected] 125979.14ms
# SLURM log
Traceback (most recent call last):
  File "/softs/rh7/conda-envs/pangeo_latest/bin/batchspawner-singleuser", line 6, in <module>
    main()
  File "/softs/rh7/conda-envs/pangeo_202202/lib/python3.9/site-packages/batchspawner/singleuser.py", line 23, in main
    run_path(cmd_path, run_name="__main__")
  File "/softs/rh7/conda-envs/pangeo_202202/lib/python3.9/runpy.py", line 269, in run_path
    code, fname = _get_code_from_file(run_name, path_name)
  File "/softs/rh7/conda-envs/pangeo_202202/lib/python3.9/runpy.py", line 244, in _get_code_from_file
    code = compile(f.read(), fname, 'exec')
ValueError: source code string cannot contain null bytes
srun: error: node539: task 0: Exited with exit code 1
jupyterhub-singleuser ended gracefully

geninv avatar Apr 12 '22 09:04 geninv

Thank you for opening your first issue in this project! Engagement like this is essential for open source projects! :hugs:
If you haven't done so already, check out Jupyter's Code of Conduct. Also, please try to follow the issue template as it helps other community members to contribute more effectively. You can meet the other Jovyans by joining our Discourse forum. There is also an intro thread there where you can stop by and say Hi! :wave:
Welcome to the Jupyter community! :tada:

welcome[bot] avatar Apr 12 '22 09:04 welcome[bot]

I have got this working with LSF: we had to ensure that batchspawner is installed inside the singularity image, if that helps?

jbeal-work avatar Sep 29 '22 09:09 jbeal-work
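For readers landing here, one hypothetical way to act on the suggestion above is to run `batchspawner-singleuser` inside the container rather than outside, by making singularity the outer command in the batch script. The image path and bind options below are placeholders, not a tested configuration:

```python
# jupyterhub_config.py -- hypothetical sketch, untested.
# batchspawner must be installed *inside* the image, so that
# runpy.run_path() receives the Python entry point
# jupyterhub-singleuser instead of the singularity ELF binary.
c.SlurmSpawner.batch_script = '''#!/bin/sh
#SBATCH --output={{homedir}}/jupyterhub_slurmspawner_%j.log
#SBATCH --chdir={{homedir}}
{{prologue}}
srun singularity exec --bind $HOME:$HOME /path/to/image.simg \\
    batchspawner-singleuser jupyterhub-singleuser
{{epilogue}}
'''
```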