SOS! Help! Cannot run container image nemo:25.07.gpt_oss! error: pyxis: error: a container with name "pyxis_nemo-25.07.gpt_oss" does not exist, and --container-image is not set
Hi there,
SOS! Could you please kindly help me? I just couldn't get pyxis to work:
Thanks!
enroot import 'docker://nvcr.io#nvidia/nemo:25.07.gpt_oss'
enroot create --name pyxis_nemo-25.07.gpt_oss nvidia+nemo+25.07.gpt_oss.sqsh
Since then, I have slightly updated one of the sub-folders inside the container using the latest main branch of the repo, so I would rather not pull directly from the NVIDIA registry in srun.
mike@ln01:~/enroot$ enroot list
cache
data
nemo-25.04
pyxis_hiyouga
pyxis_nemo-25.07.gpt_oss
pyxis_nemo-25.07.gpt_oss.sqsh
mike@ln01:~/enroot$ enroot version
3.5.0
mike@ln01:~/enroot$ ll
total 15122412
drwxr-xr-x 2 mike domain users 4096 Sep 25 17:05 ./
drwx------ 2 mike domain users 4096 Sep 25 16:43 ../
drwx------ 2 mike domain users 4096 Jun 30 11:48 cache/
drwx------ 2 mike domain users 4096 Sep 25 16:21 data/
drwxr-xr-x 2 mike domain users 4096 Jul 14 01:33 nemo-25.04/
drwxr-xr-x 2 mike domain users 4096 May 11 03:21 pyxis_hiyouga/
drwxrwxrwx 2 mike domain users 4096 Sep 21 17:57 pyxis_nemo-25.07.gpt_oss/
-rw-r----- 1 mike domain users 15485349888 Sep 24 22:53 pyxis_nemo-25.07.gpt_oss.sqsh
**mike@ln01:~/enroot$ srun --partition=gpu-interactive --account=gpu --container-name="nemo-25.07.gpt_oss" /bin/hostname**
srun: job 1276806 queued and waiting for resources
srun: job 1276806 has been allocated resources
slurmstepd-gn12: error: pyxis: error: a container with name "pyxis_nemo-25.07.gpt_oss" does not exist, and --container-image is not set
slurmstepd-gn12: error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
slurmstepd-gn12: error: Failed to invoke spank plugin stack
srun: error: gn12: task 0: Exited with exit code 1
mike@ln01:~/enroot$ ll
total 15122412
drwxr-xr-x 2 mike domain users 4096 Sep 25 17:26 ./
drwx------ 2 mike domain users 4096 Sep 25 16:43 ../
drwx------ 2 mike domain users 4096 Jun 30 11:48 cache/
drwx------ 2 mike domain users 4096 Sep 25 16:21 data/
drwxr-xr-x 2 mike domain users 4096 Jul 14 01:33 nemo-25.04/
drwxr-xr-x 2 mike domain users 4096 May 11 03:21 pyxis_hiyouga/
drwxrwxrwx 2 mike domain users 4096 Sep 21 17:57 pyxis_nemo-25.07.gpt_oss/
-rw-r----- 1 mike domain users 15485349888 Sep 24 22:53 pyxis_nemo-25.07.gpt_oss.sqsh
mike@ln01:~/enroot$ echo $ENROOT_DATA_PATH
/home/mike/enroot
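The value above is what the login node reports. The job step runs on a compute node (gn12 here), and its enroot configuration may point somewhere else. Below is a minimal sketch for comparing the two. It assumes that `enroot` is on the PATH of the compute node and that a plain `sh -c` step without a container is allowed on this partition; neither is confirmed in this thread. The script only prints the srun line, so it can be inspected before it is run:

```shell
#!/bin/sh
# Sketch: see which enroot data path and containers a job step sees.
# Partition and account are the ones used in the commands in this thread.
PARTITION="gpu-interactive"
ACCOUNT="gpu"

# Runs on the compute node: print the data path seen there, then list containers.
REMOTE_CHECK='echo "ENROOT_DATA_PATH=${ENROOT_DATA_PATH:-<unset>}"; enroot list'

CHECK_CMD="srun --partition=${PARTITION} --account=${ACCOUNT} sh -c '${REMOTE_CHECK}'"

# Printed rather than executed, so the sketch is safe to run anywhere.
echo "$CHECK_CMD"
```

If the list printed on the compute node does not contain pyxis_nemo-25.07.gpt_oss, that would be consistent with the "does not exist" error, although it would not by itself explain why.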
Note the following had no problem:
mike@ln01:~/enroot$ srun --nodes=1 --ntasks=1 --partition=gpu-interactive --account=gpu --container-image=docker://ubuntu:22.04 /bin/echo "Pyxis works"
srun: job 1276824 queued and waiting for resources
srun: job 1276824 has been allocated resources
pyxis: importing docker image: docker://ubuntu:22.04
pyxis: imported docker image: docker://ubuntu:22.04
Pyxis works
I have to get the locally updated, unpacked rootfs folder working (since I already did the enroot import, then the enroot create, and then made some local updates).
I have also made the unpacked rootfs folder into a squashfs file:
# Create a squashfs file from your unpacked rootfs
mksquashfs pyxis_nemo-25.07.gpt_oss pyxis_nemo-25.07.gpt_oss.sqsh -noappend
When I used this image file, it hung:
mike@ln01:~/enroot$ srun --nodes=1 --ntasks=1 --partition=gpu-interactive --account=gpu --container-image=/home/mike/enroot/pyxis_nemo-25.07.gpt_oss.sqsh /bin/echo "Pyxis works"
srun: job 1276833 queued and waiting for resources
srun: job 1276833 has been allocated resources
[* HUNG! I had to type CTRL+C to stop it *]
^Csrun: interrupt (one more within 1 sec to abort)
srun: StepId=1276833.0 task 0: running
^Csrun: sending Ctrl-C to StepId=1276833.0
srun: forcing job termination
srun: Job step aborted: Waiting up to 62 seconds for job step to finish.
slurmstepd-gn12: error: *** STEP 1276833.0 ON gx12 CANCELLED AT 2025-09-25T18:09:16 ***
slurmstepd-gn12: error: pyxis: child 1070021 terminated with signal 9
^Csrun: sending Ctrl-C to StepId=1276833.0
srun: job abort in progress
I recall that mksquashfs printed a warning when I created it. I am not sure whether this warning is the problem:
Parallel mksquashfs: Using 96 processors
Creating 4.0 filesystem on pyxis_nemo-25.07.gpt_oss.sqsh, block size 131072.
Could not open pyxis_nemo-25.07.gpt_oss/etc/slurm, skipping...
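Since mksquashfs reports skipping a path it could not open, it may help to list everything under the rootfs that the current user cannot read before packing it. Below is a minimal sketch, assuming GNU find (the `-readable` test is a GNU extension) and the directory name used above. It only shows what would be skipped; it does not establish whether the skipped path is related to the hang.

```shell
#!/bin/sh
# Sketch: list entries under a rootfs that the current user cannot read,
# since mksquashfs skips anything it cannot open.
# ROOTFS defaults to the directory name used in this thread.
ROOTFS="${ROOTFS:-pyxis_nemo-25.07.gpt_oss}"

list_unreadable() {
    # -readable is a GNU find extension; stderr is dropped to hide
    # "Permission denied" noise from directories find cannot descend into.
    find "$1" ! -readable -print 2>/dev/null
}

if [ -d "$ROOTFS" ]; then
    list_unreadable "$ROOTFS"
else
    echo "rootfs directory not found: $ROOTFS" >&2
fi
```

An empty result means every entry is readable by the current user; any path that is printed is a candidate for the "Could not open ..., skipping..." warning.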
Hi @tjoymeed :
> And then since then I updated one of the sub-folders inside the container slightly by using the latest repo main branch,
In that case, could you just mount the latest NeMo instead of modifying the sqsh? I have never tried modifying the sqsh, so I would suggest one of the following:
- Either mount the latest NeMo to /opt/NeMo
- Or build a new container with a Dockerfile for this
@tjoymeed Note that pyxis only supports a docker URI (e.g. docker://ubuntu:22.04) or a squashfs file.
Our recommended workflow (what we use internally all the time) is just to mount the NeMo directory inside if all that's changing is NeMo.
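As a rough illustration of the mount workflow, the sketch below builds an srun line that bind-mounts a local NeMo checkout over /opt/NeMo using the pyxis `--container-mounts` option (SRC:DST pairs). The checkout location `NEMO_SRC` is a hypothetical path, and the partition and account are copied from the commands earlier in this thread. The script prints the command instead of running it:

```shell
#!/bin/sh
# Sketch of the bind-mount workflow; adjust the paths to your site.
# NEMO_SRC is a hypothetical location of a fresh NeMo git clone.
NEMO_SRC="${NEMO_SRC:-$HOME/src/NeMo}"

# Unmodified image, using the same URI form as the enroot import in this thread.
IMAGE="docker://nvcr.io#nvidia/nemo:25.07.gpt_oss"

# /opt/NeMo is the destination suggested above.
MOUNTS="${NEMO_SRC}:/opt/NeMo"

MOUNT_CMD="srun --partition=gpu-interactive --account=gpu --container-image=${IMAGE} --container-mounts=${MOUNTS} ls /opt/NeMo"

# Printed rather than executed, so the sketch is safe to run anywhere.
echo "$MOUNT_CMD"
```

If the `ls` output inside the container matches the local checkout, the mount is in place, and the image itself never needs to be modified or repacked.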
Okay, could you please provide a fully reproducible example (including the sbatch script, the Python script, and the sample data) for the performance script: https://github.com/NVIDIA-NeMo/NeMo/blob/main/scripts/performance/llm/pretrain_qwen3_30b_a3b.py
I am using exactly the same hardware as in your recommended configs:
2 nodes with 16 80GB VRAM H100s.
Thanks a lot!
@terrykong @suiyoubi
> Our recommended workflow (what we use internally all the time) is just to mount the NeMo directory inside if all that's changing is NeMo.
Interesting!
Could you please give me a sample call showing how you mount the latest NeMo directory (a fresh git clone) into the older container image (say, nemo:25.07.gpt_oss)?
And also, what about the Megatron-Core directory?
What are the key directories that I should keep updated through the mounting method you've suggested? I am guessing the NeMo directory and the Megatron-Core directory. What else?
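One way to find out which directories matter is to ask the container's Python where it imports each package from. Below is a minimal sketch, assuming the packages are importable as `nemo` and `megatron.core` inside the image. It does not assume any particular install path, and it prints the srun line rather than running it:

```shell
#!/bin/sh
# Sketch: discover where the container imports NeMo and Megatron-Core from.
IMAGE="docker://nvcr.io#nvidia/nemo:25.07.gpt_oss"

# Python one-liner executed inside the container; prints each package's location.
PROBE='import nemo, megatron.core as mcore; print(nemo.__file__); print(mcore.__file__)'

PROBE_CMD="srun --partition=gpu-interactive --account=gpu --container-image=${IMAGE} python -c '${PROBE}'"

# Printed rather than executed, so the sketch is safe to run anywhere.
echo "$PROBE_CMD"
```

The printed paths are the locations a mount would need to cover for a newer checkout to take effect.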
Thanks a lot!