[Bug Report] destination /storage doesn't exist in container

Open LarsDoorenbos opened this issue 1 year ago • 12 comments

Describe the bug

When following these steps to deploy Orbit on a Slurm cluster, container creation fails with a fatal error stating that the destination /storage doesn't exist in container.

Steps to reproduce

Follow the steps in the Cluster guide to run Orbit on an HPC cluster with Slurm.

My local machine starts a job on the cluster, which in turn starts the job that builds the apptainer. The container creation fails due to a mounting error. The log and error from the job that builds the apptainer look as follows:

(run_singularity.py): Called on compute node with arguments
WARNING: nv files may not be bound with --writable
WARNING: By using --writable, Apptainer can't create /storage destination automatically without overlay or underlay
FATAL:   container creation failed: mount hook function failure: mount /var/apptainer/mnt/session/storage->/storage error: while mounting /var/apptainer/mnt/session/storage: destination /storage doesn't exist in container
(run_singularity.py): Return

This error is mentioned in some docs, where the note says to "add directories in the container for each of the bind mounts explicitly", but it is unclear to me how to fix it in this context.
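
To make that note concrete: with --writable, each bind destination apparently has to exist inside the container already. Below is a minimal sketch of a pre-flight check, assuming the container is an unpacked sandbox directory on the host and that binds use the `src[:dest[:opts]]` form. The function name `missing_bind_destinations` and the example paths are hypothetical and not part of Orbit's scripts.

```python
from pathlib import Path


def missing_bind_destinations(sandbox_root, bind_specs):
    """Return bind destinations that do not exist as directories in the sandbox."""
    missing = []
    for spec in bind_specs:
        parts = spec.split(":")
        # Without an explicit destination, the bind targets the same path as the source.
        dest = parts[1] if len(parts) > 1 and parts[1] else parts[0]
        # Map the absolute in-container path onto the sandbox directory on the host.
        host_view = Path(sandbox_root) / dest.lstrip("/")
        if not host_view.is_dir():
            missing.append(dest)
    return missing


if __name__ == "__main__":
    # Hypothetical sandbox location and bind list, for illustration only.
    print(missing_bind_destinations("./orbit.sif", ["/storage", "/tmp/ov:/root/.cache/ov:rw"]))
```

An empty result only means every destination is present in the sandbox; it does not by itself guarantee that the mount succeeds. The check also only covers directory binds and does not treat symlinks inside the sandbox specially.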

My docker/.env does not specify /storage as a path anywhere, only as a prefix:

# Accept the NVIDIA Omniverse EULA by default
ACCEPT_EULA=Y
# NVIDIA Isaac Sim version to use (e.g. 2022.2.1)
ISAACSIM_VERSION=2023.1.0-hotfix.1
# Derived from the default path in the NVIDIA provided Isaac Sim container
DOCKER_ISAACSIM_PATH=/isaac-sim
# Docker user directory - by default this is the root user's home directory
DOCKER_USER_HOME=/root

###
# Cluster specific settings
###

# Docker cache dir for Isaac Sim (has to end on docker-isaac-sim)
# e.g. /cluster/scratch/$USER/docker-isaac-sim
CLUSTER_ISAAC_SIM_CACHE_DIR=/storage/workspaces/a*****/w****/lars/docker-isaac-sim
# Orbit directory on the cluster (has to end on orbit)
# e.g. /cluster/home/$USER/orbit
CLUSTER_ORBIT_DIR=/storage/homefs/l******/orbit
# Cluster login
CLUSTER_LOGIN=*****@****.ch
# Cluster scratch directory to store the SIF file
# e.g. /cluster/scratch/$USER
CLUSTER_SIF_PATH=/storage/workspaces/a*****/w****/lars
# Python executable within orbit directory to run with the submitted job
CLUSTER_PYTHON_EXECUTABLE=source/standalone/tutorials/00_sim/create_empty.py
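
Although /storage never appears on its own in this file, it is the first component of every cluster path. One plausible reading (an inference, not something the log confirms) is that Apptainer tries to mount that top-level directory. A small sketch that lists the top-level directory behind each path variable; the helper name `top_level_dirs` is hypothetical:

```python
from pathlib import PurePosixPath


def top_level_dirs(env_text):
    """Map each variable holding an absolute path to that path's first component."""
    roots = {}
    for line in env_text.splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        if value.startswith("/"):
            parts = PurePosixPath(value).parts  # e.g. ('/', 'storage', 'homefs', ...)
            if len(parts) > 1:
                roots[key] = "/" + parts[1]
    return roots


if __name__ == "__main__":
    with open("docker/.env") as f:
        for name, root in top_level_dirs(f.read()).items():
            print(f"{name}: {root}")
```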

System Info

Describe the characteristic of your environment:

  • Commit: 963f304
  • Isaac Sim Version: 2023.1.0-hotfix.1
  • OS (cluster): CentOS Linux 7 (Core)
  • GPU: RTX 3090
  • CUDA: 12.3
  • GPU Driver: 545.23.08

Checklist

  • [x] I have checked that there is no similar issue in the repo (required)
  • [x] I have checked that the issue is not in running Isaac Sim itself and is related to the repo

LarsDoorenbos avatar Jan 08 '24 14:01 LarsDoorenbos

@AutonomousHansen would you be able to help with this? Thanks

masoudmoghani avatar Jan 08 '24 17:01 masoudmoghani

@pascal-roth Do you have any ideas here? The only place I can see that a bind mount could potentially be causing problems is here, but the error message is saying that the directory doesn't exist on the container instead of the host?

hhansen-bdai avatar Jan 08 '24 18:01 hhansen-bdai

@AutonomousHansen I agree that line is the most probable cause. @LarsDoorenbos, can you ensure that the logs directory exists within your orbit directory on the cluster? It is not synced to the cluster, so it can be missing the first time you run the code.

pascal-roth avatar Jan 09 '24 09:01 pascal-roth

@pascal-roth Yes, the logs directory exists in the cluster Orbit directory.

LarsDoorenbos avatar Jan 09 '24 09:01 LarsDoorenbos

In the line mentioned by @AutonomousHansen, this logs directory is mounted to /workspace/orbit/logs, but adding ls /workspace to the run script gives ls: cannot access /workspace: No such file or directory. Maybe it should be mounted to a different place?

EDIT: changing /workspace/orbit to /storage/homefs/l******/orbit where orbit is located still gives the same error.

LarsDoorenbos avatar Jan 09 '24 09:01 LarsDoorenbos

That is expected: /workspace is defined within your Docker image and cannot be accessed from outside. It is the directory into which Orbit is copied and installed during the Docker build process (see here).

Try to comment out the line where logs are bound to the image, then we can be certain if this is causing the issue.

pascal-roth avatar Jan 09 '24 10:01 pascal-roth

Removing the -B $CLUSTER_ORBIT_DIR/logs:/workspace/orbit/logs:rw \ line still gives the same error.

LarsDoorenbos avatar Jan 09 '24 10:01 LarsDoorenbos

Manually adding the /storage folder to the orbit.sif folder does work, but now it gives an error with another folder:

(run_singularity.py): Called on compute node with arguments
WARNING: nv files may not be bound with --writable
WARNING: By using --writable, Apptainer can't create /root/.cache/ov destination automatically without overlay or underlay
FATAL:   container creation failed: mount hook function failure: mount /scratch/local/4319483/docker-isaac-sim/cache/ov->/root/.cache/ov error: while mounting /scratch/local/4319483/docker-isaac-sim/cache/ov: destination /root/.cache/ov doesn't exist in container
(run_singularity.py): Return

However, unlike before, /root/.cache/ov does exist in the orbit.sif folder, so I cannot do the same trick again...

Removing some of the binds gives the same error for a different bind, e.g. FATAL: container creation failed: mount hook function failure: mount /scratch/local/4319775/docker-isaac-sim/documents->/root/Documents error: while mounting /scratch/local/4319775/docker-isaac-sim/documents: destination /root/Documents doesn't exist in container, so something seems to be going wrong with the mounting in general.
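
To compare these failures across runs, the source and destination of each failing mount can be pulled out of the job log. A minimal sketch, assuming the FATAL lines keep the `mount <source>-><destination> error:` shape seen above; the function name `failed_mounts` and the log file name are hypothetical:

```python
import re

# Matches the "mount <source>-><destination> error:" part of a FATAL line.
MOUNT_RE = re.compile(r"mount (\S+)->(\S+) error:")


def failed_mounts(log_text):
    """Return (source, destination) pairs for every failed mount in the log."""
    pairs = []
    for line in log_text.splitlines():
        if not line.startswith("FATAL"):
            continue
        match = MOUNT_RE.search(line)
        if match:
            pairs.append((match.group(1), match.group(2)))
    return pairs


if __name__ == "__main__":
    with open("slurm_job.log") as f:  # hypothetical log file name
        for source, destination in failed_mounts(f.read()):
            print(f"{source} -> {destination}")
```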

LarsDoorenbos avatar Jan 09 '24 18:01 LarsDoorenbos

Which Apptainer version are you using on the cluster?

pascal-roth avatar Jan 10 '24 19:01 pascal-roth

apptainer version 1.1.3-1.el7. Maybe I should ask for an update ;)

LarsDoorenbos avatar Jan 11 '24 07:01 LarsDoorenbos

I agree; it seems like a general mounting error. It is difficult to reproduce from our side as we are running apptainer version 1.2.5-1.el7.

pascal-roth avatar Jan 11 '24 17:01 pascal-roth

For now, we found a different machine on which to run Orbit. Thanks anyway!

LarsDoorenbos avatar Jan 12 '24 07:01 LarsDoorenbos

Closing this issue for now, as it seems to be resolved.

pascal-roth avatar Sep 09 '24 16:09 pascal-roth