NVIDIA Jetson AGX, singularity exec --nv Could not find any nv files on this host!
Version of Singularity
$ singularity --version
singularity-ce version 3.8.0
Describe the bug When running the Singularity image on an NVIDIA Jetson AGX, Singularity cannot find any nv files.
To Reproduce Steps to reproduce the behavior: We use the Singularity image from https://github.com/vras-robotour/deploy on an NVIDIA Jetson. Run the following command in the deploy directory:
$ ./scripts/start_singularity.sh --nv
=========== STARTING SINGULARITY CONTAINER ============
INFO: Singularity is already installed.
INFO: Updating repository to the latest version.
Already up to date.
INFO: Mounting /snap directory.
INFO: Starting Singularity container from image robotour_arm64.simg.
INFO: Could not find any nv files on this host!
INFO: The catkin workspace is already initialized.
================== UPDATING PACKAGES ==================
INFO: Updating the package naex to the latest version.
Already up to date.
INFO: Updating the package robotour to the latest version.
Already up to date.
INFO: Updating the package map_data to the latest version.
Already up to date.
INFO: Updating the package test_package to the latest version.
Already up to date.
=======================================================
INFO: Starting interactive bash while sourcing the workspace.
Expected behavior The nv files are found and we are able to use PyTorch with CUDA inside the container.
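A minimal check of this, assuming python3 and a PyTorch wheel are present in the image (image name taken from the reproduction step above):
$ singularity exec --nv robotour_arm64.simg python3 -c 'import torch; print(torch.cuda.is_available())'
# Expected: True once the nv files are bound and a CUDA-enabled wheel is installed.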
OS / Linux Distribution
$ cat /etc/os-release
NAME="Ubuntu"
VERSION="18.04.6 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.6 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic
Installation Method Installed using the steps detailed here: https://docs.sylabs.io/guides/3.8/admin-guide/installation.html.
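For reference, a minimal sketch of the from-source install described in that guide (the tarball URL is assumed from the SingularityCE release naming convention; Go and the build dependencies from the guide must already be present):
$ wget https://github.com/sylabs/singularity/releases/download/v3.8.0/singularity-ce-3.8.0.tar.gz
$ tar -xzf singularity-ce-3.8.0.tar.gz && cd singularity-ce-3.8.0
$ ./mconfig && make -C builddir && sudo make -C builddir install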
Additional context
We have nvidia-container-cli installed:
$ nvidia-container-cli --version
version: 0.9.0+beta1
build date: 2019-06-24T22:00+00:00
build revision: 77c1cbc2f6595c59beda3699ebb9d49a0a8af426
build compiler: aarch64-linux-gnu-gcc-7 7.4.0
build platform: aarch64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -g3 -D JETSON=TRUE -DNDEBUG -std=gnu11 -O0 -g3 -fdata-sections -ffunction-sections -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections
$ nvidia-container-cli list --binaries --libraries
/usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1.1
/usr/lib/aarch64-linux-gnu/tegra/libnvidia-ptxjitcompiler.so.440.18
/usr/lib/aarch64-linux-gnu/tegra/libnvidia-fatbinaryloader.so.440.18
/usr/lib/aarch64-linux-gnu/tegra/libnvidia-eglcore.so.32.5.1
/usr/lib/aarch64-linux-gnu/tegra/libnvidia-glcore.so.32.5.1
/usr/lib/aarch64-linux-gnu/tegra/libnvidia-tls.so.32.5.1
/usr/lib/aarch64-linux-gnu/tegra/libnvidia-glsi.so.32.5.1
/usr/lib/aarch64-linux-gnu/tegra/libGLX_nvidia.so.0
/usr/lib/aarch64-linux-gnu/tegra-egl/libEGL_nvidia.so.0
/usr/lib/aarch64-linux-gnu/tegra-egl/libGLESv2_nvidia.so.2
/usr/lib/aarch64-linux-gnu/tegra-egl/libGLESv1_CM_nvidia.so.1
$ nvidia-container-cli list --ipcs
strace output of ./scripts/start_singularity.sh is available here.
Hi @vlk-jan, thanks for the report. On the surface of it, this looks similar to https://github.com/sylabs/singularity/issues/1850. As noted there, the NVIDIA Container CLI is no longer used on Tegra-based systems. There is some hope that the new --oci mode introduced in Singularity 4.x might help with this, but it has not been confirmed. If you're able to give that a go and report back, it would be appreciated. Thanks!
Hi, thanks for your swift reply.
I do have some updates.
Similarity to previous issue
I do agree that it seems similar to #1850; however, I believe the problem there was that no libraries were exported at all. We do have some exported, as we are using quite an old version of nvidia-container-cli. This is why I opened a new issue instead of commenting on the old one.
When trying to reproduce our problems on a Jetson Orin, as opposed to the Jetson Xavier where this was first encountered, we also saw that no libraries were provided (with a fresh install of the nvidia-container package).
Odd behavior in binding nv libraries
After some more digging, I found that while the script says "Could not find any nv files on this host!", all of the libraries from nvidia-container-cli list --binaries --libraries are bound into the /.singularity.d/libs/ directory, which seems odd.
The log from the execution with the -v and -d flags is here. Line 17 shows the "could not find" message, and lines 136-149 show that the libraries are added and later mounted.
PyTorch inside the container still does not support CUDA, but that is probably a problem on our side, as we were using the wrong wheel and were unable to fix it.
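A hedged way to inspect this from the host (image name as in the log above; with --nv, the bound libraries should land in /.singularity.d/libs/, which Singularity adds to LD_LIBRARY_PATH):
$ singularity exec --nv robotour_arm64.simg ls /.singularity.d/libs/
$ singularity exec --nv robotour_arm64.simg sh -c 'echo $LD_LIBRARY_PATH'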
Singularity 4.1
I tried installing Singularity 4.1 on the Jetson but was unsuccessful. The problem seems to be with libfuse-dev, as Ubuntu 18.04 only provides libfuse2. A manual installation of libfuse3 failed for some reason.
I may try that again later, but because of that, I do not have any feedback on the --oci mode for you.
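For reference, libfuse 3 normally builds from source with meson/ninja; a rough sketch, assuming the fuse-3.10.5 tag and that bionic's packaged meson may be too old:
$ sudo apt-get install -y ninja-build pkg-config
$ pip3 install --user meson
$ git clone --depth 1 --branch fuse-3.10.5 https://github.com/libfuse/libfuse.git
$ cd libfuse && mkdir build && cd build
$ meson .. && ninja && sudo ninja install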
> Similarity to previous issue
> I do agree that it seems similar to #1850; however, I believe the problem there was that no libraries were exported at all. We do have some exported, as we are using quite an old version of nvidia-container-cli. This is why I opened a new issue instead of commenting on the old one. When trying to reproduce our problems on a Jetson Orin, as opposed to the Jetson Xavier where this was first encountered, we also saw that no libraries were provided (with a fresh install of the nvidia-container package).
Ah, that makes sense. It looks like this was deprecated in v1.10.0 of the NVIDIA Container Toolkit (https://github.com/NVIDIA/nvidia-container-toolkit/issues/90#issuecomment-1673183086), so as you say, that wouldn't be what you're hitting.
> Odd behavior in binding nv libraries
> After some more digging, I found that while the script says "Could not find any nv files on this host!", all of the libraries from nvidia-container-cli list --binaries --libraries are bound into the /.singularity.d/libs/ directory, which seems odd. The log from the execution with the -v and -d flags is here. Line 17 shows the "could not find" message, and lines 136-149 show that the libraries are added and later mounted. PyTorch inside the container still does not support CUDA, but that is probably a problem on our side, as we were using the wrong wheel and were unable to fix it.
Taking a quick scan through the code of that version of Singularity, it looks like that warning is emitted specifically when no bins or ipcs are found:
https://github.com/sylabs/singularity/blob/673570ce999bd3f84458247fcd0698528877cdcd/cmd/internal/cli/actions_linux.go#L347-L351
The libraries are handled separately:
https://github.com/sylabs/singularity/blob/673570ce999bd3f84458247fcd0698528877cdcd/cmd/internal/cli/actions_linux.go#L364-L369
So that looks like it's functioning as expected, based on the output you shared from nvidia-container-cli list --binaries --libraries.
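That condition can be confirmed on the host with the commands already shown above; if both print nothing, the 3.8.x warning is expected even though the --libraries list is non-empty:
$ nvidia-container-cli list --binaries
$ nvidia-container-cli list --ipcs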
Newer versions of SingularityCE don't use nvidia-container-cli to find the library list when only the --nv flag is specified. We only call nvidia-container-cli if --nvccli is also specified, in which case it performs the container setup and SingularityCE is not performing the bindings itself.
If you use a current version of SingularityCE, run with --nv only, and are able to provide a complete list of required libraries in the etc/nvliblist.conf file, then it's likely that the binding will work as expected.
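A hypothetical sketch of that approach (the config path depends on the install prefix; the library names to append are the Tegra ones reported by nvidia-container-cli above):
$ find /usr/local/etc/singularity /etc/singularity -name nvliblist.conf 2>/dev/null
$ echo 'libnvidia-eglcore.so' | sudo tee -a /usr/local/etc/singularity/nvliblist.conf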
Given the deprecation of nvidia-container-cli for Tegra-based systems, we aren't going to be able to handle library binding via nvidia-container-cli. The future of GPU support on Jetson revolves around CDI, which we support in our --oci mode.
Jetson support for native mode (without --oci) would depend on https://github.com/sylabs/singularity/issues/1395 - so it'd be appropriate to add a comment there if it's important to you.
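For completeness, a hedged sketch of the CDI route in --oci mode (assumes SingularityCE 4.x and a recent NVIDIA Container Toolkit; on Jetson the CDI spec may need to be generated in CSV mode, and these flags have not been verified on Tegra):
$ sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
$ singularity exec --oci --device nvidia.com/gpu=all docker://ubuntu ls /dev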
See also:
- https://github.com/NVIDIA/nvidia-container-toolkit/issues/90#issuecomment-1673183086
- https://github.com/sylabs/singularity/issues/1850