
NVIDIA Jetson AGX, singularity exec --nv Could not find any nv files on this host!

Open vlk-jan opened this issue 10 months ago • 3 comments

Version of Singularity

$ singularity --version
singularity-ce version 3.8.0

Describe the bug: When running the Singularity image on an NVIDIA Jetson AGX, Singularity cannot find any nv files.

To Reproduce: We use the Singularity image from https://github.com/vras-robotour/deploy on an NVIDIA Jetson, and run the following command in the deploy directory:

$ ./scripts/start_singularity.sh --nv

=========== STARTING SINGULARITY CONTAINER ============

INFO: Singularity is already installed.
INFO: Updating repository to the latest version.
Already up to date.
INFO: Mounting /snap directory.
INFO: Starting Singularity container from image robotour_arm64.simg.
INFO: Could not find any nv files on this host!
INFO: The catkin workspace is already initialized.

================== UPDATING PACKAGES ==================

INFO: Updating the package naex to the latest version.
Already up to date.
INFO: Updating the package robotour to the latest version.
Already up to date.
INFO: Updating the package map_data to the latest version.
Already up to date.
INFO: Updating the package test_package to the latest version.
Already up to date.

=======================================================

INFO: Starting interactive bash while sourcing the workspace.

Expected behavior: The nv files are found and we are able to use PyTorch with CUDA.
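
For clarity, the check we expect to pass is roughly the following (a sketch only, using the image name from the repository above):

$ singularity exec --nv robotour_arm64.simg python3 -c "import torch; print(torch.cuda.is_available())"

This should print True; at the moment CUDA is not available inside the container.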

OS / Linux Distribution

$ cat /etc/os-release
NAME="Ubuntu"
VERSION="18.04.6 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.6 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic

Installation Method Installed using the steps detailed here: https://docs.sylabs.io/guides/3.8/admin-guide/installation.html.

Additional context: We have nvidia-container-cli installed:

$ nvidia-container-cli --version
version: 0.9.0+beta1
build date: 2019-06-24T22:00+00:00
build revision: 77c1cbc2f6595c59beda3699ebb9d49a0a8af426
build compiler: aarch64-linux-gnu-gcc-7 7.4.0
build platform: aarch64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -g3 -D JETSON=TRUE -DNDEBUG -std=gnu11 -O0 -g3 -fdata-sections -ffunction-sections -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections
$ nvidia-container-cli list --binaries --libraries
/usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1.1
/usr/lib/aarch64-linux-gnu/tegra/libnvidia-ptxjitcompiler.so.440.18
/usr/lib/aarch64-linux-gnu/tegra/libnvidia-fatbinaryloader.so.440.18
/usr/lib/aarch64-linux-gnu/tegra/libnvidia-eglcore.so.32.5.1
/usr/lib/aarch64-linux-gnu/tegra/libnvidia-glcore.so.32.5.1
/usr/lib/aarch64-linux-gnu/tegra/libnvidia-tls.so.32.5.1
/usr/lib/aarch64-linux-gnu/tegra/libnvidia-glsi.so.32.5.1
/usr/lib/aarch64-linux-gnu/tegra/libGLX_nvidia.so.0
/usr/lib/aarch64-linux-gnu/tegra-egl/libEGL_nvidia.so.0
/usr/lib/aarch64-linux-gnu/tegra-egl/libGLESv2_nvidia.so.2
/usr/lib/aarch64-linux-gnu/tegra-egl/libGLESv1_CM_nvidia.so.1
$ nvidia-container-cli list --ipcs

strace output of ./scripts/start_singularity.sh is available here.

vlk-jan avatar Apr 05 '24 13:04 vlk-jan

Hi @vlk-jan, thanks for the report. On the surface of it, this looks similar to https://github.com/sylabs/singularity/issues/1850. As noted there, the NVIDIA Container CLI is no longer used on Tegra-based systems. There is some hope that the new --oci mode introduced in Singularity 4.x might help with this, but it has not been confirmed. If you're able to give that a go and report back, it would be appreciated. Thanks!
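
If you do give it a try, the invocation would be roughly along these lines (a sketch only; it assumes SingularityCE 4.x and that the image can be run in OCI mode, which may require rebuilding it as an OCI-SIF):

$ singularity exec --oci --nv robotour_arm64.simg python3 -c "import torch; print(torch.cuda.is_available())"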

tri-adam avatar Apr 05 '24 15:04 tri-adam

Hi, thanks for your swift reply.

I do have some updates.

Similarity to previous issue: I agree that it seems similar to #1850; however, I believe the problem there was that no libraries were exported at all. In our case some libraries are exported, since we are using quite an old version of nvidia-container-cli, which is why I opened a new issue instead of commenting on the old one. When trying to reproduce our problem on a Jetson Orin, as opposed to the Jetson Xavier where it was first encountered, we also saw that no libraries were provided (with a fresh install of the nvidia-container package).

Odd behavior in binding nv libraries: After some more digging, I found that while the script says Could not find any nv files on this host!, all of the libraries from nvidia-container-cli list --binaries --libraries are bound into the /.singularity.d/libs/ directory, which seems odd. The log from the execution with the -v and -d flags is here. Line 17 shows the "could not find" message, and lines 136-149 show that the libraries are added and later mounted. PyTorch inside the container still does not support CUDA, but that is probably a problem on our side, as we were using the wrong wheel and have not been able to fix it.
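
For what it's worth, this is how I checked the binding (illustrative only): the files reported by nvidia-container-cli above all show up when listing the libs directory from inside the container.

$ singularity exec --nv robotour_arm64.simg ls /.singularity.d/libs/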

Singularity 4.1: I tried installing Singularity 4.1 on the Jetson but was unsuccessful. The problem seems to be with libfuse-dev: for Ubuntu 18.04, only libfuse2 is available. Manual installation of libfuse3 failed for some reason; I may try that again later. Because of that, I do not yet have any feedback on the --oci mode for you.
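
In case it helps anyone reproduce the manual libfuse3 attempt, the upstream build is roughly the following meson/ninja sequence (a sketch only; the meson shipped with Ubuntu 18.04 may be too old, which might be part of what failed here):

$ git clone https://github.com/libfuse/libfuse.git
$ cd libfuse && mkdir build && cd build
$ meson setup .. && ninja
$ sudo ninja install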

vlk-jan avatar Apr 05 '24 21:04 vlk-jan

Similarity to previous issue: I agree that it seems similar to #1850; however, I believe the problem there was that no libraries were exported at all. In our case some libraries are exported, since we are using quite an old version of nvidia-container-cli, which is why I opened a new issue instead of commenting on the old one. When trying to reproduce our problem on a Jetson Orin, as opposed to the Jetson Xavier where it was first encountered, we also saw that no libraries were provided (with a fresh install of the nvidia-container package).

Ah, that makes sense. It looks like this was deprecated in v1.10.0 of the NVIDIA Container Toolkit (https://github.com/NVIDIA/nvidia-container-toolkit/issues/90#issuecomment-1673183086), so as you say, that wouldn't be what you're hitting.

Odd behavior in binding nv libraries: After some more digging, I found that while the script says Could not find any nv files on this host!, all of the libraries from nvidia-container-cli list --binaries --libraries are bound into the /.singularity.d/libs/ directory, which seems odd. The log from the execution with the -v and -d flags is here. Line 17 shows the "could not find" message, and lines 136-149 show that the libraries are added and later mounted. PyTorch inside the container still does not support CUDA, but that is probably a problem on our side, as we were using the wrong wheel and have not been able to fix it.

Taking a quick scan through the code of that version of Singularity, it looks like that warning is emitted specifically when no binaries or IPCs are found:

https://github.com/sylabs/singularity/blob/673570ce999bd3f84458247fcd0698528877cdcd/cmd/internal/cli/actions_linux.go#L347-L351

The libraries are handled separately:

https://github.com/sylabs/singularity/blob/673570ce999bd3f84458247fcd0698528877cdcd/cmd/internal/cli/actions_linux.go#L364-L369

So that looks like it's functioning as expected based on the output you shared from nvidia-container-cli list --binaries --libraries.

tri-adam avatar Apr 09 '24 18:04 tri-adam

Newer versions of SingularityCE don't use nvidia-container-cli to find the library list when only the --nv flag is specified. We only call nvidia-container-cli if --nvccli is also specified, in which case it performs container setup, and SingularityCE is not performing the bindings itself.
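
In other words, the two invocations behave differently (illustrative sketch; flags as in current SingularityCE):

$ singularity exec --nv image.sif <command>            # SingularityCE binds the libraries listed in etc/nvliblist.conf
$ singularity exec --nv --nvccli image.sif <command>   # container setup is delegated to nvidia-container-cli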

If you use a current version of SingularityCE, run with --nv only, and are able to provide a complete list of the required libraries in the etc/nvliblist.conf file, then it's likely that the binding will work as expected.
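
As a rough illustration only (names taken from the nvidia-container-cli output earlier in this issue; the complete set required on Tegra may differ, and the host libraries must be resolvable via ldconfig), the extra entries in etc/nvliblist.conf would look something like:

libcuda.so
libnvidia-ptxjitcompiler.so
libnvidia-fatbinaryloader.so
libnvidia-eglcore.so
libnvidia-glcore.so
libnvidia-tls.so
libnvidia-glsi.so
libGLX_nvidia.so
libEGL_nvidia.so
libGLESv2_nvidia.so
libGLESv1_CM_nvidia.so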

Given the deprecation of nvidia-container-cli for Tegra-based systems, we aren't going to be able to handle library binding via nvidia-container-cli. The future of GPU support on Jetson revolves around CDI, which we support in our --oci mode.
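
For reference, the CDI route is roughly the following (a sketch only; it assumes a recent NVIDIA Container Toolkit providing nvidia-ctk, SingularityCE 4.x, and an image runnable in OCI mode; on Tegra the csv mode of nvidia-ctk may be needed):

$ sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
$ singularity exec --oci --device nvidia.com/gpu=all image.sif <command>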

Jetson support for native mode (without --oci) would depend on https://github.com/sylabs/singularity/issues/1395 - so it'd be appropriate to add a comment there if it's important to you.

See also:

  • https://github.com/NVIDIA/nvidia-container-toolkit/issues/90#issuecomment-1673183086
  • https://github.com/sylabs/singularity/issues/1850

dtrudg avatar Jun 14 '24 09:06 dtrudg