Wrap NVTOP in docker (Impossible to initialize nvidia nvml)

chichivica opened this issue 5 years ago • 11 comments

Hi guys, thanks for the awesome tool. Could you give an example of how to wrap nvtop in Docker?

Unfortunately, this Dockerfile:

FROM nvidia/cuda

RUN apt-get update && \
    apt-get install -y cmake libncurses5-dev libncursesw5-dev git && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/* && \
    rm -rf /work/*


RUN ln -s /usr/local/cuda-10.1/targets/x86_64-linux/lib/stubs/libnvidia-ml.so /usr/local/lib/libnvidia-ml.so && \
    ln -s /usr/local/cuda-10.1/targets/x86_64-linux/lib/stubs/libnvidia-ml.so /usr/local/lib/libnvidia-ml.so.1



RUN cd /tmp && \
    git clone https://github.com/Syllo/nvtop.git && \
    mkdir -p nvtop/build && cd nvtop/build && \
    cmake .. && \
    make && \
    make install && \
    cd / && \
    rm -r /tmp/nvtop


CMD ["/usr/local/bin/nvtop"]

Results in:

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
WARNING:

You should always run with libnvidia-ml.so that is installed with your
NVIDIA Display Driver. By default it's installed in /usr/lib and /usr/lib64.
libnvidia-ml.so in GDK package is a stub library that is attached only for
build purposes (e.g. machine that you build your application doesn't have
to have Display Driver installed).
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Impossible to initialize nvidia nvml : 
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
WARNING:

You should always run with libnvidia-ml.so that is installed with your
NVIDIA Display Driver. By default it's installed in /usr/lib and /usr/lib64.
libnvidia-ml.so in GDK package is a stub library that is attached only for
build purposes (e.g. machine that you build your application doesn't have
to have Display Driver installed).
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

When I try to run with:

docker run --runtime=nvidia nvtop

Any ideas?

chichivica avatar May 14 '19 12:05 chichivica

@chichivica I did it in my repository and uploaded the image to Docker Hub; you can run it with the following command:

docker run --runtime nvidia --rm -ti 69guitar1015/nvtop

cafeal avatar Aug 21 '19 04:08 cafeal

@chichivica, you forgot to remove the stub .so symlinks after building in the Dockerfile. I was able to build the current nvtop version with the following Dockerfile:

FROM nvidia/cuda

RUN apt-get update && \
    apt-get install -y cmake libncurses5-dev libncursesw5-dev git && \
    rm -rf /var/lib/apt/lists/*

RUN ln -s /usr/local/cuda-10.1/targets/x86_64-linux/lib/stubs/libnvidia-ml.so /usr/local/lib/libnvidia-ml.so && \
    ln -s /usr/local/cuda-10.1/targets/x86_64-linux/lib/stubs/libnvidia-ml.so /usr/local/lib/libnvidia-ml.so.1 && \
    cd /tmp && \
    git clone https://github.com/Syllo/nvtop.git && \
    mkdir -p nvtop/build && cd nvtop/build && \
    cmake .. && \
    make && \
    make install && \
    cd / && \
    rm -r /tmp/nvtop && \
    rm /usr/local/lib/libnvidia-ml.so && \
    rm /usr/local/lib/libnvidia-ml.so.1

ENTRYPOINT ["/usr/local/bin/nvtop"]
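
Once built, the image can be run along these lines (the tag is just an example; on Docker 19.03+, --gpus all also works in place of --runtime nvidia):

docker build -t nvtop .
# the NVIDIA runtime mounts the real libnvidia-ml.so into the container at run time
docker run --runtime nvidia --rm -it nvtop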

RuRo avatar Nov 22 '19 15:11 RuRo

Thanks @RuRo, it worked!

lamhoangtung avatar Jan 24 '20 05:01 lamhoangtung

I'm trying to do this in conjunction with the tensorflow Dockerfile and it isn't working.

The problem seems to be that libnvidia-ml is in a different location:

/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1
/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.430.50
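
For anyone else checking where the library actually ended up, a quick way is to search from inside the container (with the NVIDIA runtime active this also shows the driver library mounted from the host):

find / -name 'libnvidia-ml*' 2>/dev/null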

I tried modifying the dockerfile as follows, but no luck.

FROM tensorflow/tensorflow:2.2.0rc3-gpu

RUN apt-get update && apt-get install -y --no-install-recommends \
    bzip2 ca-certificates libglib2.0-0 libxext6 libsm6 libxrender1 \
    libsox-fmt-all sox libsox-dev \
    tmux zsh vim wget git \
    nano google-perftools \
    cmake libncurses5-dev libncursesw5-dev

RUN ln -s /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1 /usr/local/lib/libnvidia-ml.so && \
    ln -s /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1 /usr/local/lib/libnvidia-ml.so.1 && \
    cd /tmp && \
    git clone https://github.com/Syllo/nvtop.git && \
    mkdir -p nvtop/build && cd nvtop/build && \
    cmake .. -DNVML_RETRIEVE_HEADER_ONLINE=True && \
    make && \
    make install && \
    cd / && \
    rm -r /tmp/nvtop && \
    rm /usr/local/lib/libnvidia-ml.so && \
    rm /usr/local/lib/libnvidia-ml.so.1

Without the -DNVML_RETRIEVE_HEADER_ONLINE=True option, I get:

CMake Error at /usr/share/cmake-3.10/Modules/FindPackageHandleStandardArgs.cmake:137 (message):
  Could NOT find NVML (missing: NVML_INCLUDE_DIRS)
Call Stack (most recent call first):
  /usr/share/cmake-3.10/Modules/FindPackageHandleStandardArgs.cmake:378 (_FPHSA_FAILURE_MESSAGE)
  cmake/modules/FindNVML.cmake:52 (FIND_PACKAGE_HANDLE_STANDARD_ARGS)
  CMakeLists.txt:31 (find_package)

If I add the -DNVML_RETRIEVE_HEADER_ONLINE=True option (as shown in the Dockerfile above), I get:

make[2]: *** No rule to make target '/usr/local/lib/libnvidia-ml.so', needed by 'src/nvtop'.  Stop.
make[1]: *** [src/CMakeFiles/nvtop.dir/all] Error 2

Any ideas?

lminer avatar Apr 23 '20 17:04 lminer

@lminer you don't need the real libnvidia-ml.so file, you need the stubs. AFAIK, attempting to use the actual Nvidia shared objects during docker build will always fail, because the shared objects shouldn't actually be inside the container. Instead, they are mounted from the host by the Nvidia Runtime (you can tell by the driver version 430.50 in the .so filename). docker build doesn't use the Nvidia Runtime by default, so the actual .so files won't be there during the build.
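
You can see this for yourself with something like the following (a sketch; substitute an image tag you actually have locally):

# without the runtime: no driver libraries in the image
docker run --rm nvidia/cuda:10.1-base sh -c 'ls /usr/lib/x86_64-linux-gnu | grep nvidia'
# with the runtime: the host's driver libraries appear
docker run --runtime nvidia --rm nvidia/cuda:10.1-base sh -c 'ls /usr/lib/x86_64-linux-gnu | grep nvidia'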

It seems that the tensorflow folks decided to use the nvidia/cuda:*-base-* images, which only have the bare minimum required to use GPUs, and to provide every build dependency themselves. The base and runtime images don't have any stubs, so you are out of luck.

You'll either have to build tensorflow on your own with nvidia/cuda:*-devel-* as a base image or provide your own stub .so files. Well, maybe I am missing some third option, but eh.
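
For the "provide your own stubs" route, a rough multi-stage sketch (untested; the devel tag and the stub path are assumptions and must match your CUDA version):

FROM nvidia/cuda:10.1-devel-ubuntu18.04 AS stubs

FROM tensorflow/tensorflow:2.2.0rc3-gpu
# copy only the NVML stub out of the devel image
COPY --from=stubs /usr/local/cuda-10.1/targets/x86_64-linux/lib/stubs/libnvidia-ml.so /usr/local/lib/libnvidia-ml.so
RUN ln -s /usr/local/lib/libnvidia-ml.so /usr/local/lib/libnvidia-ml.so.1
# ... then build nvtop as in the Dockerfiles above and delete the stub and symlink afterwards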

RuRo avatar Apr 23 '20 21:04 RuRo

@RuRo Thanks for such a comprehensive explainer. I'll give that a shot!

lminer avatar Apr 23 '20 22:04 lminer

@RuRo

I was able to build the current nvtop version with the following Dockerfile:

FROM nvidia/cuda

RUN apt-get update && \
    apt-get install -y cmake libncurses5-dev libncursesw5-dev git && \
    rm -rf /var/lib/apt/lists/*

RUN ln -s /usr/local/cuda-10.1/targets/x86_64-linux/lib/stubs/libnvidia-ml.so /usr/local/lib/libnvidia-ml.so && \
    ln -s /usr/local/cuda-10.1/targets/x86_64-linux/lib/stubs/libnvidia-ml.so /usr/local/lib/libnvidia-ml.so.1 && \
    cd /tmp && \
    git clone https://github.com/Syllo/nvtop.git && \
    mkdir -p nvtop/build && cd nvtop/build && \
    cmake .. && \
    make && \
    make install && \
    cd / && \
    rm -r /tmp/nvtop && \
    rm /usr/local/lib/libnvidia-ml.so && \
    rm /usr/local/lib/libnvidia-ml.so.1

ENTRYPOINT ["/usr/local/bin/nvtop"]

Thank you for providing your Dockerfile. I changed the base image from nvidia/cuda to nvidia/cuda:10.1-devel-ubuntu16.04 and successfully built the image, but when I run it, I get the following error:

/usr/local/bin/nvtop: error while loading shared libraries: libnvidia-ml.so.1: cannot open shared object file: No such file or directory

Edit: Oops. Forgot about --runtime nvidia.
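
i.e. something along these lines (the image name is whatever you tagged your build with):

docker run --runtime nvidia --rm -it my-nvtop-image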

VictorAtPL avatar Aug 03 '20 13:08 VictorAtPL

Trying this with CUDA 11.0 and I'm running into issues again. Now the stub files aren't present. Is there something I should be installing that I haven't?

Basically, /usr/local/cuda-11.0/targets/x86_64-linux/lib/stubs/libnvidia-ml.so doesn't exist, and I get:

CMake Error at /usr/share/cmake-3.10/Modules/FindPackageHandleStandardArgs.cmake:137 (message):
  Could NOT find NVML (missing: NVML_INCLUDE_DIRS)
Call Stack (most recent call first):
  /usr/share/cmake-3.10/Modules/FindPackageHandleStandardArgs.cmake:378 (_FPHSA_FAILURE_MESSAGE)
  cmake/modules/FindNVML.cmake:52 (FIND_PACKAGE_HANDLE_STANDARD_ARGS)
  CMakeLists.txt:31 (find_package)

Here's the Dockerfile:

ARG UBUNTU_VERSION=18.04

ARG ARCH=
ARG CUDA=11.0
FROM nvidia/cuda${ARCH:+-$ARCH}:${CUDA}-base-ubuntu${UBUNTU_VERSION} as base
# ARCH and CUDA are specified again because the FROM directive resets ARGs
# (but their default value is retained if set previously)

ARG ARCH
ARG CUDA
ARG CUDNN=8.0.4.30-1
ARG CUDNN_MAJOR_VERSION=8
ARG LIB_DIR_PREFIX=x86_64
ARG LIBNVINFER=7.1.3-1
ARG LIBNVINFER_MAJOR_VERSION=7

# Needed for string substitution
SHELL ["/bin/bash", "-c"]

RUN apt-get update && apt-get install -y --no-install-recommends \
    apt-utils \
    build-essential \
    cuda-command-line-tools-${CUDA/./-} \
    libcublas-${CUDA/./-} \
    cuda-nvrtc-${CUDA/./-} \
    libcufft-${CUDA/./-} \
    libcurand-${CUDA/./-} \
    libcusolver-${CUDA/./-} \
    libcusparse-${CUDA/./-} \
    curl \
    libcudnn8=${CUDNN}+cuda${CUDA} \
    libfreetype6-dev \
    libhdf5-serial-dev \
    libzmq3-dev \
    pkg-config \
    software-properties-common \
    unzip

# Install TensorRT if not building for PowerPC
RUN [[ "${ARCH}" = "ppc64le" ]] || { apt-get update && \
    apt-get install -y --no-install-recommends libnvinfer${LIBNVINFER_MAJOR_VERSION}=${LIBNVINFER}+cuda${CUDA} \
    libnvinfer-plugin${LIBNVINFER_MAJOR_VERSION}=${LIBNVINFER}+cuda${CUDA} \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*; }

# For CUDA profiling, TensorFlow requires CUPTI.
ENV LD_LIBRARY_PATH /usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/lib64:$LD_LIBRARY_PATH

# Link the libcuda stub to the location where tensorflow is searching for it and reconfigure
# dynamic linker run-time bindings
RUN ln -s /usr/local/cuda/lib64/stubs/libcuda.so /usr/local/cuda/lib64/stubs/libcuda.so.1 \
    && echo "/usr/local/cuda/lib64/stubs" > /etc/ld.so.conf.d/z-cuda-stubs.conf \
    && ldconfig

RUN apt-get update && apt-get install -y --no-install-recommends \
    bzip2 ca-certificates libglib2.0-0 libxext6 libsm6 libxrender1 \
    libsox-fmt-all sox libsox-dev htop python3 \
    tmux zsh vim wget git git-lfs \
    nano google-perftools unzip \
    cmake libncurses5-dev libncursesw5-dev python3-dev

# See http://bugs.python.org/issue19846
ENV LANG C.UTF-8

SHELL ["/usr/bin/zsh", "-c"]

# install nvtop
RUN ln -s /usr/local/cuda-11.0/targets/x86_64-linux/lib/stubs/libnvidia-ml.so /usr/local/lib/libnvidia-ml.so && \
    ln -s /usr/local/cuda-11.0/targets/x86_64-linux/lib/stubs/libnvidia-ml.so /usr/local/lib/libnvidia-ml.so.1 && \
    cd /tmp && \
    git clone https://github.com/Syllo/nvtop.git && \
    mkdir -p nvtop/build && cd nvtop/build && \
    cmake .. && \
    make && \
    make install && \
    cd / && \
    rm -r /tmp/nvtop && \
    rm /usr/local/lib/libnvidia-ml.so && \
    rm /usr/local/lib/libnvidia-ml.so.1

lminer avatar Nov 10 '20 21:11 lminer

@lminer As I already mentioned, nvidia/cuda:*-base-* images don't have stubs. You'll have to build with nvidia/cuda:*-devel-* or manually add stubs to the base image.
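
If in doubt, you can check whether a given image ships the stub before building on top of it (the tag here is just an example):

docker run --rm nvidia/cuda:11.0-devel-ubuntu18.04 ls /usr/local/cuda/lib64/stubs/
# devel images should list libnvidia-ml.so here; base and runtime images won't have it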

RuRo avatar Nov 10 '20 22:11 RuRo

Wow, you're right. Sorry about that. Thanks for being so patient with me.

lminer avatar Nov 10 '20 23:11 lminer

Now that this repository contains a pre-made Dockerfile, this issue should probably be closed.

qwertychouskie avatar Dec 15 '23 03:12 qwertychouskie