
neuralangelo docker run issue - WSL2 + Ubuntu 22.04.3 LTS

Open altava-sgp opened this issue 1 year ago • 5 comments

Previous task

https://github.com/NVlabs/neuralangelo/issues/29

New challenge

nvidia-smi
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.86.05              Driver Version: 536.67       CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4090        On  | 00000000:01:00.0  On |                  Off |
|  0%   38C    P8              31W / 450W |   2511MiB / 24564MiB |      5%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A        20      G   /Xwayland                                 N/A      |
+---------------------------------------------------------------------------------------+
nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Thu_Nov_18_09:45:30_PST_2021
Cuda compilation tools, release 11.5, V11.5.119
Build cuda_11.5.r11.5/compiler.30672275_0

After preparing for the task, I ran the command below. ( https://github.com/NVlabs/neuralangelo/issues/29#issuecomment-1681620547 )

docker run --gpus all --ipc=host -it docker.io/chenhsuanlin/neuralangelo:23.04-py3 /bin/bash

I got this error.

docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: mount error: file creation failed: /var/lib/docker/overlay2/82c5e50ff0e48ed123838e5e76244fd4306e9332133af2fb4354f03883824ea7/merged/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1: file exists: unknown.
ERRO[0000] error waiting for container:

Can I get any help?

@iam-machine Did you solve this issue? ( https://github.com/NVlabs/neuralangelo/issues/29#issuecomment-1686849719 )

altava-sgp avatar Aug 22 '23 04:08 altava-sgp

Hey @altava-sgp, I got the docker container to run by removing these files (on WSL2 the NVIDIA container runtime mounts the host's copies of these driver libraries into the container, and that mount fails if the image already ships them):

FROM chenhsuanlin/neuralangelo:23.04-py3

RUN rm -rf /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1 /usr/lib/x86_64-linux-gnu/libcuda.so.1 /usr/lib/x86_64-linux-gnu/libcudadebugger.so.1 /usr/lib/x86_64-linux-gnu/libnvidia-encode.so.1 /usr/lib/x86_64-linux-gnu/libnvidia-opticalflow.so.1 /usr/lib/x86_64-linux-gnu/libnvcuvid.so.1 

Then just run docker build -t fix-neuralangelo:1.0 .

This worked for me. Let me know if it works for you.
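
If it helps, a quick sanity check inside the rebuilt container would be something like this (nothing project-specific, just checking that the host-mounted driver libraries are picked up; the library path is the one from the error above):

nvidia-smi
ls -l /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1

nvidia-smi should report the same driver and CUDA versions as on the WSL2 host, and the library should now be the copy injected by the NVIDIA container runtime rather than the one that was baked into the image.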

thomasbernhard-dev avatar Aug 22 '23 07:08 thomasbernhard-dev

@thomasbernhard-dev I edited this file ( https://github.com/NVlabs/neuralangelo/blob/main/docker/Dockerfile-neuralangelo ) and added one line as you suggested:

RUN rm -rf /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1 /usr/lib/x86_64-linux-gnu/libcuda.so.1 /usr/lib/x86_64-linux-gnu/libcudadebugger.so.1 /usr/lib/x86_64-linux-gnu/libnvidia-encode.so.1 /usr/lib/x86_64-linux-gnu/libnvidia-opticalflow.so.1 /usr/lib/x86_64-linux-gnu/libnvcuvid.so.1

Then I ran this command:

docker build -t docker.io/chenhsuanlin/neuralangelo:23.04-py3 -f docker/Dockerfile-neuralangelo .

After a long build it succeeded! I ran this command again (I used the IMAGE ID):

docker run --gpus all --ipc=host -it 75a3d4706291 /bin/bash

It works.

root@altava-farer:~/neuralangelo# docker build -t docker.io/chenhsuanlin/neuralangelo:23.04-py3 -f docker/Dockerfile-neuralangelo .
[+] Building 1647.4s (12/12) FINISHED                                                                                     docker:default
 => [internal] load .dockerignore                                                                                                   0.1s
 => => transferring context: 2B                                                                                                     0.0s
 => [internal] load build definition from Dockerfile-neuralangelo                                                                   0.1s
 => => transferring dockerfile: 1.10kB                                                                                              0.0s
 => [internal] load metadata for nvcr.io/nvidia/pytorch:23.04-py3                                                                   3.5s
 => [1/7] FROM nvcr.io/nvidia/pytorch:23.04-py3@sha256:5dd0caf52947719ba4fc170e779cfb20a5ecac7c91ca530f2884ed35fb97005f             2.5s
 => => resolve nvcr.io/nvidia/pytorch:23.04-py3@sha256:5dd0caf52947719ba4fc170e779cfb20a5ecac7c91ca530f2884ed35fb97005f             0.0s
 => => sha256:09567441e1c039661761a6970ec213e341aafa9767fd86e58943dde10cda2960 10.20kB / 10.20kB                                    0.0s
 => => sha256:5dd0caf52947719ba4fc170e779cfb20a5ecac7c91ca530f2884ed35fb97005f 686B / 686B                                          0.0s
 => => sha256:b4428941db4ff8324b1dc65a352d2a9c7d26bd59b2fec322fe103c74fa7eac65 44.61kB / 44.61kB                                    0.0s
 => [internal] load build context                                                                                                   0.1s
 => => transferring context: 413B                                                                                                   0.0s
 => [2/7] RUN rm -rf /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1 /usr/lib/x86_64-linux-gnu/libcuda.so.1 /usr/lib/x86_64-linux-gnu/  0.4s
 => [3/7] RUN apt-get update && apt-get install -y --no-install-recommends     build-essential     bzip2     ca-certificates       92.1s
 => [4/7] RUN pip install --upgrade pip                                                                                             3.5s
 => [5/7] RUN pip install --upgrade     flake8     pre-commit                                                                      13.8s
 => [6/7] COPY requirements.txt requirements.txt                                                                                    0.1s
 => [7/7] RUN pip install --upgrade -r requirements.txt                                                                          1524.7s
 => exporting to image                                                                                                              6.7s
 => => exporting layers                                                                                                             6.7s
 => => writing image sha256:75a3d47062917ac32cd079fcac96cd881441bc343864d5b23faca9ab2bc3717f                                        0.0s
 => => naming to docker.io/chenhsuanlin/neuralangelo:23.04-py3                                                                      0.0s
root@altava-farer:~/neuralangelo#
root@altava-farer:~/neuralangelo# docker images
REPOSITORY                  TAG         IMAGE ID       CREATED         SIZE
chenhsuanlin/neuralangelo   23.04-py3   75a3d4706291   6 minutes ago   23GB
chenhsuanlin/neuralangelo   <none>      53fe4b1ac32d   9 days ago      23GB
root@altava-farer:~/neuralangelo#
root@altava-farer:~/neuralangelo# docker run --gpus all --ipc=host -it 75a3d4706291 /bin/bash

=============
== PyTorch ==
=============

NVIDIA Release 23.04 (build 58180998)
PyTorch Version 2.1.0a0+fe05266

Container image Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

Copyright (c) 2014-2023 Facebook Inc.
Copyright (c) 2011-2014 Idiap Research Institute (Ronan Collobert)
Copyright (c) 2012-2014 Deepmind Technologies    (Koray Kavukcuoglu)
Copyright (c) 2011-2012 NEC Laboratories America (Koray Kavukcuoglu)
Copyright (c) 2011-2013 NYU                      (Clement Farabet)
Copyright (c) 2006-2010 NEC Laboratories America (Ronan Collobert, Leon Bottou, Iain Melvin, Jason Weston)
Copyright (c) 2006      Idiap Research Institute (Samy Bengio)
Copyright (c) 2001-2004 Idiap Research Institute (Ronan Collobert, Samy Bengio, Johnny Mariethoz)
Copyright (c) 2015      Google Inc.
Copyright (c) 2015      Yangqing Jia
Copyright (c) 2013-2016 The Caffe contributors
All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

root@c8390ff32f45:/workspace#
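
For reference, if you also want the checked-out repo visible inside the container, a bind-mount variant of the run command should work (the -v flag is standard Docker; the target path is just an example):

docker run --gpus all --ipc=host -v $PWD:/workspace/neuralangelo -it 75a3d4706291 /bin/bash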

Thanks a lot! I can try the next step.

altava-sgp avatar Aug 22 '23 08:08 altava-sgp

I got the same error as this one. 😢 https://github.com/NVlabs/neuralangelo/issues/29#issuecomment-1681631297

altava-sgp avatar Aug 22 '23 08:08 altava-sgp

@altava-sgp I only made it to the point where it throws this error:

terminate called after throwing an instance of 'c10::Error' what(): CUDA error: out of memory

I think the model is too big for my GPU, so I do not know whether I would get the same error message as you. But did you install the CUDA toolkit for WSL ( https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&Distribution=WSL-Ubuntu&target_version=2.0&target_type=deb_local ), and run apt install -y cuda-toolkit inside your neuralangelo container?
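
Roughly, the sequence I mean is the one below. The WSL-side repository setup comes from that download page, so follow the exact commands listed there; the in-container part assumes the CUDA apt repository is reachable from inside the image:

# on the WSL2 Ubuntu side, after the repo setup from the page above
sudo apt-get update
sudo apt-get install -y cuda-toolkit

# inside the neuralangelo container
apt-get update
apt-get install -y cuda-toolkit
nvcc --version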

thomasbernhard-dev avatar Aug 22 '23 13:08 thomasbernhard-dev

@thomasbernhard-dev

I tried what you suggested. I installed CUDA Toolkit 12.2 on both WSL2 Ubuntu and in the container (apt install -y cuda-toolkit).

I got the same error. 😢

.
.
.
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
.
.
.
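
As the message itself suggests, re-running with synchronous kernel launches should at least give an accurate stack trace; a minimal sketch (re-using whatever training command produced the error above):

export CUDA_LAUNCH_BLOCKING=1
# then re-run the same training command that produced the error above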

I may try pure Ubuntu too.

altava-sgp avatar Aug 23 '23 01:08 altava-sgp