neuralangelo docker run issue - WSL2 + Ubuntu 22.04.3 LTS
Previous task
https://github.com/NVlabs/neuralangelo/issues/29
New challenge
nvidia-smi
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.86.05 Driver Version: 536.67 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 4090 On | 00000000:01:00.0 On | Off |
| 0% 38C P8 31W / 450W | 2511MiB / 24564MiB | 5% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 20 G /Xwayland N/A |
+---------------------------------------------------------------------------------------+
nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Thu_Nov_18_09:45:30_PST_2021
Cuda compilation tools, release 11.5, V11.5.119
Build cuda_11.5.r11.5/compiler.30672275_0
After preparing for the task, I ran the command below. ( https://github.com/NVlabs/neuralangelo/issues/29#issuecomment-1681620547 )
docker run --gpus all --ipc=host -it docker.io/chenhsuanlin/neuralangelo:23.04-py3 /bin/bash
I got this error.
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: mount error: file creation failed: /var/lib/docker/overlay2/82c5e50ff0e48ed123838e5e76244fd4306e9332133af2fb4354f03883824ea7/merged/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1: file exists: unknown.
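(For context, my understanding of this error, not stated in the thread: the NVIDIA container runtime bind-mounts the host's driver libraries into the container at start-up, and the mount fails with "file exists" when the image already ships one of those files as a regular file. A small sketch listing the libraries that typically conflict — the list is an assumption inferred from the error message and the fix below:)

```shell
# The NVIDIA runtime injects these host driver libraries into the container.
# If the image already contains them as regular files, the bind mount fails
# with "file exists". The exact set below is an assumption, not authoritative.
libdir="/usr/lib/x86_64-linux-gnu"
libs="libnvidia-ml.so.1 libcuda.so.1 libcudadebugger.so.1 libnvidia-encode.so.1 libnvidia-opticalflow.so.1 libnvcuvid.so.1"
for lib in $libs; do
  echo "$libdir/$lib"
done
```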
ERRO[0000] error waiting for container:
Can I get any help?
@iam-machine Did you solve this issue? ( https://github.com/NVlabs/neuralangelo/issues/29#issuecomment-1686849719 )
Hey @altava-sgp, I made the docker container run by removing these files:
FROM chenhsuanlin/neuralangelo:23.04-py3
RUN rm -rf /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1 /usr/lib/x86_64-linux-gnu/libcuda.so.1 /usr/lib/x86_64-linux-gnu/libcudadebugger.so.1 /usr/lib/x86_64-linux-gnu/libnvidia-encode.so.1 /usr/lib/x86_64-linux-gnu/libnvidia-opticalflow.so.1 /usr/lib/x86_64-linux-gnu/libnvcuvid.so.1
Then just run docker build -t fix-neuralangelo:1.0 .
This worked for me. Let me know if it works for you.
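A minimal end-to-end sketch of this workaround, assuming the two-line Dockerfile above is saved in the current directory (the docker commands are commented out so the snippet is safe to paste):

```shell
# Build the patched image from the Dockerfile above, then run it with GPU
# access. Uncomment the docker lines to actually execute them.
image="fix-neuralangelo:1.0"
# docker build -t "$image" .
# docker run --gpus all --ipc=host -it "$image" /bin/bash
echo "image tag: $image"
```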
@thomasbernhard-dev I edited this file ( https://github.com/NVlabs/neuralangelo/blob/main/docker/Dockerfile-neuralangelo ) and added one line as you suggested.
RUN rm -rf /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1 /usr/lib/x86_64-linux-gnu/libcuda.so.1 /usr/lib/x86_64-linux-gnu/libcudadebugger.so.1 /usr/lib/x86_64-linux-gnu/libnvidia-encode.so.1 /usr/lib/x86_64-linux-gnu/libnvidia-opticalflow.so.1 /usr/lib/x86_64-linux-gnu/libnvcuvid.so.1
I ran this command:
docker build -t docker.io/chenhsuanlin/neuralangelo:23.04-py3 -f docker/Dockerfile-neuralangelo .
After a long build time it succeeded! I ran this command again. ( I used the IMAGE ID )
docker run --gpus all --ipc=host -it 75a3d4706291 /bin/bash
It works.
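For a quick non-interactive check from the host that the rebuilt image really gets GPU access, something like the following should work (the image ID is the one from the build output above; the docker lines are commented out so the sketch is safe to paste):

```shell
# Hedged sanity checks: run nvidia-smi and a PyTorch CUDA probe inside the
# rebuilt image without opening an interactive shell. Uncomment to execute.
cmd='docker run --rm --gpus all 75a3d4706291 nvidia-smi'
# $cmd
# docker run --rm --gpus all 75a3d4706291 \
#   python -c "import torch; print(torch.cuda.is_available())"
echo "$cmd"
```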
root@altava-farer:~/neuralangelo# docker build -t docker.io/chenhsuanlin/neuralangelo:23.04-py3 -f docker/Dockerfile-neuralangelo .
[+] Building 1647.4s (12/12) FINISHED docker:default
=> [internal] load .dockerignore 0.1s
=> => transferring context: 2B 0.0s
=> [internal] load build definition from Dockerfile-neuralangelo 0.1s
=> => transferring dockerfile: 1.10kB 0.0s
=> [internal] load metadata for nvcr.io/nvidia/pytorch:23.04-py3 3.5s
=> [1/7] FROM nvcr.io/nvidia/pytorch:23.04-py3@sha256:5dd0caf52947719ba4fc170e779cfb20a5ecac7c91ca530f2884ed35fb97005f 2.5s
=> => resolve nvcr.io/nvidia/pytorch:23.04-py3@sha256:5dd0caf52947719ba4fc170e779cfb20a5ecac7c91ca530f2884ed35fb97005f 0.0s
=> => sha256:09567441e1c039661761a6970ec213e341aafa9767fd86e58943dde10cda2960 10.20kB / 10.20kB 0.0s
=> => sha256:5dd0caf52947719ba4fc170e779cfb20a5ecac7c91ca530f2884ed35fb97005f 686B / 686B 0.0s
=> => sha256:b4428941db4ff8324b1dc65a352d2a9c7d26bd59b2fec322fe103c74fa7eac65 44.61kB / 44.61kB 0.0s
=> [internal] load build context 0.1s
=> => transferring context: 413B 0.0s
=> [2/7] RUN rm -rf /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1 /usr/lib/x86_64-linux-gnu/libcuda.so.1 /usr/lib/x86_64-linux-gnu/ 0.4s
=> [3/7] RUN apt-get update && apt-get install -y --no-install-recommends build-essential bzip2 ca-certificates 92.1s
=> [4/7] RUN pip install --upgrade pip 3.5s
=> [5/7] RUN pip install --upgrade flake8 pre-commit 13.8s
=> [6/7] COPY requirements.txt requirements.txt 0.1s
=> [7/7] RUN pip install --upgrade -r requirements.txt 1524.7s
=> exporting to image 6.7s
=> => exporting layers 6.7s
=> => writing image sha256:75a3d47062917ac32cd079fcac96cd881441bc343864d5b23faca9ab2bc3717f 0.0s
=> => naming to docker.io/chenhsuanlin/neuralangelo:23.04-py3 0.0s
root@altava-farer:~/neuralangelo#
root@altava-farer:~/neuralangelo# docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
chenhsuanlin/neuralangelo 23.04-py3 75a3d4706291 6 minutes ago 23GB
chenhsuanlin/neuralangelo <none> 53fe4b1ac32d 9 days ago 23GB
root@altava-farer:~/neuralangelo#
root@altava-farer:~/neuralangelo# docker run --gpus all --ipc=host -it 75a3d4706291 /bin/bash
=============
== PyTorch ==
=============
NVIDIA Release 23.04 (build 58180998)
PyTorch Version 2.1.0a0+fe05266
Container image Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Copyright (c) 2014-2023 Facebook Inc.
Copyright (c) 2011-2014 Idiap Research Institute (Ronan Collobert)
Copyright (c) 2012-2014 Deepmind Technologies (Koray Kavukcuoglu)
Copyright (c) 2011-2012 NEC Laboratories America (Koray Kavukcuoglu)
Copyright (c) 2011-2013 NYU (Clement Farabet)
Copyright (c) 2006-2010 NEC Laboratories America (Ronan Collobert, Leon Bottou, Iain Melvin, Jason Weston)
Copyright (c) 2006 Idiap Research Institute (Samy Bengio)
Copyright (c) 2001-2004 Idiap Research Institute (Ronan Collobert, Samy Bengio, Johnny Mariethoz)
Copyright (c) 2015 Google Inc.
Copyright (c) 2015 Yangqing Jia
Copyright (c) 2013-2016 The Caffe contributors
All rights reserved.
Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
root@c8390ff32f45:/workspace#
Thanks a lot! I can try the next step.
I got the same error as this one. 😢 https://github.com/NVlabs/neuralangelo/issues/29#issuecomment-1681631297
@altava-sgp I only got to the point where it throws this error:
terminate called after throwing an instance of 'c10::Error' what(): CUDA error: out of memory
I think the model is too big for my GPU, so I don't know whether I would get the same error message. But did you install https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&Distribution=WSL-Ubuntu&target_version=2.0&target_type=deb_local for WSL, and run apt install -y cuda-toolkit inside your neuralangelo container?
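For anyone following along, the WSL-Ubuntu page linked above walks through adding NVIDIA's package repo and installing the toolkit; a hedged sketch of that flow (the keyring filename changes over time, so copy the exact commands from the download page rather than these):

```shell
# Sketch of the WSL-Ubuntu CUDA toolkit install (network-repo variant).
# The URL and keyring filename are assumptions; prefer the commands shown
# on the CUDA downloads page linked above.
# wget https://developer.download.nvidia.com/compute/cuda/repos/wsl-ubuntu/x86_64/cuda-keyring_1.1-1_all.deb
# sudo dpkg -i cuda-keyring_1.1-1_all.deb
# sudo apt-get update && sudo apt-get install -y cuda-toolkit
# Verify afterwards whether nvcc is on PATH (it may live in /usr/local/cuda/bin):
if command -v nvcc >/dev/null 2>&1; then nvcc_status="installed"; else nvcc_status="missing"; fi
echo "nvcc: $nvcc_status"
```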
@thomasbernhard-dev
I tried what you suggested.
I installed CUDA Toolkit 12.2 on both WSL2 Ubuntu and in the container, and ran apt install -y cuda-toolkit in the container too.
I got the same error. 😢
...
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
...
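As the log itself suggests, the device-side assert is easier to localize with synchronous kernel launches; a small sketch (the training command is hypothetical, substitute the actual one):

```shell
# Make CUDA kernel launches synchronous so the Python stack trace points at
# the op that actually triggered the device-side assert.
export CUDA_LAUNCH_BLOCKING=1
# python train.py ...   # hypothetical entry point; use the real training command
echo "CUDA_LAUNCH_BLOCKING=$CUDA_LAUNCH_BLOCKING"
```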
I may try pure Ubuntu too.