nerf-object-removal
Docker image doesn't work
Hello, thank you for your amazing work! I wanted to try it and followed the Docker instructions you provided here: https://github.com/nianticlabs/nerf-object-removal/blob/main/docker/README.md
The image builds and runs correctly, but when I try your example command I get the following message in the logs:
[2023-07-14 13:53:18,408][saicinpainting.training.trainers.base][INFO] - BaseInpaintingTrainingModule init done
[2023-07-14 13:53:18,627][__main__][CRITICAL] - Prediction failed due to Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 803: system has unsupported display driver / cuda driver combination:
Traceback (most recent call last):
File "bin/predict.py", line 59, in main
model.to(device)
File "/opt/conda/envs/object-removal/lib/python3.8/site-packages/pytorch_lightning/core/decorators.py", line 89, in inner_fn
module = fn(self, *args, **kwargs)
File "/opt/conda/envs/object-removal/lib/python3.8/site-packages/pytorch_lightning/utilities/device_dtype_mixin.py", line 120, in to
return super().to(*args, **kwargs)
File "/opt/conda/envs/object-removal/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1145, in to
return self._apply(convert)
File "/opt/conda/envs/object-removal/lib/python3.8/site-packages/torch/nn/modules/module.py", line 797, in _apply
module._apply(fn)
File "/opt/conda/envs/object-removal/lib/python3.8/site-packages/torch/nn/modules/module.py", line 797, in _apply
module._apply(fn)
File "/opt/conda/envs/object-removal/lib/python3.8/site-packages/torch/nn/modules/module.py", line 797, in _apply
module._apply(fn)
[Previous line repeated 2 more times]
File "/opt/conda/envs/object-removal/lib/python3.8/site-packages/torch/nn/modules/module.py", line 820, in _apply
param_applied = fn(param)
File "/opt/conda/envs/object-removal/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1143, in convert
return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
File "/opt/conda/envs/object-removal/lib/python3.8/site-packages/torch/cuda/__init__.py", line 247, in _lazy_init
torch._C._cuda_init()
RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 803: system has unsupported display driver / cuda driver combination
Because of this, the following error occurs:
FileNotFoundError: [Errno 2] No such file or directory: '/app/object-removal/experiments/real/001/data/../lama_depth_output_real/000_mask001.png'
JAX also fails to find a GPU:
W0714 13:53:32.249409 140354252236608 xla_bridge.py:363] No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
I have an RTX 4090. The driver and CUDA versions reported inside the Docker container are: Driver Version: 535.54.03, CUDA Version: 11.8
Could you please look into it? I also tried using a CUDA 12.0 container as the base image. That resolves the PyTorch error, but not the JAX error, which implies that JAX still does not find the GPU.
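To narrow down which framework loses the GPU, a small check like the following could be run inside the container with the `object-removal` environment active. This is only a sketch: the function name `probe_gpu_backends` is made up for illustration and is not part of the repository, and it assumes nothing beyond the public `torch.cuda` and `jax` calls shown. It reports each framework separately, so a working PyTorch with a CPU-only JAX (the situation with the CUDA 12.0 base image) would show up directly.

```python
import importlib


def probe_gpu_backends():
    """Report, per framework, whether a GPU is visible from this process.

    Hypothetical helper for debugging; not part of nerf-object-removal.
    """
    report = {}

    # PyTorch: ask whether CUDA initialises and how many devices it sees.
    try:
        torch = importlib.import_module("torch")
        built_for = torch.version.cuda  # CUDA version the wheel was built against
        if torch.cuda.is_available():
            count = torch.cuda.device_count()
            report["torch"] = f"{count} CUDA device(s), built for CUDA {built_for}"
        else:
            report["torch"] = f"no CUDA device visible, built for CUDA {built_for}"
    except Exception as exc:  # missing package or driver/runtime failure
        report["torch"] = f"check failed: {exc!r}"

    # JAX: the default backend is "cpu" when no GPU/TPU was found.
    try:
        jax = importlib.import_module("jax")
        report["jax"] = f"backend={jax.default_backend()}, devices={jax.devices()}"
    except Exception as exc:  # missing package or backend init failure
        report["jax"] = f"check failed: {exc!r}"

    return report


if __name__ == "__main__":
    for name, status in probe_gpu_backends().items():
        print(f"{name}: {status}")
```

If the `jax` line reports `backend=cpu` while the `torch` line reports a CUDA device, one possible cause is that the installed `jaxlib` is a CPU-only build or was built for a different CUDA version than the base image provides.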
Thank you