nerf-object-removal

Docker image doesn't work

Open Skyy93 opened this issue 1 year ago • 2 comments

Hello, thank you for your amazing work! I wanted to try it and followed the Docker instructions you provided here: https://github.com/nianticlabs/nerf-object-removal/blob/main/docker/README.md

The image builds and runs correctly, but when I try your example command I get the following message in the logs:

[2023-07-14 13:53:18,408][saicinpainting.training.trainers.base][INFO] - BaseInpaintingTrainingModule init done
[2023-07-14 13:53:18,627][__main__][CRITICAL] - Prediction failed due to Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 803: system has unsupported display driver / cuda driver combination:
Traceback (most recent call last):
  File "bin/predict.py", line 59, in main
    model.to(device)
  File "/opt/conda/envs/object-removal/lib/python3.8/site-packages/pytorch_lightning/core/decorators.py", line 89, in inner_fn
    module = fn(self, *args, **kwargs)
  File "/opt/conda/envs/object-removal/lib/python3.8/site-packages/pytorch_lightning/utilities/device_dtype_mixin.py", line 120, in to
    return super().to(*args, **kwargs)
  File "/opt/conda/envs/object-removal/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1145, in to
    return self._apply(convert)
  File "/opt/conda/envs/object-removal/lib/python3.8/site-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  File "/opt/conda/envs/object-removal/lib/python3.8/site-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  File "/opt/conda/envs/object-removal/lib/python3.8/site-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  [Previous line repeated 2 more times]
  File "/opt/conda/envs/object-removal/lib/python3.8/site-packages/torch/nn/modules/module.py", line 820, in _apply
    param_applied = fn(param)
  File "/opt/conda/envs/object-removal/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1143, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
  File "/opt/conda/envs/object-removal/lib/python3.8/site-packages/torch/cuda/__init__.py", line 247, in _lazy_init
    torch._C._cuda_init()
RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 803: system has unsupported display driver / cuda driver combination

Because of this, the following error occurs:

FileNotFoundError: [Errno 2] No such file or directory: '/app/object-removal/experiments/real/001/data/../lama_depth_output_real/000_mask001.png'

JAX also fails to find a GPU:

W0714 13:53:32.249409 140354252236608 xla_bridge.py:363] No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)

I have an RTX 4090. Inside the Docker container, the reported driver and CUDA versions are: Driver Version: 535.54.03, CUDA Version: 11.8

Could you please look into it? I also tried using a CUDA 12.0 container as the base image. With that, the PyTorch error is resolved, but the JAX warning remains, which implies JAX still does not find the GPU.
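
For anyone reproducing this, a small check along these lines, run inside the container, might help separate the PyTorch failure from the JAX one. It is only a sketch: `gpu_report` is a hypothetical helper name, not part of the repository, and it assumes `torch` and `jax` are the packages installed in the active environment (missing or failing imports are recorded rather than raised).

```python
import importlib


def gpu_report():
    """Collect what PyTorch and JAX each report about GPU visibility.

    Hypothetical diagnostic helper, not part of nerf-object-removal.
    """
    report = {
        "torch_cuda_available": None,  # None means torch could not be queried
        "torch_cuda_build": None,
        "jax_backend": None,           # None means jax could not be queried
        "jax_devices": None,
    }

    # PyTorch: is_available() normally returns False (with a warning)
    # rather than raising when CUDA initialisation fails.
    try:
        torch = importlib.import_module("torch")
        report["torch_cuda_available"] = torch.cuda.is_available()
        # CUDA version the installed wheel was built against (None for CPU builds).
        report["torch_cuda_build"] = torch.version.cuda
    except Exception as exc:
        report["torch_error"] = repr(exc)

    # JAX: default_backend() is "cpu" when no GPU/TPU backend was initialised.
    try:
        jax = importlib.import_module("jax")
        report["jax_backend"] = jax.default_backend()
        report["jax_devices"] = [str(d) for d in jax.devices()]
    except Exception as exc:
        report["jax_error"] = repr(exc)

    return report


if __name__ == "__main__":
    for key, value in gpu_report().items():
        print(f"{key}: {value}")
```

If `torch_cuda_available` is true while `jax_backend` is `cpu`, that would match the behaviour described above with the CUDA 12.0 base image, and would point at the JAX installation rather than the driver.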

Thank you

Skyy93 · Jul 14 '23 12:07