NVMLLibraryMismatchError using maniskill/base docker image
I'm trying to run ManiSkill within Docker. The official maniskill/base image throws an NVMLLibraryMismatchError. I have solved this issue before outside of docker by using CUDA 12.4. The maniskill docker image is based on nvidia/cudagl, which no longer seems to be updated by NVIDIA. Is there any other way to use ManiSkill with docker?
python -m mani_skill.examples.benchmarking.gpu_sim -e "PickCube-v1" -n 64 \
--save-video --render-mode="sensors"
--------------------------------------------------------------------------
Task ID: PickCube-v1, 64 parallel environments, sim_backend=gpu
obs_mode=state, control_mode=pd_joint_delta_pos
render_mode=sensors, sensor_details=RGBD(128x128)
sim_freq=100, control_freq=20
observation space: Box(-inf, inf, (64, 42), float32)
(single) action space: Box(-1.0, 1.0, (8,), float32)
--------------------------------------------------------------------------
start recording env.step metrics
Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/site-packages/pynvml.py", line 996, in _nvmlGetFunctionPointer
    _nvmlGetFunctionPointer_cache[name] = getattr(nvmlLib, name)
  File "/opt/conda/lib/python3.9/ctypes/__init__.py", line 395, in __getattr__
    func = self.__getitem__(name)
  File "/opt/conda/lib/python3.9/ctypes/__init__.py", line 400, in __getitem__
    func = self._FuncPtr((name_or_ordinal, self))
AttributeError: /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1: undefined symbol: nvmlDeviceGetComputeRunningProcesses_v3
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/site-packages/pynvml.py", line 2193, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/pynvml.py", line 3150, in nvmlDeviceGetComputeRunningProcesses
    return nvmlDeviceGetComputeRunningProcesses_v3(handle)
  File "/opt/conda/lib/python3.9/site-packages/pynvml.py", line 3117, in nvmlDeviceGetComputeRunningProcesses_v3
    fn = _nvmlGetFunctionPointer("nvmlDeviceGetComputeRunningProcesses_v3")
  File "/opt/conda/lib/python3.9/site-packages/pynvml.py", line 999, in _nvmlGetFunctionPointer
    raise NVMLError(NVML_ERROR_FUNCTION_NOT_FOUND)
pynvml.NVMLError_FunctionNotFound: Function Not Found
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/conda/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/opt/conda/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/opt/conda/lib/python3.9/site-packages/mani_skill/examples/benchmarking/gpu_sim.py", line 163, in
What are your system specs (GPU type, OS, nvidia-smi output)?
Re the outdated docker image, I was not aware that it is no longer maintained. I'll put up an issue about investigating that.
NVIDIA L4 GPU, Debian 11 (GCP image projects/ml-images/global/images/c0-deeplearning-common-cu121-v20231209-debian-11)
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.86.10              Driver Version: 535.86.10    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA L4                      On  | 00000000:00:03.0 Off |                    0 |
| N/A   34C    P8              12W /  72W |      4MiB / 23034MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
Do the maniskill gpu sim scripts work for you outside of docker? I want to first check whether this is a docker issue or a system issue.
E.g. install the latest maniskill (either the PyPI version mani_skill==3.0.0b19 or install from git) locally (not in docker) and run the same command:
python -m mani_skill.examples.benchmarking.gpu_sim -e "PickCube-v1" -n 64 --save-video --render-mode="sensors"
I have had some working maniskill installations before. It doesn't work outside docker on this VM with CUDA 12.2. Let me check again with CUDA 12.4.
It works outside of docker with CUDA 12.4 (GCP source image projects/ml-images/global/images/c0-deeplearning-common-cu124-v20241224-debian-11). But now the nvidia drivers don't work at all within docker:
python -m mani_skill.examples.benchmarking.gpu_sim -e "PickCube-v1" -n 64 \
--save-video --render-mode="sensors"
Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/site-packages/pynvml.py", line 2248, in _LoadNvmlLibrary
    nvmlLib = CDLL("libnvidia-ml.so.1")
  File "/opt/conda/lib/python3.9/ctypes/__init__.py", line 382, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/conda/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/opt/conda/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/opt/conda/lib/python3.9/site-packages/mani_skill/examples/benchmarking/gpu_sim.py", line 163, in
Ok, I am setting this up myself now. It is not working on my own machine but magically works on my lab's compute cluster..?
I will investigate this a bit more this week
Ok, re the pynvml error: that is probably because the docker container is not even detecting the GPU.
Can you run the following to check?
docker run --rm -it --gpus all maniskill/base nvidia-smi
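It can also help to look inside the container to see whether the driver libraries were injected at all; they are mounted by the NVIDIA container runtime rather than shipped in the image. A rough sketch, assuming the standard x86_64 library path:

docker run --rm -it --gpus all maniskill/base \
    bash -c "ldconfig -p | grep libnvidia-ml; ls -l /usr/lib/x86_64-linux-gnu/libnvidia-ml.so*"

If nothing shows up there, the runtime is not mounting the driver into the container.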
If the nvidia-smi check doesn't work, can you try running through the https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html install guide, make sure all the necessary commands are run, and then check again?
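On an apt-based host, the core steps from that guide are roughly the following; this is only a sketch, and the repository setup step (and other distros) are covered in the guide itself:

sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
docker info | grep -i runtimes   # should list "nvidia" once the runtime is registered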
If it still doesn't work, this fixed it for my local machine: "sudo vim /etc/nvidia-container-runtime/config.toml, then changed no-cgroups = false" (from https://stackoverflow.com/questions/72932940/failed-to-initialize-nvml-unknown-error-in-docker-after-few-hours).
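Equivalently, as a one-liner instead of editing the file by hand (this assumes the key is currently set to true, as in the linked answer):

sudo sed -i 's/^no-cgroups = true/no-cgroups = false/' /etc/nvidia-container-runtime/config.toml
sudo systemctl restart docker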
The original image should work, and errors are likely due to some faulty docker and nvidia container toolkit setup.
That being said, I have working versions of the docker image built against the updated nvidia cuda images and will probably publish a few of those under their own tags, like
maniskill/base:cuda-12.2.2-ubuntu22.04 or something (similar to the nvidia cuda image naming scheme). This way, even if your cuda version is different/not supported, you can still pick one of the images, and I'll add docs on how to easily modify an image for different cuda / linux variants etc.
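As a rough sketch of the idea (not the final Dockerfile; the base tag, package list, and pinned mani_skill version here are placeholders you may need to adjust):

cat > Dockerfile <<'EOF'
FROM nvidia/cuda:12.2.2-base-ubuntu22.04
# system dependencies for python and for Vulkan/EGL rendering
RUN apt-get update && apt-get install -y --no-install-recommends \
        python3 python3-pip libvulkan1 libglvnd0 libegl1 libxext6 libx11-6 && \
    rm -rf /var/lib/apt/lists/*
# ask the NVIDIA container runtime to expose all driver capabilities (compute + graphics)
ENV NVIDIA_DRIVER_CAPABILITIES=all
RUN pip3 install --no-cache-dir mani_skill==3.0.0b19
EOF
docker build -t maniskill/base:cuda-12.2.2-ubuntu22.04 .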
It works with the --gpus flag. Thank you. Although I'm still a bit confused since my setup worked for other docker containers using the GPU without this flag. Can you share the Dockerfile for your updated image? Are you using a different base image now?
That's interesting that you can run things without that flag; the nvidia container toolkit docs say you have to.
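One possible explanation (an assumption about your setup, worth checking): the docker daemon may be configured with the nvidia runtime as its default, in which case every container gets the GPU libraries even without --gpus:

cat /etc/docker/daemon.json
# if it contains "default-runtime": "nvidia", containers use the nvidia runtime by default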
I will create a new issue for the docker images that use the newer cuda/nvidia base images. The current maniskill/base docker image will be kept, as it should still work on most systems; it is still the one we use most frequently.