
NVMLLibraryMismatchError using maniskill/base docker image

Open · AlexanderKoch-Koch opened this issue 10 months ago · 9 comments

I'm trying to run ManiSkill within Docker, but the official maniskill/base image throws an NVMLLibraryMismatchError. I have solved this issue before outside of Docker by using CUDA 12.4. The ManiSkill Docker image is based on nvidia/cudagl, which no longer seems to be maintained by NVIDIA. Is there any other way to use ManiSkill with Docker?

```
python -m mani_skill.examples.benchmarking.gpu_sim -e "PickCube-v1" -n 64 \
    --save-video --render-mode="sensors"
```

```
--------------------------------------------------------------------------
Task ID: PickCube-v1, 64 parallel environments, sim_backend=gpu
obs_mode=state, control_mode=pd_joint_delta_pos, render_mode=sensors, sensor_details=RGBD(128x128)
sim_freq=100, control_freq=20
observation space: Box(-inf, inf, (64, 42), float32)
(single) action space: Box(-1.0, 1.0, (8,), float32)
--------------------------------------------------------------------------
start recording env.step metrics
Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/site-packages/pynvml.py", line 996, in _nvmlGetFunctionPointer
    _nvmlGetFunctionPointer_cache[name] = getattr(nvmlLib, name)
  File "/opt/conda/lib/python3.9/ctypes/__init__.py", line 395, in __getattr__
    func = self.__getitem__(name)
  File "/opt/conda/lib/python3.9/ctypes/__init__.py", line 400, in __getitem__
    func = self._FuncPtr((name_or_ordinal, self))
AttributeError: /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1: undefined symbol: nvmlDeviceGetComputeRunningProcesses_v3

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/site-packages/pynvml.py", line 2193, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/pynvml.py", line 3150, in nvmlDeviceGetComputeRunningProcesses
    return nvmlDeviceGetComputeRunningProcesses_v3(handle)
  File "/opt/conda/lib/python3.9/site-packages/pynvml.py", line 3117, in nvmlDeviceGetComputeRunningProcesses_v3
    fn = _nvmlGetFunctionPointer("nvmlDeviceGetComputeRunningProcesses_v3")
  File "/opt/conda/lib/python3.9/site-packages/pynvml.py", line 999, in _nvmlGetFunctionPointer
    raise NVMLError(NVML_ERROR_FUNCTION_NOT_FOUND)
pynvml.NVMLError_FunctionNotFound: Function Not Found

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.9/site-packages/mani_skill/examples/benchmarking/gpu_sim.py", line 163, in <module>
    main(parse_args())
  File "/opt/conda/lib/python3.9/site-packages/mani_skill/examples/benchmarking/gpu_sim.py", line 60, in main
    with profiler.profile("env.step", total_steps=N, num_envs=num_envs):
  File "/opt/conda/lib/python3.9/contextlib.py", line 119, in __enter__
    return next(self.gen)
  File "/opt/conda/lib/python3.9/site-packages/mani_skill/examples/benchmarking/profiling.py", line 87, in profile
    gpu_mem_use = self.get_current_process_gpu_memory()
  File "/opt/conda/lib/python3.9/site-packages/mani_skill/examples/benchmarking/profiling.py", line 116, in get_current_process_gpu_memory
    processes = pynvml.nvmlDeviceGetComputeRunningProcesses(self.handle)
  File "/opt/conda/lib/python3.9/site-packages/pynvml.py", line 2195, in wrapper
    raise NVMLLibraryMismatchError("Unversioned function called and the "
pynvml.NVMLLibraryMismatchError: Unversioned function called and the pyNVML version does not match the NVML lib version. Either use matching pyNVML and NVML lib versions or use a versioned function such as nvmlDeviceGetComputeRunningProcesses_v2
```
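For reference, the mismatch can be reproduced without ManiSkill using a few lines of pynvml. The snippet below is only a diagnostic sketch based on the calls named in the traceback, not part of the benchmark script:

```python
# Diagnostic sketch only: checks whether the pyNVML bindings in the container match
# the NVML library they load. The function names come from the traceback above.
import pynvml

pynvml.nvmlInit()
print("Driver version:", pynvml.nvmlSystemGetDriverVersion())
print("NVML version:", pynvml.nvmlSystemGetNVMLVersion())

handle = pynvml.nvmlDeviceGetHandleByIndex(0)
try:
    # Unversioned call: this is what ManiSkill's profiler uses and what raises
    # NVMLLibraryMismatchError when pyNVML expects a newer libnvidia-ml than the
    # one visible inside the container.
    print(pynvml.nvmlDeviceGetComputeRunningProcesses(handle))
except Exception as err:
    print("unversioned call failed:", err)
    # Versioned fallback suggested by the error message.
    print(pynvml.nvmlDeviceGetComputeRunningProcesses_v2(handle))

pynvml.nvmlShutdown()
```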

AlexanderKoch-Koch · Mar 06 '25 00:03

What are your system specs (GPU type, OS, nvidia-smi output)?

Re the outdated Docker image: I was not aware it is no longer maintained. I'll open an issue about investigating that.

StoneT2000 · Mar 06 '25 00:03

NVIDIA L4 GPU, Debian 11 (GCP image projects/ml-images/global/images/c0-deeplearning-common-cu121-v20231209-debian-11)

```
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.86.10              Driver Version: 535.86.10    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA L4                      On  | 00000000:00:03.0 Off |                    0 |
| N/A   34C    P8              12W /  72W |      4MiB / 23034MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
```

AlexanderKoch-Koch · Mar 06 '25 00:03

Do the ManiSkill GPU sim scripts work for you outside of Docker? I want to first check whether this is a Docker issue or a system issue.

E.g., install the latest ManiSkill locally (not in Docker), either the PyPI version (mani_skill==3.0.0b19) or an install from git, and run the same command:

```
python -m mani_skill.examples.benchmarking.gpu_sim -e "PickCube-v1" -n 64 --save-video --render-mode="sensors"
```
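If the benchmark script keeps failing inside the profiler, another way to sanity-check just the GPU sim (bypassing pynvml entirely) is a plain Gymnasium loop. The snippet below is only a rough sketch based on the env settings printed in the benchmark output above; the exact gym.make keyword arguments are assumptions, not an official reproduction script:

```python
# Rough GPU-sim smoke test that avoids the benchmarking profiler (and thus pynvml).
# The env id and kwargs mirror the benchmark output earlier in this thread; treat the
# keyword names as assumptions about the ManiSkill 3 gym.make interface.
import gymnasium as gym
import mani_skill.envs  # noqa: F401  (registers the ManiSkill environments)

env = gym.make(
    "PickCube-v1",
    num_envs=64,                     # more than one env requests the GPU-parallel simulator
    obs_mode="state",
    control_mode="pd_joint_delta_pos",
)
obs, info = env.reset(seed=0)
for _ in range(100):
    # Random batched actions are enough to exercise the simulator.
    obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
env.close()
print("GPU sim stepped fine")
```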

StoneT2000 · Mar 06 '25 00:03

I have had working ManiSkill installations before, but it doesn't work outside of Docker on this VM with CUDA 12.2 either. Let me check again with CUDA 12.4.

AlexanderKoch-Koch · Mar 06 '25 00:03

It works outside of Docker with CUDA 12.4 (GCP source image projects/ml-images/global/images/c0-deeplearning-common-cu124-v20241224-debian-11). But now the NVIDIA drivers don't work at all within Docker:

```
python -m mani_skill.examples.benchmarking.gpu_sim -e "PickCube-v1" -n 64 \
    --save-video --render-mode="sensors"
```

```
Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/site-packages/pynvml.py", line 2248, in _LoadNvmlLibrary
    nvmlLib = CDLL("libnvidia-ml.so.1")
  File "/opt/conda/lib/python3.9/ctypes/__init__.py", line 382, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: libnvidia-ml.so.1: cannot open shared object file: No such file or directory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.9/site-packages/mani_skill/examples/benchmarking/gpu_sim.py", line 163, in <module>
    main(parse_args())
  File "/opt/conda/lib/python3.9/site-packages/mani_skill/examples/benchmarking/gpu_sim.py", line 18, in main
    profiler = Profiler(output_format="stdout")
  File "/opt/conda/lib/python3.9/site-packages/mani_skill/examples/benchmarking/profiling.py", line 32, in __init__
    pynvml.nvmlInit()
  File "/opt/conda/lib/python3.9/site-packages/pynvml.py", line 2220, in nvmlInit
    nvmlInitWithFlags(0)
  File "/opt/conda/lib/python3.9/site-packages/pynvml.py", line 2203, in nvmlInitWithFlags
    _LoadNvmlLibrary()
  File "/opt/conda/lib/python3.9/site-packages/pynvml.py", line 2250, in _LoadNvmlLibrary
    _nvmlCheckReturn(NVML_ERROR_LIBRARY_NOT_FOUND)
  File "/opt/conda/lib/python3.9/site-packages/pynvml.py", line 979, in _nvmlCheckReturn
    raise NVMLError(ret)
pynvml.NVMLError_LibraryNotFound: NVML Shared Library Not Found
```
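For what it's worth, a quick way to tell this apart from the earlier mismatch error is to check from inside the container whether libnvidia-ml.so.1 is visible at all. The snippet below is only a diagnostic sketch:

```python
# Sketch: check whether the NVIDIA driver's NVML library is visible inside the container.
# If it cannot be found or loaded, the container most likely was not started with GPU
# access (e.g. missing --gpus all / NVIDIA container runtime), rather than a version mismatch.
from ctypes import CDLL
from ctypes.util import find_library

path = find_library("nvidia-ml")
print("libnvidia-ml found at:", path)
try:
    CDLL(path or "libnvidia-ml.so.1")  # raises OSError if the library cannot be loaded
    print("libnvidia-ml.so.1 loaded OK")
except OSError as err:
    print("could not load libnvidia-ml.so.1:", err)
```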

AlexanderKoch-Koch · Mar 06 '25 01:03

OK, I am setting this up myself now. It is not working on my own machine but, oddly, it works on my lab's compute cluster.

I will investigate this a bit more this week.

StoneT2000 · Mar 06 '25 01:03

OK, re the pynvml error: that is probably because the Docker container is not even detecting the GPU.

Can you run the following to check:

```
docker run --rm -it --gpus all maniskill/base nvidia-smi
```

If that doesn't work, can you go through the NVIDIA Container Toolkit install guide (https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html), make sure all the necessary commands are run, and then check again.

If it still doesn't work, this fixed it for my local machine: edit /etc/nvidia-container-runtime/config.toml (e.g. `sudo vim /etc/nvidia-container-runtime/config.toml`) and set `no-cgroups = false`, per https://stackoverflow.com/questions/72932940/failed-to-initialize-nvml-unknown-error-in-docker-after-few-hours

The original image should work; the errors are likely due to a faulty Docker / NVIDIA Container Toolkit setup.

That being said, I have working versions of the Docker image built against the updated nvidia/cuda images and will probably publish a few of those under their own tags, e.g. maniskill/base:cuda-12.2.2-ubuntu22.04 or something similar to the nvidia/cuda image naming scheme. That way, even if your CUDA version is different or not supported, you can still pick one of the images. I'll also add docs on how to easily modify an image for different CUDA / Linux variants etc.

StoneT2000 · Mar 06 '25 07:03

It works with the --gpus flag, thank you. Although I'm still a bit confused, since my setup worked for other Docker containers using the GPU without this flag. Can you share the Dockerfile for your updated image? Are you using a different base image now?

AlexanderKoch-Koch · Mar 06 '25 15:03

That's interesting that you can run things without that flag; the NVIDIA Container Toolkit docs say you have to use it.

I will create a new issue for the Docker images that use the newer cuda/nvidia base images. The current Docker image, maniskill/base, will be kept, as it should still work on most systems and is still the one we use most frequently.

StoneT2000 · Mar 06 '25 19:03