"Unable to locate package nvidia-container-toolkit" on Debian (Ubuntu) x86_64

Open iamdempa opened this issue 1 year ago • 3 comments

Hi Team,

Nice work and appreciate your efforts on this project 🫡

I am trying to run the Docker container, and I hit the following issue when executing this command:

sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit-base

Hit:1 http://eu-central-1.ec2.archive.ubuntu.com/ubuntu jammy InRelease
Hit:2 http://eu-central-1.ec2.archive.ubuntu.com/ubuntu jammy-updates InRelease
Hit:3 http://eu-central-1.ec2.archive.ubuntu.com/ubuntu jammy-backports InRelease
Hit:4 https://download.docker.com/linux/ubuntu jammy InRelease
Get:5 http://security.ubuntu.com/ubuntu jammy-security InRelease [110 kB]
Fetched 110 kB in 1s (195 kB/s)
Reading package lists... Done
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
E: Unable to locate package nvidia-container-toolkit-base

And the solution I found was to:

wget https://nvidia.github.io/nvidia-docker/gpgkey --no-check-certificate
sudo apt-key add gpgkey
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update

sudo apt-get install -y nvidia-container-toolkit

This fixes the problem, but I still get the following error when running this command:

docker run --runtime=nvidia --shm-size=64g -p 7860:7860 -v ${HOME}/.cache:/root/.cache --rm h2o-llm -it generate.py --base_model=EleutherAI/gpt-neox-20b --lora_weights=h2ogpt_lora_weights --prompt_type=human_bot

docker: Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #1: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy' nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown.

Could someone help me with this? I am trying to run the Docker container. I also tried docker compose up, but I get the same error.

iamdempa avatar Apr 19 '23 22:04 iamdempa

Hi, please try the documentation here: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#docker

specifically try doing this first:

distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
      && curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
      && curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
            sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
            sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

This step may be required for apt to find the correct packages. It was probably missed in the instructions because I had already done it on my system at some point.
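One way to confirm that the repository step took effect (after running sudo apt-get update) is to ask apt whether it now has an install candidate for the package. The following is a minimal sketch, not part of the official instructions: has_candidate is a hypothetical helper name, and it assumes apt-cache policy prints a "Candidate:" line as it does on stock Ubuntu.

```shell
#!/usr/bin/env bash
# Sketch: report whether apt knows an installable candidate for a package.
# has_candidate is a hypothetical helper name, not part of apt or NVIDIA tooling.
has_candidate() {
    # Reads `apt-cache policy <pkg>` output on stdin.
    # Succeeds only if a "Candidate:" line exists and is not "(none)".
    awk '/^ *Candidate:/ { found = ($2 != "(none)") } END { exit !found }'
}

pkg=nvidia-container-toolkit
if apt-cache policy "$pkg" 2>/dev/null | has_candidate; then
    echo "$pkg: install candidate found"
else
    echo "$pkg: no install candidate; check the files in /etc/apt/sources.list.d/"
fi
```

If no candidate is reported, the repository list file was likely not written or apt-get update has not been run since it was added.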

Let us know if this fixes it. In the meantime, I'll update the instructions to include this step.

Thanks!

pseudotensor avatar Apr 19 '23 22:04 pseudotensor

Hi @pseudotensor, thank you for the commands. Yes, they fix the earlier problem, but I am still having issues with the latter, which is:

nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown

Could you also specify the minimum CPU/Memory requirements for a machine to run this Docker container?

Thank you, Best Regards

iamdempa avatar Apr 20 '23 06:04 iamdempa

The system requirements scale with the model size. For example, the 20B model requires four 48GB GPUs for generation, unless 8-bit is used, in which case two 48GB GPUs are enough.
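As a rough illustration of why the requirement scales with model size and drops with 8-bit, the memory needed for the weights alone is the parameter count times the bytes per parameter. This is only a weights-only lower bound under stated assumptions: weights_gb is a hypothetical helper name, and the 16-bit baseline is an assumption, since the comment above does not state the non-8-bit precision.

```shell
#!/usr/bin/env bash
# Weights-only lower bound: parameters (in billions) x bytes per parameter
# gives gigabytes (10^9 bytes). weights_gb is a hypothetical helper name.
weights_gb() {
    local params_billion=$1 bytes_per_param=$2
    echo $(( params_billion * bytes_per_param ))
}

# Assumed precisions: 2 bytes per parameter for 16-bit, 1 byte for 8-bit.
echo "20B, 16-bit weights: $(weights_gb 20 2) GB"
echo "20B, 8-bit weights:  $(weights_gb 20 1) GB"
```

The GPU counts quoted above are well beyond these figures, presumably because generation also needs memory for activations, the attention cache, and per-GPU overhead when the model is split across devices. The halving from 8-bit is consistent with the drop from four GPUs to two.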

pseudotensor avatar Apr 22 '23 10:04 pseudotensor

Hi @iamdempa, just checking again if you are still experiencing issues with the latest changes.

If so, I would be happy to help. We typically use the steps here to set up the CUDA toolkit: https://github.com/h2oai/h2ogpt/blob/main/docs/INSTALL.md#installing-cuda-toolkit but under different preconditions on your system, the CUDA libraries may not be found. In that case, check /etc/ld.so.conf.d/cuda... and make sure it points to the right location of libnvidia-ml, provided you can confirm that libnvidia-ml.so.1 is indeed installed somewhere on your system (find / -name 'libnvidia-ml*' 2> /dev/null). If you can share the result of the find command and how the ld cache is set up for your CUDA install, we can debug further.
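The checks described above can be collected into one small script whose output is easy to paste into the thread. This is a sketch under assumptions: in_ld_cache and diagnose are hypothetical helper names, and it relies only on the standard find, ldconfig -p, grep, and ls commands.

```shell
#!/usr/bin/env bash
# Sketch: gather the information requested above in one place.
# in_ld_cache and diagnose are hypothetical names used only for this example.

in_ld_cache() {
    # Reads `ldconfig -p` output on stdin; succeeds if libnvidia-ml.so.1 is listed.
    grep -q 'libnvidia-ml\.so\.1'
}

diagnose() {
    echo "== libnvidia-ml files on disk =="
    # Pattern is quoted so the shell does not expand it in the current directory.
    find / -name 'libnvidia-ml*' 2>/dev/null

    echo "== dynamic linker cache =="
    if ldconfig -p 2>/dev/null | in_ld_cache; then
        echo "libnvidia-ml.so.1 is in the ld cache"
    else
        echo "libnvidia-ml.so.1 is NOT in the ld cache"
    fi

    echo "== linker config directory =="
    ls /etc/ld.so.conf.d/ 2>/dev/null
}
```

After sourcing the file, run diagnose on the host (not inside the container). If the library is on disk but missing from the cache, adding its directory to a file under /etc/ld.so.conf.d/ and running sudo ldconfig is the usual way to refresh the cache. If find reports nothing, the library is not installed on the host at all; libnvidia-ml.so.1 is shipped with the NVIDIA driver.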

achraf-mer avatar Sep 25 '23 16:09 achraf-mer