docker run --gpus=all fails because nvidia-smi is not available after bootstrap
During cluster bootstrap the NVIDIA drivers are installed, but they are not loaded and so cannot be used. It appears that a reboot is required before nvidia-smi becomes available. Because the drivers are not loaded, the command below fails:
docker run --net=host --gpus=all ...
from dask_cloudprovider.gcp import GCPCluster


def test_dask_gcp_cluster_gpu():
    cluster = GCPCluster(
        machine_type="n1-standard-8",
        n_workers=1,
        filesystem_size=100,
        gpu_type="nvidia-tesla-t4",
        ngpus=1,
    )
cloud-init-output.log
Status: Downloaded newer image for daskdev/dask:latest
docker: Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: nvml error: driver not loaded: unknown.
If GPUs are being used, either the default image should already have the drivers installed and usable, or the NVIDIA driver should be loaded after installation without requiring a reboot.
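The second option could be pictured as a readiness check that runs before the container is started. The sketch below is only an illustration, not how dask-cloudprovider currently behaves: `driver_ready` and `wait_for_driver` are hypothetical helper names, and it assumes that a zero exit status from `nvidia-smi` is a good enough signal that the driver is loaded.

```python
import shutil
import subprocess
import time


def driver_ready(cmd=("nvidia-smi",)):
    """Return True if the probe command exists and exits with status 0."""
    # Hypothetical helper: treats a successful probe as "driver is loaded".
    if shutil.which(cmd[0]) is None:
        return False
    result = subprocess.run(
        list(cmd),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        check=False,
    )
    return result.returncode == 0


def wait_for_driver(probe=driver_ready, timeout=300.0, interval=5.0):
    """Poll `probe` until it returns True or `timeout` seconds have passed."""
    deadline = time.monotonic() + timeout
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

If `wait_for_driver()` returns False, the bootstrap could fail early with a clear message instead of surfacing the nvidia-container-cli error from the log above.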
Environment:
- Dask version: 2022.9.2
- Python version: 3.10
- Operating System: ubuntu-os-cloud/global/images/ubuntu-minimal-1804-bionic-v20201014
- Install method (conda, pip, source): pip
The mandatory --gpus=all flag is also a problem when using Container-Optimized OS (COS). I can run GPU examples in the Ubuntu-based CUDA Docker images by following the instructions at https://cloud.google.com/container-optimized-os/docs/how-to/run-gpus#e2e, but the --gpus=all flag is not needed there and does not work when using nvidia-container-runtime.
These are the kwargs that would make COS work if the --gpus=all flag were not added:
cos_args = {
    # Use COS image with an LTS milestone.
    # https://cloud.google.com/container-optimized-os/docs/how-to/run-gpus#requirements
    "source_image": "projects/cos-cloud/global/images/cos-101-lts",
    # https://cloud.google.com/container-optimized-os/docs/how-to/run-gpus#installing_drivers_through_cloud-init
    # This step takes ~2 minutes.
    "extra_bootstrap": [
        "cos-extensions install gpu",
        "mount --bind /var/lib/nvidia /var/lib/nvidia",
        "mount -o remount,exec /var/lib/nvidia",
    ],
    "docker_args": " ".join(
        [
            "--volume /var/lib/nvidia/lib64:/usr/local/nvidia/lib64",
            "--volume /var/lib/nvidia/bin:/usr/local/nvidia/bin",
            "--device /dev/nvidia0:/dev/nvidia0",
            "--device /dev/nvidia-uvm:/dev/nvidia-uvm",
            "--device /dev/nvidiactl:/dev/nvidiactl",
        ]
    ),
    "bootstrap": False,
}
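The `docker_args` value above hard-codes a single GPU (`/dev/nvidia0`). Below is a minimal sketch of how the same string could be built for several GPUs. `cos_docker_args` is a hypothetical helper, not part of dask-cloudprovider, and it assumes the GPUs are exposed as `/dev/nvidia0`, `/dev/nvidia1`, and so on.

```python
def cos_docker_args(ngpus=1, nvidia_root="/var/lib/nvidia"):
    """Build the docker argument string used on COS for `ngpus` GPUs."""
    if ngpus < 1:
        raise ValueError("ngpus must be at least 1")
    # Driver libraries and binaries installed by `cos-extensions install gpu`.
    args = [
        f"--volume {nvidia_root}/lib64:/usr/local/nvidia/lib64",
        f"--volume {nvidia_root}/bin:/usr/local/nvidia/bin",
    ]
    # One device node per GPU (assumed naming: /dev/nvidia<index>).
    args += [f"--device /dev/nvidia{i}:/dev/nvidia{i}" for i in range(ngpus)]
    # Shared control devices.
    args += [
        "--device /dev/nvidia-uvm:/dev/nvidia-uvm",
        "--device /dev/nvidiactl:/dev/nvidiactl",
    ]
    return " ".join(args)
```

With `ngpus=1` this produces the same string as the `docker_args` entry in `cos_args`.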
@siddharthab Only Ubuntu is currently supported in dask-cloudprovider.