[Bug]: DinD doesn't allow passing `--gpus` flag
Steps to reproduce
- make a `repro.dstack.yml` with:

      type: task
      name: my-repro-task
      image: dstackai/dind:latest
      privileged: true
      commands:
        - start-dockerd
        - sleep infinity
      resources:
        cpu: 4..
        memory: 6GB..
        gpu:
          count: 1

- `dstack apply -f repro.dstack.yml -y`
- in another terminal: `dstack attach my-repro-task`
- in yet another terminal: `ssh my-repro-task`
- in the ssh session, try to run `docker run --rm --gpus=all hello-world`
Actual behaviour
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: mount error: mount operation failed: /usr/bin/nvidia-smi: no such file or directory: unknown.
Expected behaviour
The container should run, with access to all the GPUs of the host.
dstack version
0.18.18
Server logs
No response
Additional information
No response
@mtaran Is the issue still relevant? You mentioned you made it work.
As TensorDock is “a marketplace of independent hosts”, the setups are not consistent at all.
In addition to
nvidia-container-cli: mount error: mount operation failed: /usr/bin/nvidia-smi: no such file or directory: unknown.
I've got
nvidia-container-cli: mount error: mount operation failed: /usr/bin/nv-fabricmanager: no such file or directory: unknown
and
nvidia-container-cli: mount error: mount operation failed: /usr/bin/nvidia-persistenced: no such file or directory: unknown
when requesting instances with the same resources configuration.
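Each of these errors names a binary that `nvidia-container-cli` tried to mount and could not find. A rough sketch for checking this up front in the environment where `docker run --gpus=all` fails: it only reports which of the named files are absent and does not explain why. The path list is copied from the error messages in this thread and may well be incomplete for other toolkit versions; `missing_driver_binaries` and its `root` parameter are hypothetical names used for illustration.

```python
import os

# Paths taken verbatim from the error messages in this thread. The full set
# that nvidia-container-cli mounts may differ by toolkit version (assumption).
DRIVER_BINARIES = (
    "/usr/bin/nvidia-smi",
    "/usr/bin/nv-fabricmanager",
    "/usr/bin/nvidia-persistenced",
)


def missing_driver_binaries(paths=DRIVER_BINARIES, root="/"):
    """Return the entries of `paths` that do not exist under `root`."""
    missing = []
    for path in paths:
        # Re-root the absolute path so the check can also be pointed at a
        # mounted or extracted filesystem instead of the live one.
        candidate = os.path.join(root, path.lstrip("/"))
        if not os.path.exists(candidate):
            missing.append(path)
    return missing


if __name__ == "__main__":
    for path in missing_driver_binaries():
        print(f"missing: {path}")
```

Running it on two instances with the same resources configuration should make the differences between hosts visible without having to start a container first.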
NVIDIA/CUDA driver versions also vary:
NVIDIA-SMI 535.161.07 Driver Version: 535.161.07 CUDA Version: 12.2
and
NVIDIA-SMI 550.54.15 Driver Version: 550.54.15 CUDA Version: 12.4
on two NVIDIA RTX A4000 instances in the same region.
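To compare hosts without reading the banners by eye, the header lines quoted above can be parsed. This is a sketch that assumes the banner keeps the `NVIDIA-SMI`, `Driver Version:`, `CUDA Version:` layout shown here; `parse_smi_header` is a hypothetical helper, not part of any NVIDIA or dstack tooling.

```python
import re

# Matches the nvidia-smi banner fields as they appear in the two lines above.
_HEADER = re.compile(
    r"NVIDIA-SMI\s+(?P<smi>[\d.]+)\s+"
    r"Driver Version:\s+(?P<driver>[\d.]+)\s+"
    r"CUDA Version:\s+(?P<cuda>[\d.]+)"
)


def parse_smi_header(line):
    """Return a dict with smi, driver and cuda versions, or None if no match."""
    match = _HEADER.search(line)
    return match.groupdict() if match else None


# The two banners reported for the RTX A4000 instances in this thread.
hosts = [
    "NVIDIA-SMI 535.161.07 Driver Version: 535.161.07 CUDA Version: 12.2",
    "NVIDIA-SMI 550.54.15 Driver Version: 550.54.15 CUDA Version: 12.4",
]

for line in hosts:
    print(parse_smi_header(line))
```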
This issue is stale because it has been open for 30 days with no activity.
This issue was closed because it has been inactive for 14 days since being marked as stale. Please reopen the issue if it is still relevant.