
[Bug]: DinD doesn't allow passing `--gpus` flag

mtaran opened this issue 1 year ago

Steps to reproduce

  1. Make a `repro.dstack.yml` with:

         type: task
         name: my-repro-task
         image: dstackai/dind:latest
         privileged: true
         commands:
           - start-dockerd
           - sleep infinity
         resources:
           cpu: 4..
           memory: 6GB..
           gpu:
             count: 1

  2. Run `dstack apply -f repro.dstack.yml -y`.
  3. In another terminal, run `dstack attach my-repro-task`.
  4. In yet another terminal, run `ssh my-repro-task`.
  5. In the SSH session, try to run `docker run --rm --gpus=all hello-world` (the full sequence is consolidated in the sketch below).
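
For convenience, here is the same reproduction grouped by terminal. The commands are exactly the ones from the steps above; the task name `my-repro-task` comes from the config.

```sh
# Terminal 1: submit the task defined in repro.dstack.yml
dstack apply -f repro.dstack.yml -y

# Terminal 2: attach to the running task
dstack attach my-repro-task

# Terminal 3: SSH into the task and try to use the GPUs from the inner Docker
ssh my-repro-task
docker run --rm --gpus=all hello-world
```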

Actual behaviour

docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: mount error: mount operation failed: /usr/bin/nvidia-smi: no such file or directory: unknown.
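
Not part of the original report, but a quick way to narrow this down from the same SSH session is to check whether the path the hook complains about exists inside the DinD container at all (the `/usr/bin/nvidia-smi` path is taken from the error above):

```sh
# Does the file the OCI hook tries to bind-mount exist in this filesystem?
ls -l /usr/bin/nvidia-smi

# Is the NVIDIA tooling mentioned in the error present at all?
which nvidia-smi nvidia-container-cli
```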

Expected behaviour

The container should run, with access to all the GPUs of the host.
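
For illustration (not from the original report), once `--gpus` is honoured one would expect both the original command and a plain CUDA image to see the host GPUs; the image tag below is only an example:

```sh
# Expected to complete without the nvidia-container-cli mount error:
docker run --rm --gpus=all hello-world

# Expected to print the host's GPU table from inside the inner container:
docker run --rm --gpus=all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
```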

dstack version

0.18.18

Server logs

No response

Additional information

No response

mtaran · Oct 17 '24 05:10

@mtaran Is the issue still relevant? You mentioned you made it work.

peterschmidt85 · Oct 17 '24 06:10

As TensorDock is “a marketplace of independent hosts”, the setups are not consistent at all.

In addition to

nvidia-container-cli: mount error: mount operation failed: /usr/bin/nvidia-smi: no such file or directory: unknown.

I've got

nvidia-container-cli: mount error: mount operation failed: /usr/bin/nv-fabricmanager: no such file or directory: unknown

and

nvidia-container-cli: mount error: mount operation failed: /usr/bin/nvidia-persistenced: no such file or directory: unknown

when requesting instances with the same resources configuration.

NVIDIA/CUDA driver versions also vary:

NVIDIA-SMI 535.161.07             Driver Version: 535.161.07   CUDA Version: 12.2

and

NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4   

on two NVIDIA RTX A4000 instances in the same region.
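
As a side note (not from the original comment), which of these binaries are present can be checked up front on a freshly provisioned host. The list below is taken from the three mount errors quoted above, and the assumption that the hook looks under `/usr/bin` matches those error messages:

```sh
# Report which of the NVIDIA userspace binaries named in the mount errors
# are actually present on this host.
for f in nvidia-smi nv-fabricmanager nvidia-persistenced; do
  if [ -x "/usr/bin/$f" ]; then
    echo "present: /usr/bin/$f"
  else
    echo "missing: /usr/bin/$f"
  fi
done
```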

un-def · Oct 18 '24 10:10

This issue is stale because it has been open for 30 days with no activity.

github-actions[bot] · Nov 18 '24 02:11

This issue was closed because it has been inactive for 14 days since being marked as stale. Please reopen the issue if it is still relevant.

github-actions[bot] · Dec 02 '24 02:12