Can't access GPU during build with docker compose v2
Description
Accessing the GPU during a build with Docker Compose v2 doesn't work.
It does work once the container is running, but some of my build steps need the GPU to compile CUDA code.
Neither the runtime flag nor the deploy resources flags described here make a difference.
The same setup does work with docker-compose v1.
Steps to reproduce the issue:
- docker compose v2 fails to build
The attached yml + Dockerfile fail with an AssertionError.
docker compose build nvidia-test
docker compose build nvidia-test-2
- docker-compose v1 works
Running with docker-compose v1 installed via pip, the attached yml and Dockerfile build successfully.
docker-compose build nvidia-test
docker-compose build nvidia-test-2
Output of docker compose version:
v2
docker compose version
Docker Compose version v2.6.0
v1
docker-compose version
docker-compose version 1.29.2, build unknown
docker-py version: 5.0.3
CPython version: 3.9.4
OpenSSL version: OpenSSL 1.1.1k  25 Mar 2021
Output of docker info:
Client:                                                  
 Context:    default                                                                                              
 Debug Mode: false                                                                                                
 Plugins:                                                                                                         
  app: Docker App (Docker Inc., v0.9.1-beta3)                                                                     
  buildx: Docker Buildx (Docker Inc., v0.8.2-docker)                                                              
  compose: Docker Compose (Docker Inc., v2.6.0)                                                                   
  scan: Docker Scan (Docker Inc., v0.17.0)                                                                        
                                                                                                                  
Server:                                                                                                           
 Containers: 34                                          
  Running: 2                                             
  Paused: 0            
  Stopped: 32
 Images: 31                                                                                                       
 Server Version: 20.10.17                                
 Storage Driver: overlay2                                                                                         
  Backing Filesystem: extfs                              
  Supports d_type: true 
  Native Overlay Diff: true
  userxattr: false                                       
 Logging Driver: json-file                                                                                        
 Cgroup Driver: cgroupfs                                                                                          
 Cgroup Version: 1                                                                                                
 Plugins:         
  Volume: local                                          
  Network: bridge host ipvlan macvlan null overlay       
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog                             
 Swarm: inactive                                         
 Runtimes: runc io.containerd.runc.v2 io.containerd.runtime.v1.linux nvidia                                       
 Default Runtime: nvidia                                 
 Init Binary: docker-init                                
 containerd version: 10c12954828e7c7c9b6e0ea9b0c02b01407d3ae1
 runc version: v1.1.2-0-ga916309
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: default
 Kernel Version: 5.15.0-1015-aws
 Operating System: Ubuntu 20.04.4 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 4
 Total Memory: 15.34GiB
 Name: ip-172-31-33-172
 ID: 7QW3:4AFO:BJBD:IH6R:IXVA:WWW2:Z5EL:HRH4:E4Y4:MFZD:KUWE:VH75
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false
Dockerfile
FROM pytorch/pytorch:1.12.0-cuda11.3-cudnn8-runtime
RUN python -c "import torch;assert torch.cuda.is_available()"
docker-compose.yml
version: "3.9"
services:
  nvidia-test:
    build: ./
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [ gpu ]
  nvidia-test-2:
    build: ./
    runtime: nvidia
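For reference, `runtime: nvidia` in the second service corresponds to the plain Docker flag `--runtime=nvidia`. A quick sanity check outside Compose (a sketch, assuming the NVIDIA runtime is installed and a GPU is present on the host) is:

```shell
# Run the same CUDA check the Dockerfile uses, but at container *run* time
# rather than build time.
docker run --rm --runtime=nvidia \
  pytorch/pytorch:1.12.0-cuda11.3-cudnn8-runtime \
  python -c "import torch; assert torch.cuda.is_available()"
```

If this succeeds while the build fails, the problem is isolated to the build path.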
We experience the same issue. This is currently holding us back from making the transition to compose v2 and the cli plugin.
Can you try running without buildkit and see if the result is any different?
DOCKER_BUILDKIT=0 docker compose build nvidia-test
No, disabling buildkit gives the same error. Specifically it gives:
ERROR: CUDA initialization failure with error 35.
With "default-runtime" set in /etc/docker/daemon.json and compose v1, the same machine can initialize CUDA without problems during build steps.
P.S.: The initial author of the issue has "nvidia" as the default runtime as well. I don't understand how this doesn't apply to compose v2 if it applies to compose v1.
Btw, we would be very happy to get rid of the default runtime setting. The only issue is that this has been the only reliable solution in the past years to get GPU support into the containers, as this issue proves again today.
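For anyone following along, the default-runtime workaround referred to above is set in /etc/docker/daemon.json; a minimal sketch (the runtime binary path may differ per install) looks like:

```json
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
```

The daemon has to be restarted for this to take effect. With this in place, every container, including the classic builder's intermediate build containers, runs under the NVIDIA runtime, which is why compose v1 builds can see the GPU.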
To clarify what we tried:
compose v2.6 + runc default runtime + deploy>resources>devices>gpu in YML + DOCKER_BUILDKIT=0 docker compose build -> cuda init error
compose v1 + nvidia default runtime + docker-compose build -> success
I am experiencing the same problem
Related issues:
https://github.com/moby/buildkit/issues/1436 (adding GPUs to run commands), and https://github.com/moby/buildkit/issues/2485 (adding alternative runtimes to buildkit)
Tbh I feel like putting this in the Dockerfile is the right way to fix this.
deploy>resources>devices>gpu (as the naming implies) defines the resources allocated to the running container, not to the build.
Can you please try running the build with DOCKER_BUILDKIT=0 docker compose build? This will use the "classic" builder, which doesn't involve BuildKit.
I'm also having this problem, and disabling BuildKit with DOCKER_BUILDKIT=0 solves this strange problem for me. Isn't there any other way to fix this?
DOCKER_BUILDKIT=0 solves this issue for me as well, though it would be nice to have a reference to it in the documentation.
I'm probably being a noob here but is there a way to set DOCKER_BUILDKIT=0 in the docker-compose.yml file for that specific service, instead of adding it to the docker compose up command?
This isn't really a solution. I want to use BuildKit; it provides cache mounts, which speed up builds a lot.
Right now I'm building with docker-compose (v1) and running the containers with docker compose (v2), which works for now.
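To illustrate why falling back to the classic builder hurts: the BuildKit cache mounts mentioned above look roughly like this (a sketch; the package name and cache path are placeholders):

```dockerfile
# syntax=docker/dockerfile:1
FROM pytorch/pytorch:1.12.0-cuda11.3-cudnn8-runtime

# BuildKit-only feature: pip's download cache persists across builds,
# so dependencies aren't re-downloaded every time. The classic builder
# (DOCKER_BUILDKIT=0) rejects the --mount flag entirely.
RUN --mount=type=cache,target=/root/.cache/pip \
    pip install some-package
```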
@danielgafni Are you saying that by using docker-compose this problem can be averted and we can ALSO use buildkit ?
Exactly (this is literally the original issue lol).
@danielgafni buildkit doesn't support GPU devices (yet) see https://github.com/moby/buildkit/issues/1436 and https://github.com/moby/buildkit/issues/2485
I'm closing this issue, as the same issue applies to plain docker build once BuildKit is set as the default builder (which is the case in Docker Desktop). Docker Compose will obviously add GPU support when building images once this feature is available in BuildKit.