
[ECS] Add support for GPU with Docker 19.03

Open mfortin opened this issue 4 years ago • 12 comments

Summary

Docker 19.03 now has built-in support for GPUs, so there is no longer a need to specify an alternate runtime. However, --gpus all (or a specific set of GPUs) must be passed as an argument to docker run, and this cannot be set in the dockerd configuration.
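
For context, the pre-19.03 approach worked around the missing per-run flag by making the NVIDIA runtime the daemon-wide default, which is exactly the kind of thing that can live in dockerd config. A rough sketch of the two approaches (the runtime path assumes the nvidia-container-runtime package is installed):

```shell
# Pre-19.03 workaround: make the NVIDIA runtime the daemon-wide default
# in /etc/docker/daemon.json, so no per-container flag is needed:
#
#   {
#     "default-runtime": "nvidia",
#     "runtimes": {
#       "nvidia": { "path": "nvidia-container-runtime" }
#     }
#   }

# Docker 19.03+: GPUs are requested per container instead, which is
# precisely what cannot be expressed in the daemon configuration:
docker run --gpus all --rm nvidia/cuda:10.1-base nvidia-smi
```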

Description

Without --gpus all

$ docker run --rm nvidia/cuda:10.1-base nvidia-smi
docker: Error response from daemon: OCI runtime create failed: container_linux.go:345: starting container process caused "exec: \"nvidia-smi\": executable file not found in $PATH": unknown.

With --gpus all

$ docker run --gpus all --runtime runc --rm nvidia/cuda:10.1-base nvidia-smi
Thu Aug 29 13:54:36 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.00    Driver Version: 418.87.00    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00000000:00:0F.0 Off |                    0 |
| N/A   35C    P8    26W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           On   | 00000000:00:10.0 Off |                    0 |
| N/A   32C    P8    29W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K80           On   | 00000000:00:11.0 Off |                    0 |
| N/A   40C    P8    27W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K80           On   | 00000000:00:12.0 Off |                    0 |
| N/A   35C    P8    29W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla K80           On   | 00000000:00:13.0 Off |                    0 |
| N/A   36C    P8    26W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla K80           On   | 00000000:00:14.0 Off |                    0 |
| N/A   34C    P8    30W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla K80           On   | 00000000:00:15.0 Off |                    0 |
| N/A   40C    P8    26W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla K80           On   | 00000000:00:16.0 Off |                    0 |
| N/A   33C    P8    29W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   8  Tesla K80           On   | 00000000:00:17.0 Off |                    0 |
| N/A   36C    P8    26W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   9  Tesla K80           On   | 00000000:00:18.0 Off |                    0 |
| N/A   31C    P8    30W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|  10  Tesla K80           On   | 00000000:00:19.0 Off |                    0 |
| N/A   37C    P8    26W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|  11  Tesla K80           On   | 00000000:00:1A.0 Off |                    0 |
| N/A   33C    P8    29W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|  12  Tesla K80           On   | 00000000:00:1B.0 Off |                    0 |
| N/A   38C    P8    26W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|  13  Tesla K80           On   | 00000000:00:1C.0 Off |                    0 |
| N/A   34C    P8    32W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|  14  Tesla K80           On   | 00000000:00:1D.0 Off |                    0 |
| N/A   39C    P8    27W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|  15  Tesla K80           On   | 00000000:00:1E.0 Off |                    0 |
| N/A   34C    P8    30W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
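
As noted in the summary, a specific set of GPUs can be requested instead of all of them; a sketch (device indices are illustrative):

```shell
# Request only GPUs 0 and 1; note the nested quoting the CLI requires
docker run --gpus '"device=0,1"' --rm nvidia/cuda:10.1-base nvidia-smi

# Or request GPUs by count rather than by index
docker run --gpus 2 --rm nvidia/cuda:10.1-base nvidia-smi
```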

Expected Behavior

Containers with GPU requirements should start on ECS without requiring the NVIDIA runtime to be configured as an alternate runtime.
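
ECS task definitions already declare GPU needs via resourceRequirements; on Docker 19.03+ the agent could presumably satisfy that reservation with --gpus instead of the nvidia runtime. A sketch of such a task definition (family, name, and image are illustrative):

```shell
# taskdef.json: container definition reserving 2 GPUs; the ECS agent
# maps this reservation onto the eventual docker run invocation
cat > taskdef.json <<'EOF'
{
  "family": "gpu-job",
  "containerDefinitions": [{
    "name": "gpu-job",
    "image": "nvidia/cuda:10.1-base",
    "memory": 512,
    "command": ["nvidia-smi"],
    "resourceRequirements": [{ "type": "GPU", "value": "2" }]
  }]
}
EOF
aws ecs register-task-definition --cli-input-json file://taskdef.json
```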

Observed Behavior

Without the --gpus flag, the container cannot see the GPUs and nvidia-smi fails with the "executable file not found" error shown above.
Environment Details

$ docker info
Client:
 Debug Mode: false

Server:
 Containers: 2
  Running: 1
  Paused: 0
  Stopped: 1
 Images: 4
 Server Version: 19.03.1
 Storage Driver: overlay2
  Backing Filesystem: xfs
  Supports d_type: true
  Native Overlay Diff: true
 Logging Driver: splunk
 Cgroup Driver: cgroupfs
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 894b81a4b802e4eb2a91d1ce216b8817763c29fb
 runc version: 425e105d5a03fabd737a126ad93d62a9eeede87f
 init version: fec3683
 Security Options:
  seccomp
   Profile: default
 Kernel Version: 3.10.0-957.27.2.el7.x86_64
 Operating System: CentOS Linux 7 (Core)
 OSType: linux
 Architecture: x86_64
 CPUs: 64
 Total Memory: 720.3GiB
 Name: ip-10-45-8-153.us-west-2.compute.internal
 ID: GA4Z:BCED:2FQG:AUKO:KUAX:7X5W:SBAR:NWB3:IHCH:6HQN:TIFW:PLOB
 Docker Root Dir: /var/lib/docker
 Debug Mode: true
  File Descriptors: 31
  Goroutines: 51
  System Time: 2019-08-29T13:55:51.172917082Z
  EventsListeners: 0
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: true
$ curl http://localhost:51678/v1/metadata
{"Cluster":"BATCHCLUSTER_Batch_0d2792a4-22a0-37e9-8e8a-5e8b68c1be17","ContainerInstanceArn":"arn:aws:ecs:us-west-2::container-instance/BATCHCLUSTER_Batch_0d2792a4-22a0-37e9-8e8a-5e8b68c1be17/50e649e34b83423189684b82669a1cea","Version":"Amazon ECS Agent - v1.30.0 (02ff320c)"}

Supporting Log Snippets

Logs can be provided on request.

mfortin · Aug 29 '19 13:08