
[ECS] How to share a single GPU with multiple containers

Open robvanderleek opened this issue 6 years ago • 18 comments

Summary

I'd like to share the single GPU of a p3.2xlarge instance with multiple containers in the same task.

Description

In the ECS task definition it's not possible to indicate a single GPU can be shared between containers (or to distribute the GPU resource over multiple containers like with CPU units).

I have multiple containers that require a GPU, but not at the same time. Is there a way to run them in a single task on the same instance? I've tried leaving the GPU unit resource blank, but then the GPU device is not visible to the container.

robvanderleek avatar Mar 07 '19 16:03 robvanderleek

Hey, we don't have support for sharing a single GPU with multiple containers right now. We have marked it as a feature request.

shubham2892 avatar Mar 07 '19 22:03 shubham2892

For future reference, my current workaround to have multiple containers share a single GPU:

  1. On a running ECS GPU-optimized instance, make the nvidia runtime the default for dockerd by adding `--default-runtime nvidia` to the OPTIONS variable in `/etc/sysconfig/docker`
  2. Save the instance to a new AMI
  3. In CloudFormation, go to the Stack created by the ECS cluster wizard and update the EcsAmiId field in the initial template
  4. Restart your services

Since the default runtime is now nvidia, all containers can access the GPU. You can leave the GPU field empty in the task definition wizard (or set it to 1 for just one container so that the task is still placed on a GPU instance).
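
For reference, after step 1 the OPTIONS line in `/etc/sysconfig/docker` ends up looking roughly like this (the pre-existing flags vary by AMI version, so treat this as a sketch):

```
# /etc/sysconfig/docker (excerpt): nvidia prepended as the default runtime
OPTIONS="--default-runtime nvidia --default-ulimit nofile=32768:65536"
```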

The major drawback of this workaround is, of course, that it forks the standard AMI.

robvanderleek avatar Mar 09 '19 09:03 robvanderleek

@robvanderleek: thanks for outlining this workaround for now =]

adnxn avatar Mar 11 '19 16:03 adnxn

@robvanderleek We have a solution for EKS now. Please let us know if you are interested in it.

Jeffwan avatar Jan 28 '20 01:01 Jeffwan

Hi @Jeffwan

Thanks for the notification but we are happy with what ECS offers in general. Our inference cluster is running fine on ECS, although we have a custom AMI with the nvidia-docker hack.

Do you expect this solution to also become available for ECS?

robvanderleek avatar Jan 30 '20 20:01 robvanderleek

@robvanderleek This is implemented as a device plugin in Kubernetes, so I doubt it can be used in ECS directly. The underlying GPU-sharing approach is similar, though, and I think ECS could adopt a comparable solution.

Jeffwan avatar Feb 06 '20 21:02 Jeffwan

Hi, just checking in to see if this has been made available for ECS yet, or should we continue with the AMI workaround?

vbhakta8 avatar Feb 27 '21 21:02 vbhakta8

Same question. Is there any expectation for when this might happen, or is this just an unfulfilled feature request with no slated plans for a fix at present?

nacitar avatar Apr 13 '21 17:04 nacitar

Thank you @robvanderleek !

We were able to take your suggestion and work it into our setup without having to fork the standard ECS GPU AMI.

We have our EC2 autoscaling group, which serves as a capacity provider for our cluster, provisioned via CloudFormation. As such, we modified the UserData script passed to the Launch Template that the ASG uses in order to make the default-runtime change you suggested.

Here is a working snippet:

```
(grep -q ^OPTIONS=\"--default-runtime /etc/sysconfig/docker && echo '/etc/sysconfig/docker needs no changes') || (sed -i 's/^OPTIONS="/OPTIONS="--default-runtime nvidia /' /etc/sysconfig/docker && echo '/etc/sysconfig/docker updated to have nvidia runtime as default' && systemctl restart docker && echo 'Restarted docker')
```
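
To put the snippet in context, the relevant part of our UserData ends up looking something like this (the cluster name here is just a placeholder):

```
#!/bin/bash
# Join the instance to the ECS cluster (cluster name is a placeholder)
echo "ECS_CLUSTER=my-gpu-cluster" >> /etc/ecs/ecs.config

# Prepend the nvidia default-runtime flag to OPTIONS if it isn't there yet,
# then restart Docker so the change takes effect
(grep -q ^OPTIONS=\"--default-runtime /etc/sysconfig/docker && echo '/etc/sysconfig/docker needs no changes') || (sed -i 's/^OPTIONS="/OPTIONS="--default-runtime nvidia /' /etc/sysconfig/docker && echo '/etc/sysconfig/docker updated to have nvidia runtime as default' && systemctl restart docker && echo 'Restarted docker')
```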

Thought this was worth sharing because, although it isn't as good as a real fix, it significantly reduces the maintenance burden compared to forking the AMI. Keep in mind that you still have to omit the GPU constraint from your tasks and ensure GPU instances are used through other means.

nacitar avatar Apr 14 '21 19:04 nacitar

Hi @nacitar, I was also facing this issue of assigning multiple containers the same GPU on an ECS g4dn.xlarge instance. Is this feature available through the task definition now, or is the hack above the only available option as of now?

adesgautam avatar Aug 11 '21 11:08 adesgautam

> Hi @nacitar, I was also facing this issue of assigning multiple containers the same GPU on an ECS g4dn.xlarge instance. Is this feature available through the task definition now, or is the hack above the only available option as of now?

@adesgautam I know of no changes on the AWS side to improve this and am still relying on the workaround I mentioned above, which has been working without issue since it was implemented.

nacitar avatar Aug 12 '21 17:08 nacitar

With Docker version 20.10.7, I also had to pass the `NVIDIA_VISIBLE_DEVICES=0` environment variable for my container to pick up the GPU.
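
For anyone testing directly on the instance before wiring this into an ECS task definition, the equivalent with plain docker run looks roughly like this (the CUDA image tag is just an example):

```
# With nvidia as the default runtime, expose only GPU 0 to the container
docker run --rm -e NVIDIA_VISIBLE_DEVICES=0 nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
```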

spg avatar May 12 '22 14:05 spg

> @robvanderleek We have a solution for EKS now. Please let us know if you are interested in it.

Hi, we are interested in the EKS solution but couldn't find anything in the AWS documentation. Could you please share links to whatever documentation you have regarding the EKS solution?

Shurbeski avatar Jun 23 '22 09:06 Shurbeski

Hi, we are experiencing the same issue, but in a hybrid environment with GPUs on premises. Do you have any suggestions, or does the issue persist in this case as well?

NikiBase avatar Feb 09 '23 15:02 NikiBase

If there are any drawbacks to this solution, let me know. I modified the Docker sysconfig file in the EC2 user data section like this:

```
#!/bin/bash
sudo rm /etc/sysconfig/docker
echo DAEMON_MAXFILES=1048576 | sudo tee -a /etc/sysconfig/docker
echo OPTIONS="--default-ulimit nofile=32768:65536 --default-runtime nvidia" | sudo tee -a /etc/sysconfig/docker
echo DAEMON_PIDFILE_TIMEOUT=10 | sudo tee -a /etc/sysconfig/docker
sudo systemctl restart docker
```

It does not require creating a new AMI and, so far, it seems to work.
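
A quick way to sanity-check the instance after the user data runs (any CUDA base image will do; the tag below is just an example):

```
# Confirm the flag landed in the Docker options and that containers can see the GPU
grep OPTIONS /etc/sysconfig/docker
docker run --rm nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
```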

makr11 avatar Jun 27 '23 09:06 makr11

Any updates on this?

NayamAmarshe avatar May 29 '24 13:05 NayamAmarshe

> Here is a working snippet: `(grep -q ^OPTIONS=\"--default-runtime /etc/sysconfig/docker && echo '/etc/sysconfig/docker needs no changes') || (sed -i 's/^OPTIONS="/OPTIONS="--default-runtime nvidia /' /etc/sysconfig/docker && echo '/etc/sysconfig/docker updated to have nvidia runtime as default' && systemctl restart docker && echo 'Restarted docker')`

This script didn't work for me so I had to do this instead:

```
sudo bash -c 'grep -q "^OPTIONS=\"--default-runtime nvidia " /etc/sysconfig/docker && echo "/etc/sysconfig/docker needs no changes" || (sed -i "s/^OPTIONS=\"/OPTIONS=\"--default-runtime nvidia /" /etc/sysconfig/docker && echo "/etc/sysconfig/docker updated to have nvidia runtime as default" && systemctl restart docker && echo "Restarted docker")'
```

NayamAmarshe avatar Aug 21 '24 15:08 NayamAmarshe

To enable sharing the GPU between services:

  1. Update the user data script on the launch template as shown below (change the cluster name):

```
#!/bin/bash

# Set the ECS cluster name
echo "ECS_CLUSTER=test_gpu" >> /etc/ecs/ecs.config

# Remove existing Docker configuration
sudo rm /etc/sysconfig/docker

# Configure Docker to use the NVIDIA runtime by default
echo 'DAEMON_MAXFILES=1048576' | sudo tee -a /etc/sysconfig/docker
echo 'OPTIONS="--default-ulimit nofile=32768:65536 --default-runtime=nvidia"' | sudo tee -a /etc/sysconfig/docker
echo 'DAEMON_PIDFILE_TIMEOUT=10' | sudo tee -a /etc/sysconfig/docker

# Restart the Docker service to apply changes
sudo systemctl restart docker
```

  2. Add the following to all your task definitions:

```
"environment": [
  { "name": "NVIDIA_VISIBLE_DEVICES", "value": "all" }
]
```

  3. Make sure that you do not add GPU resource requirements (`"resourceRequirements": [ { "value": "1", "type": "GPU" } ]`) to the task definition.

mshamia avatar Apr 11 '25 23:04 mshamia