docker-deeplearning

For running deep neural network experiments on AWS EC2 g2.2xlarge / g2.8xlarge instances with Docker and Docker Machine.

This image includes

  • NVIDIA driver 346.46
  • CUDA 7.0
  • Anaconda 3.18.8 (Python 2.7.11)
  • Preconfigured .theanorc to use GPU and float32 by default
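
The image's actual .theanorc is not reproduced here. As a rough sketch, a Theano configuration that selects the GPU and float32 by default could look like the file written below. The `write_theanorc` helper and its default path are hypothetical names used only for illustration, and the file contents are an assumption about what "GPU and float32 by default" translates to.

```shell
# Hypothetical helper: write a minimal Theano config to the given path
# (defaults to ~/.theanorc). The contents are an assumption, not a copy
# of the file shipped in the image.
write_theanorc() {
    target="${1:-$HOME/.theanorc}"
    cat > "$target" <<'EOF'
[global]
device = gpu
floatX = float32
EOF
}
```

Calling `write_theanorc /tmp/theanorc-example` writes the sketch to a scratch file so it can be inspected without touching an existing configuration.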

Useful Commands

Preparing the host machine

The host machine needs to run the same version of the NVIDIA driver as the one inside the container, so I built an AMI based on the Ubuntu 14.04 HVM SSD AMI (ami-5c207736) using the following script.

sudo su -
apt-get update
apt-get install -y build-essential
apt-get install -y linux-headers-$(uname -r) linux-image-$(uname -r) linux-image-extra-$(uname -r)
# printf expands \n in every shell; bash's echo would write it literally without -e
printf 'blacklist nouveau\nblacklist lbm-nouveau\noptions nouveau modeset=0\nalias nouveau off\nalias lbm-nouveau off\n' > /etc/modprobe.d/blacklist-nouveau.conf
echo options nouveau modeset=0 | sudo tee -a /etc/modprobe.d/nouveau-kms.conf
update-initramfs -u

sudo su -
cd /opt
chmod +x cuda_*
./cuda_* -extract=`pwd`/nvidia_installers
cd nvidia_installers
./NVIDIA-Linux-x86_64-*.run -s
./cuda-linux64-rel-*.run -noprompt
./ -noprompt -cudaprefix=/usr/local/cuda
cd /usr/local/cuda/samples/1_Utilities/deviceQuery
ls /dev | grep nvidia

rm /opt/
rm -r /opt/nvidia_installers

You should save the instance as an AMI so you can reuse it later.
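
Since the host and container drivers must match, it may help to check the host's driver version before saving the AMI. The sketch below parses text in the format of /proc/driver/nvidia/version and compares it with the 346.46 listed for this image. The function names are hypothetical, and the assumed line format ("NVRM version: ... Kernel Module  <version>  <date>") should be checked against your own host.

```shell
# Expected driver version, taken from the image contents listed above.
EXPECTED_DRIVER="346.46"

# Read text in the (assumed) format of /proc/driver/nvidia/version from
# stdin and print the driver version found on the "NVRM version" line.
nvidia_driver_version() {
    sed -n 's/^NVRM version:.*Kernel Module  *\([0-9][0-9.]*\).*/\1/p'
}

# Return 0 if the version read from stdin equals EXPECTED_DRIVER.
driver_matches() {
    [ "$(nvidia_driver_version)" = "$EXPECTED_DRIVER" ]
}
```

On the host, this would be used as `driver_matches < /proc/driver/nvidia/version && echo "driver OK"`.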

To create a host using a spot instance

docker-machine create --driver amazonec2 \
    --amazonec2-ami ami-... \
    --amazonec2-access-key $AWS_ACCESS_KEY_ID \
    --amazonec2-secret-key $AWS_SECRET_ACCESS_KEY \
    --amazonec2-vpc-id vpc-... \
    --amazonec2-root-size 60 \
    --amazonec2-instance-type g2.2xlarge \
    --amazonec2-request-spot-instance \
    --amazonec2-spot-price 0.15 \
    aws01
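
The create command above reads the credentials from $AWS_ACCESS_KEY_ID and $AWS_SECRET_ACCESS_KEY. A small preflight check, sketched here with a hypothetical `check_aws_env` function, can report an unset variable before any request is sent to AWS.

```shell
# Report any required AWS variable that is unset or empty.
# Returns non-zero if at least one is missing.
check_aws_env() {
    missing=""
    for name in AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY; do
        # Indirect lookup of the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            missing="$missing $name"
        fi
    done
    if [ -n "$missing" ]; then
        echo "missing:$missing" >&2
        return 1
    fi
}
```

Running `check_aws_env && docker-machine create ...` would then skip the create step when a credential is missing.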

To activate the newly created instance

eval "$(docker-machine env aws01)"

To view all created hosts

docker-machine ls

SSH into the instance and run a sanity check

docker-machine ssh aws01
# Should see information about the GPU
ls /dev | grep nvidia
# Should see nvidia0 nvidiactl nvidia-uvm
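
The device check above can also be scripted. The sketch below, using a hypothetical `missing_nvidia_devices` function, prints whichever of the three expected device nodes is absent, which makes the nvidia-uvm case easy to spot.

```shell
# Print each expected NVIDIA device node that is missing from the given
# directory (defaults to /dev). Returns non-zero if any are missing.
missing_nvidia_devices() {
    dev_dir="${1:-/dev}"
    status=0
    for node in nvidia0 nvidiactl nvidia-uvm; do
        if [ ! -e "$dev_dir/$node" ]; then
            echo "$node"
            status=1
        fi
    done
    return "$status"
}
```

On the instance, `missing_nvidia_devices || echo "some device nodes are missing"` prints the missing node names followed by the warning.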

If nvidia-uvm is not found

docker-machine ssh aws01
ls /dev | grep nvidia

To terminate and remove the instance

docker-machine rm aws01

Running the image

To build this image

docker build -t felixlaumon/deeplearning .

Make sure the GPU is working inside the container

docker run -ti --device /dev/nvidia0:/dev/nvidia0 --device /dev/nvidiactl:/dev/nvidiactl --device /dev/nvidia-uvm:/dev/nvidia-uvm felixlaumon/deeplearning python -c "import theano"
# Should see "Using gpu device 0: GRID K520"

Debug inside the container

docker run -ti --device /dev/nvidia0:/dev/nvidia0 --device /dev/nvidiactl:/dev/nvidiactl --device /dev/nvidia-uvm:/dev/nvidia-uvm felixlaumon/deeplearning /bin/bash
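
Both docker run commands above repeat the same three --device flags. One way to avoid retyping them is a small helper that prints the flags; `nvidia_device_flags` is a hypothetical name, and the sketch assumes the three device nodes shown above.

```shell
# Print the --device flags for the three NVIDIA device nodes used above,
# separated by spaces, so they can be spliced into a docker run command.
nvidia_device_flags() {
    for node in nvidia0 nvidiactl nvidia-uvm; do
        printf '%s /dev/%s:/dev/%s ' --device "$node" "$node"
    done
}
```

The flags rely on word splitting, so the substitution is left unquoted: `docker run -ti $(nvidia_device_flags) felixlaumon/deeplearning /bin/bash`.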

To publish the image

docker push felixlaumon/deeplearning

To start over

docker stop $(docker ps -a -q)
docker rm $(docker ps -a -q)
docker rmi $(docker images -q)
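
When there are no containers or images, the command substitutions above expand to nothing and docker complains about missing arguments. A slightly more defensive variant, sketched with a hypothetical `docker_start_over` function, skips each step when its list is empty. Like the commands above, it removes every container and image on the active host.

```shell
# Stop and remove all containers, then remove all images, skipping each
# step when there is nothing to act on. This is destructive.
docker_start_over() {
    containers=$(docker ps -a -q)
    if [ -n "$containers" ]; then
        # Left unquoted on purpose: one argument per container ID.
        docker stop $containers
        docker rm $containers
    fi
    images=$(docker images -q)
    if [ -n "$images" ]; then
        docker rmi $images
    fi
}
```

The block only defines the function; run `docker_start_over` explicitly once you are sure the right host is active.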