
Multi-GPU CIFAR10

Open · pspitler3 opened this issue 8 years ago · 1 comment

Hey Everyone,

I was trying to get a dense example of DSSTNE working, so I used the code provided in the cifar-10 directory. I built the data using the code in the comments of the dparse.cpp file and got it running successfully on a single GPU. The following commands ran the code successfully on both a g2.2xlarge and a g2.8xlarge:

train -c config.json -i cifar10_training.nc -o cifar10_training.nc -n gl.nc -b 256 -e 10

mpirun --allow-run-as-root -np 1 train -c config.json -i cifar10_training.nc -o cifar10_training.nc -n gl.nc -b 256 -e 10

However, when I changed the command to run on 2 or more GPUs, it errored out (on the g2.8xlarge).

The command that I ran:

mpirun --allow-run-as-root -np 2 train -c config.json -i cifar10_training.nc -o cifar10_training.nc -n gl.nc -b 256 -e 10

The error message I received:

NNLayer::Allocate: Allocating 524288 bytes (512, 256) of unit data for layer Hidden10
[a53ad98f9463:00065] *** Process received signal ***
[a53ad98f9463:00065] Signal: Floating point exception (8)
[a53ad98f9463:00065] Signal code: Integer divide-by-zero (1)
[a53ad98f9463:00065] Failing at address: 0x7f3acc78891a
[a53ad98f9463:00065] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x10330)[0x7f3ac6461330]
[a53ad98f9463:00065] [ 1] /usr/lib/x86_64-linux-gnu/libcudnn.so.5(+0x2f91a)[0x7f3acc78891a]
[a53ad98f9463:00065] [ 2] /usr/lib/x86_64-linux-gnu/libcudnn.so.5(+0xbf27d)[0x7f3acc81827d]
[a53ad98f9463:00065] [ 3] /usr/lib/x86_64-linux-gnu/libcudnn.so.5(cudnnGetConvolutionForwardWorkspaceSize+0x627)[0x7f3acc7880f7]
[a53ad98f9463:00065] [ 4] /usr/lib/x86_64-linux-gnu/libcudnn.so.5(+0x99a75)[0x7f3acc7f2a75]
[a53ad98f9463:00065] [ 5] train[0x522634]
[a53ad98f9463:00065] [ 6] train[0x478883]
[a53ad98f9463:00065] [ 7] train[0x47e442]
[a53ad98f9463:00065] [ 8] train[0x408dac]
[a53ad98f9463:00065] [ 9] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5)[0x7f3ac60adf45]
[a53ad98f9463:00065] [10] train[0x40aef1]
[a53ad98f9463:00065] *** End of error message ***
NNLayer::Allocate: Allocating 524288 bytes (512, 256) of delta data for layer Hidden10
NNLayer::Allocate: Allocating 524288 bytes (512, 256) of dropout data for layer Hidden10
NNLayer::Allocate: Deallocating all data for layer Output
NNLayer::Allocate: Allocating 5120 bytes (5, 256) of unit data for layer Output
NNLayer::Allocate: Allocating 5120 bytes (5, 256) of delta data for layer Output
Getting algorithm between Input and C1
Output layer C1 has incorrectly calculated dimensions for cuDNN.
GpuContext::Shutdown: Shutting down cuBLAS on GPU for process 0
GpuContext::Shutdown: CuBLAS shut down on GPU for process 0
GpuContext::Shutdown: Shutting down cuDNN on GPU for process 0
GpuContext::Shutdown: CuDNN shut down on GPU for process 0
GpuContext::Shutdown: Shutting down cuRand on GPU for process 0
GpuContext::Shutdown: CuRand shut down on GPU for process 0
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 65 on node a53ad98f9463 exited on signal 8 (Floating point exception).
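If it helps narrow things down: the integer divide-by-zero inside cudnnGetConvolutionForwardWorkspaceSize, together with "Output layer C1 has incorrectly calculated dimensions for cuDNN", looks like a per-GPU layer dimension going to zero when the network is split across processes. Here is a toy sketch of how that kind of split can go wrong (my own illustration, not DSSTNE's actual sharding code):

```python
def local_width(total_width: int, num_ranks: int, rank: int) -> int:
    """Split a layer dimension as evenly as possible across MPI ranks."""
    base, rem = divmod(total_width, num_ranks)
    return base + (1 if rank < rem else 0)

# With 1 rank the full dimension survives, matching the working -np 1 run:
print(local_width(512, 1, 0))                       # 512
# With 2 ranks each shard gets half:
print([local_width(512, 2, r) for r in range(2)])   # [256, 256]
# But a small convolutional dimension can shard unevenly...
print([local_width(3, 2, r) for r in range(2)])     # [2, 1]
# ...or to zero, and a zero dimension handed to cuDNN would explain the fault:
print(local_width(1, 2, 1))                         # 0
```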

Please let me know what I should do to fix the code.

Thanks! -Pierce

pspitler3 avatar Jan 26 '17 16:01 pspitler3

Hey Pierce, stick to a single GPU for now. The multi-GPU edition of this is a work in progress, but I suspect you'll be happy with the results: a reimplementation of Krizhevsky's "one weird trick", customized for P2P GPUs.
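Roughly, that trick uses data parallelism for the convolutional layers (each GPU works on its own minibatch shard, then gradients are averaged across GPUs) and model parallelism for the fully connected layers. A toy NumPy sketch of the gradient-averaging half, with names of my own invention rather than DSSTNE APIs:

```python
import numpy as np

def allreduce_average(grads):
    """Stand-in for an MPI/P2P all-reduce: average one gradient per GPU."""
    return sum(grads) / len(grads)

rng = np.random.default_rng(0)
w = np.zeros(4)                                  # conv weights, replicated on every GPU
grads = [rng.normal(size=4) for _ in range(2)]   # per-GPU gradients from different minibatches
w -= 0.1 * allreduce_average(grads)              # every GPU applies the identical update
print(w)
```

Because every replica applies the same averaged gradient, the weights stay in sync without ever moving the conv layers' parameters between GPUs.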


scottlegrand avatar Jan 27 '17 04:01 scottlegrand