
CUDA test failing

Open · willwilliams opened this issue 9 years ago · 1 comment

Out of the box I'm seeing ptest.lua in asyncsgd fail when I set:

local usecuda = true

I get the following:

$ mpiexec -np 2 luajit ptest.lua

rank 1 is client.
rank 0 is server.
0   use cpu
1   use gpu 1

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 30740 RUNNING AT code13
=   EXIT CODE: 139
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions

It looks like the problem is that the storages referenced in the asynchronous sends/receives are actually CudaStorages, which point to GPU memory. Should they in fact be FloatStorages? If I hack each CudaTensor to have a corresponding FloatStorage, it seems to work.

Thanks.

willwilliams, Jan 15 '16 14:01

You might not have installed MPI correctly. Maybe try installing the latest OpenMPI, configured with CUDA support as follows: ./configure --prefix=$POME/exe/$MPI --with-cuda=$POME/exe/cuda
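Before rebuilding, it may be worth checking whether the OpenMPI you already have was built with CUDA support. The snippet below is only a sketch: the function name `cuda_support_from_ompi_info` is made up for illustration, and it assumes the `ompi_info` tool that ships with OpenMPI reports the `mpi_built_with_cuda_support` parameter. It will not work with other MPI implementations.

```shell
#!/bin/sh
# Hypothetical helper (name is an assumption, not part of OpenMPI or mpiT).
# Reads the output of `ompi_info --parsable --all` on stdin and reports
# whether that OpenMPI build was compiled with CUDA support.
cuda_support_from_ompi_info() {
    # A CUDA-aware build is expected to print a line ending in
    # "mpi_built_with_cuda_support:value:true".
    if grep -q 'mpi_built_with_cuda_support:value:true'; then
        echo "cuda-aware"
    else
        echo "not cuda-aware"
    fi
}

# Usage (requires OpenMPI's ompi_info on PATH):
#   ompi_info --parsable --all | cuda_support_from_ompi_info
```

If this reports "not cuda-aware", passing CudaStorage pointers straight to MPI send/receive calls would be expected to fail, which would be consistent with the segfault above.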

Sixin

sixin-zh, Jan 16 '16 14:01