mpiT
CUDA test failing
Out of the box, I'm seeing ptest.lua in asyncsgd fail when I set:
local usecuda = true
I get the following:
$ mpiexec -np 2 luajit ptest.lua
rank 1 is client.
rank 0 is server.
0 use cpu
1 use gpu 1
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 30740 RUNNING AT code13
= EXIT CODE: 139
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions
It looks like the storages referenced in the asynchronous sends/receives are actually CudaStorages, which point to GPU memory. Should they in fact be FloatStorages? If I hack each CudaTensor to have a corresponding FloatStorage, it seems to work.
Thanks.
MPI may not be installed correctly. You could try installing the latest Open MPI, configured with CUDA support:
./configure --prefix=$POME/exe/$MPI --with-cuda=$POME/exe/cuda
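To check whether an existing build already has CUDA support, something like the following may help. It is only a sketch and assumes Open MPI is the MPI in use (`ompi_info` ships with Open MPI, not with other implementations). The helper name `has_cuda_support` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: reads `ompi_info --parsable --all` output on stdin
# and succeeds only if the build reports CUDA support.
has_cuda_support() {
  grep -q 'mpi_built_with_cuda_support:value:true'
}

if command -v ompi_info >/dev/null 2>&1; then
  if ompi_info --parsable --all | has_cuda_support; then
    echo "Open MPI was built with CUDA support"
  else
    echo "Open MPI was built WITHOUT CUDA support"
  fi
else
  # Assumption: no ompi_info on PATH means Open MPI is not the active MPI.
  echo "ompi_info not found; is Open MPI installed and on PATH?"
fi
```

If this reports no CUDA support, passing GPU pointers (such as a CudaStorage) to MPI calls would be expected to crash, and rebuilding with --with-cuda is the next step.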
Sixin