matex
Machine Learning Toolkit for Extreme Scale (MaTEx)
I tried emailing [email protected] (for general info, not a bug report), but got a delivery failure.
I got the following errors: 2018-07-16 15:27:27.536541: W tensorflow/core/framework/op_kernel.cc:1192] Unknown: Exception: Message truncated, error stack: MPI_Allreduce(855)..................: MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x2049aaa00, count=256, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD) failed MPIR_Allreduce_impl(712)............: MPIR_Allreduce_intra(357)...........: MPIC_Sendrecv(186)..................: MPIDI_CH3U_Request_unpack_uebuf(599): Message truncated; 1536...
I tested the code from _/matex/src/deeplearning/tensorflow/examples/glibc_after_2.19/MNIST/tf_lenet3.py_ with command `python tf_lenet3.py` and I got an error: ``` Traceback (most recent call last): File "tf_lenet3.py", line 17, in mnist = tf.DataSet("MNIST", normalize=255.0)...
I'm confused about the MPI Allreduce operator. The paper said that MaTEx-TensorFlow uses allreduce ops to synchronize each layer across ranks. I think one AllReduce op is to reduce...
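For readers following the allreduce question above, here is a minimal sketch of what a per-layer allreduce does in general. It is not MaTEx-TensorFlow's implementation: the ranks are simulated with plain Python lists, the names `allreduce_sum` and `sync_layers` are hypothetical, and dividing by the number of ranks after the sum is an assumption about how the reduced values might be used.

```python
# Plain-Python stand-in for MPI_Allreduce with MPI_SUM (no MPI library needed).
# All names here are hypothetical and only illustrate the semantics.

def allreduce_sum(buffers):
    """Element-wise sum of one buffer per rank; every rank gets the same result."""
    lengths = {len(b) for b in buffers}
    if len(lengths) != 1:
        # Real MPI also expects every rank to pass the same element count.
        raise ValueError("all ranks must pass the same element count")
    total = [sum(vals) for vals in zip(*buffers)]
    return [list(total) for _ in buffers]


def sync_layers(per_rank_grads):
    """per_rank_grads[rank][layer] is a list of floats.

    Runs one allreduce per layer, then averages (the averaging is an assumption).
    """
    n_ranks = len(per_rank_grads)
    n_layers = len(per_rank_grads[0])
    synced = [[None] * n_layers for _ in range(n_ranks)]
    for layer in range(n_layers):
        reduced = allreduce_sum([per_rank_grads[r][layer] for r in range(n_ranks)])
        for r in range(n_ranks):
            synced[r][layer] = [v / n_ranks for v in reduced[r]]
    return synced


# Two simulated ranks, two layers each.
grads = [
    [[1.0, 2.0], [10.0]],  # rank 0
    [[3.0, 4.0], [20.0]],  # rank 1
]
result = sync_layers(grads)
```

After the call, both simulated ranks hold identical per-layer values, which is the point of using allreduce rather than a reduce to a single rank.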
1) We should make pnetcdf linking optional for folks who want to use CSV or other file formats.