Wei Zhang

Results: 17 comments by Wei Zhang

I did not see any output printed by libfabric, which makes me wonder whether Open MPI was compiled correctly with libfabric. How did you obtain Open MPI? Did you compile by...

It is possible that Open MPI was not configured or compiled with libfabric correctly. Because you have libfabric, I assume you used the EFA installer to install it. Can you try...

I see. Can you run the `ompi_info` command that is part of the Open MPI you are using, and paste the result?
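For anyone who wants to narrow the `ompi_info` output before pasting it, here is a minimal sketch, assuming `ompi_info` is on your `PATH` and belongs to the same Open MPI you launch jobs with. The helper names (`ofi_lines`, `main`) are hypothetical and only for illustration. It keeps the lines that mention `ofi` (the name Open MPI uses for its libfabric support) or `libfabric`.

```python
import re
import shutil
import subprocess

# Matches "ofi" as a whole word (e.g. in "--with-ofi=..." or "mtl: ofi")
# or the word "libfabric", case-insensitively.
_OFI_PATTERN = re.compile(r"\bofi\b|libfabric", re.IGNORECASE)


def ofi_lines(ompi_info_text):
    """Return the lines of ompi_info output that mention ofi or libfabric."""
    return [line.strip() for line in ompi_info_text.splitlines()
            if _OFI_PATTERN.search(line)]


def main():
    # Assumption: the ompi_info found on PATH is the one from the
    # Open MPI installation actually used to run the job.
    exe = shutil.which("ompi_info")
    if exe is None:
        print("ompi_info not found on PATH")
        return
    result = subprocess.run([exe], capture_output=True, text=True, check=False)
    matches = ofi_lines(result.stdout)
    if matches:
        print("\n".join(matches))
    else:
        print("no ofi/libfabric lines in ompi_info output")


if __name__ == "__main__":
    main()
```

If nothing is printed beyond the "no ofi/libfabric lines" message, that would be consistent with a build without libfabric support. The full `ompi_info` output is still the more useful thing to paste.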

Hi, I noticed that the Open MPI you are using is not configured with libfabric. ``` Configure command line: '--prefix=/public/software/.local/easybuild/software/OpenMPI/4.0.5-GCC-10.2.0' '--build=x86_64-pc-linux-gnu' '--host=x86_64-pc-linux-gnu' '--enable-mpirun-prefix-by-default' '--enable-shared' '--with-cuda=no' '--with-hwloc=/public/software/.local/easybuild/software/hwloc/2.2.0-GCCcore-10.2.0' '--with-libevent=/public/software/.local/easybuild/software/libevent/2.1.12-GCCcore-10.2.0' '--with-ofi=/public/software/.local/easybuild/software/libfabric/1.11.0-GCCcore-10.2.0' '--with-pmix=/public/software/.local/easybuild/software/PMIx/3.1.5-GCCcore-10.2.0' '--with-ucx=/public/software/.local/easybuild/software/UCX/1.9.0-GCCcore-10.2.0'...

> Thanks, can I install EFA to a shared disk by changing the script path?

No, it always installs Open MPI to `/opt/amazon`. Note that to use EFA you will need...

Hi, does the compute node (such as c-96-4-worker0002) have the EFA installer installed on it?

@bwbarrett According to @jdinan in this discussion https://github.com/aws/aws-ofi-nccl/pull/152, it seems that both CU_POINTER_ATTRIBUTE_SYNC_MEMOPS and flush are needed to ensure data consistency. I think it is reasonable to expect libfabric to...