
Lustreinfiniband

Open chadnar2 opened this issue 5 years ago • 1 comments

  • lustre-ipoib - An implementation of Lustre using IP over InfiniBand (IPoIB)
  • lustre-rdma - An implementation of Lustre using native Remote Direct Memory Access (RDMA)

lustre_rdma_nvmedrives: Files changed to enable InfiniBand functionality: lfsmaster.sh, lfsoss.sh, lfsclient.sh, lfsrepo.sh, lfspkgs.sh

Added to install a new Mellanox OFED (MOFED) for the Lustre kernel: installMOFED.sh

Added for correct drive placement on the OSSes: installdrives.sh. Note: installdrives.sh takes about 15 minutes to run, so please either remove this entry or wait it out.

Added to install the correct Lustre kernel: lustreinstall1.sh, lustreinstall2.sh

Added to pause after the MDS/OSS reboot: waitforreboot.sh

chadnar2 • Jun 05 '20 12:06

I have reduced the size of the headnode to a 'Standard_D8s_v3', since it has no InfiniBand connectivity with the Lustre servers anyway.

These are not HPC images, hence the need to install OFED. MOFED never worked for IB Lustre when we tried to use the HPC image. This may be something to look at.

.ssh is now required for the root user too, not just hpcuser. This is because of the InfiniBand addition. The sudo functionality for hpcuser did not work for me, but this was a while ago.

The waagent restart comes from another Lustre script.

What do you suggest to actively check that the Lustre kernel has been installed and the node is up? (ssh -q hpcuser@hostname exit ??)
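One possible shape for such a check, offered only as a sketch: poll the node over ssh until it answers and `uname -r` reports a Lustre kernel. The function name `wait_for_lustre_kernel`, the default timeout and interval, and the `SSH_USER` variable are all hypothetical and are not taken from the scripts in this change. It also assumes that the installed Lustre server kernel has "lustre" in its release string, which should be confirmed against the kernel packages actually used.

```shell
#!/bin/sh
# Hypothetical helper (not part of the existing scripts): wait until a node
# accepts ssh connections AND reports a kernel release containing "lustre".
# Assumption: the Lustre server kernel has "lustre" in its `uname -r` output.
wait_for_lustre_kernel() {
    local host="$1"
    local timeout="${2:-900}"    # total seconds to wait (assumed default)
    local interval="${3:-10}"    # seconds between attempts (assumed default)
    local user="${SSH_USER:-hpcuser}"
    local deadline kernel

    deadline=$(( $(date +%s) + timeout ))
    while [ "$(date +%s)" -lt "$deadline" ]; do
        # BatchMode avoids hanging on a password prompt; ConnectTimeout
        # keeps each attempt short while the node is still rebooting.
        if kernel=$(ssh -q -o BatchMode=yes -o ConnectTimeout=5 \
                        "${user}@${host}" uname -r 2>/dev/null); then
            case "$kernel" in
                *lustre*)
                    echo "${host} is up, running kernel ${kernel}"
                    return 0
                    ;;
            esac
        fi
        sleep "$interval"
    done

    echo "timed out waiting for ${host} to boot a Lustre kernel" >&2
    return 1
}

# Example call (hostname is a placeholder):
#   wait_for_lustre_kernel lustre-oss-0 900 10 || exit 1
```

Compared with a bare `ssh -q hpcuser@hostname exit`, this also distinguishes a node that came back on the stock kernel from one that booted the Lustre kernel, since the first case keeps polling until the timeout instead of reporting success.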

chadnar2 • Jul 14 '20 16:07