ompi
btl/ofi: fault tolerance
Tested on a Slingshot 11 cluster with synthetic failures (SIGTERM). I'm not sure how consistent the error code will be across different libfabric providers or types of faults. We could probably treat more than just FI_EIO as a lost rank, but the error code documentation is sparse.
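
For reviewers, here is a minimal sketch of the classification described above. It is an illustration, not the code in this PR: `ofi_err_means_peer_lost` is a hypothetical name, and the assumption is that the value passed in is the error code taken from a failed completion. The `FI_EIO` fallback only exists so the snippet builds without the libfabric headers; in libfabric's `fi_errno.h`, `FI_EIO` is defined as `EIO`.

```c
#include <errno.h>
#include <stdbool.h>

/* Fallback so the sketch compiles without <rdma/fi_errno.h>.
 * libfabric defines FI_EIO as EIO. */
#ifndef FI_EIO
#define FI_EIO EIO
#endif

/* Hypothetical helper (not an existing btl/ofi function).
 * Returns true when the error code should be treated as a lost rank.
 * Accepts either sign, since some libfabric calls return negated codes. */
bool ofi_err_means_peer_lost(int err)
{
    if (err < 0) {
        err = -err;
    }

    switch (err) {
    case FI_EIO:
        /* The only code observed for a killed peer in the testing above. */
        return true;
    default:
        /* Other codes could be added here once their meaning is confirmed
         * on other providers and fault types. */
        return false;
    }
}
```

Keeping the decision in a single switch means that widening the set of "lost rank" codes later is a one-line change per code.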