
btl/ofi: fault tolerance

Open Matthew-Whitlock opened this issue 2 months ago • 0 comments

Tested on a Slingshot 11 cluster with synthetic failures (SIGTERM). I'm not sure how consistent the error code will be across different libfabric backends or types of faults. It may be that we could treat more than just FI_EIO as a lost rank, but the error code documentation is a bit lacking.
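To make the question concrete, here is a rough sketch of the kind of classification being discussed. It is not the code from this change: `example_err_is_lost_peer` is a hypothetical helper name, and the only error code it treats as a lost rank is FI_EIO, which is the one observed in the Slingshot 11 tests. The `#ifndef` fallback mirrors libfabric's `fi_errno.h`, where FI_EIO is defined as the system `EIO`, and is there only so the snippet compiles without libfabric installed.

```c
#include <errno.h>
#include <stdbool.h>

/* libfabric's fi_errno.h defines FI_EIO as the system EIO. This fallback
 * exists only so the sketch builds without the libfabric headers. */
#ifndef FI_EIO
#define FI_EIO EIO
#endif

/* Hypothetical helper, not part of btl/ofi: decide whether an error code
 * should be treated as "the peer rank is gone". */
static bool example_err_is_lost_peer(int err)
{
    /* libfabric calls return negative codes, while completion queue error
     * entries carry positive ones, so normalize the sign first. */
    if (err < 0) {
        err = -err;
    }

    switch (err) {
    case FI_EIO:
        /* The only code seen so far for a SIGTERM'd peer on Slingshot 11. */
        return true;
    default:
        /* Other codes may also indicate a dead peer on other backends;
         * that is the open question, so they are left out here. */
        return false;
    }
}
```

If other backends turn out to report a dead peer differently, the extra codes would become additional `case` labels in the switch, which keeps the policy in one place.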

Matthew-Whitlock, Oct 09 '25 20:10