ompi
v4.1.x: ompi/coll/cuda: implement reduce_local
The coll/cuda component is missing a reduce_local implementation, which causes failures in IMB. This implementation piggybacks on the existing CUDA reduce implementation to stage and unstage the send and receive buffers.
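For reviewers unfamiliar with the stage/unstage pattern, here is a minimal host-only sketch of the idea, not the code in this PR. All names are hypothetical: `is_device_buffer` and `copy_buffer` stand in for the component's device-pointer check and CUDA memory copy (stubbed here with a constant and `memcpy` so the sketch builds without CUDA), and `host_sum` stands in for the underlying host reduction on `int` data.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a device-pointer check. It always answers
 * "yes" so that the staging path below is exercised. */
static int is_device_buffer(const void *buf)
{
    (void)buf;
    return 1;
}

/* Hypothetical stand-in for a host<->device copy; plain memcpy here. */
static void copy_buffer(void *dst, const void *src, size_t bytes)
{
    memcpy(dst, src, bytes);
}

typedef void (*reduce_fn)(const int *in, int *inout, size_t count);

/* Example host reduction: inout[i] = in[i] + inout[i]. */
static void host_sum(const int *in, int *inout, size_t count)
{
    for (size_t i = 0; i < count; ++i) {
        inout[i] += in[i];
    }
}

/* Stage both buffers to host memory if needed, run the host reduction,
 * then unstage the result into the caller's inout buffer.
 * Returns 0 on success, -1 on allocation failure. */
static int staged_reduce_local(const int *inbuf, int *inoutbuf,
                               size_t count, reduce_fn host_reduce)
{
    size_t bytes = count * sizeof(int);
    int *in_host = NULL;
    int *inout_host = NULL;
    const int *in = inbuf;
    int *inout = inoutbuf;
    int rc = -1;

    assert(host_reduce != NULL);
    if (count == 0) {
        return 0; /* nothing to reduce */
    }

    if (is_device_buffer(inbuf)) {
        in_host = malloc(bytes);
        if (in_host == NULL) goto cleanup;
        copy_buffer(in_host, inbuf, bytes);       /* stage input */
        in = in_host;
    }
    if (is_device_buffer(inoutbuf)) {
        inout_host = malloc(bytes);
        if (inout_host == NULL) goto cleanup;
        copy_buffer(inout_host, inoutbuf, bytes); /* stage in/out */
        inout = inout_host;
    }

    host_reduce(in, inout, count);

    if (inout_host != NULL) {
        copy_buffer(inoutbuf, inout_host, bytes); /* unstage result */
    }
    rc = 0;

cleanup:
    free(in_host);
    free(inout_host);
    return rc;
}
```

Only the in/out buffer is copied back, since the input buffer is read-only for the reduction.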
bot:notacherrypick