Michael Heinz
Josh, did you mean to open a Jira?
So, as a point of history, the OFI BTL was originally written by Intel as part of the OmniPath project. If the failure is in the OFI BTL it might be...
> Should Open MPI issue the flush?

So, that's kind of the question I'm struggling with. For PSM3, we've been assuming we were doing sufficient work to maintain consistency and...
> That comment does not match what UCX does, nor the CUDA documentation.

Which part?
@jdinan - thanks. You've made the whole thing so much clearer for me. Have you looked at https://github.com/aws/aws-ofi-nccl/pull/152? The reason I ask is that the NCCL maintainers are claiming problems...
> @mwheinz can this issue be closed?

No, these problems still exist in the OFI provider. Assigning it to myself since there doesn't seem to be anyone in particular maintaining...
I can't promise to get to this anytime soon but I've added it to my internal bug queue. It's low priority because it's been in the code without complaint since...
Unfortunately I no longer work for that company and I haven't worked on Open MPI since 2021.
ucs_debug_print_backtrace() means that the crash occurred inside the UCX library itself. I would really dislike UCX being the default because it already mistakes OPA hardware for Mellanox hardware and generates...
Okay, after rebuilding with the tip of the 4.1.x series, I'm no longer seeing UCX "force its way to the front".