Michael Heinz

Results: 10 comments by Michael Heinz

So, as a point of history, the OFI BTL was originally written by Intel as part of the OmniPath project. If the failure is in the OFI BTL it might be...

> Should Open MPI issue the flush?

So, that's kind of the question I'm struggling with. For PSM3, we've been assuming we were doing sufficient work to maintain consistency and...

> That comment does not match what UCX does, nor the CUDA documentation.

Which part?

@jdinan - thanks. You've made the whole thing so much clearer for me. Have you looked at https://github.com/aws/aws-ofi-nccl/pull/152? The reason I ask is that the NCCL maintainers are claiming problems...

> @mwheinz can this issue be closed?

No, these problems still exist in the OFI provider. Assigning it to myself since there doesn't seem to be anyone in particular maintaining...

I can't promise to get to this anytime soon, but I've added it to my internal bug queue. It's low priority because it's been in the code without complaint since...

Unfortunately, I no longer work for that company and I haven't worked on Open MPI since 2021.

ucs_debug_print_backtrace() means that the crash occurred inside the UCX library itself. I would really dislike UCX being the default because it already mistakes OPA hardware for Mellanox hardware and generates...

Okay - rebuilding with the tip of the 4.1.x series I'm not seeing UCX "force its way to the front" any more.