Hui Zhou
> Let me know if we should have another call to discuss. Yes, let's schedule another call.
Had an offline discussion, and here are the notes:
* Malleability - dropping a process - only happens after that process has called `MPI_Session_finalize` and exited. Because `MPI_Session_finalize` is a collective call,...
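As a sketch of the teardown path the note refers to, using the MPI-4 Sessions API (this minimal program is an illustration, not code from the PR; it needs an MPI-4 implementation and `mpicc` to build):

```c
/* Sketch: a process leaving via the Sessions model. MPI_Session_finalize
 * is collective over the session; only after it returns may the process
 * exit and be "dropped" from the job. */
#include <mpi.h>

int main(void)
{
    MPI_Session session;
    MPI_Session_init(MPI_INFO_NULL, MPI_ERRORS_ARE_FATAL, &session);
    /* ... derive groups/communicators from the session and do work ... */
    MPI_Session_finalize(&session);  /* collective; completes pending ops */
    return 0;                        /* process may now exit */
}
```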
This may not be the same. In testing PR #6510, we hit on `ch4-ucx-asan`:
```
not ok 1837 - ./pt2pt/rqfreeb 4
---
Directory: ./pt2pt
File: rqfreeb
Num-procs: 4
Timeout: 180...
```
Try enabling `MPIR_CVAR_DEBUG_SUMMARY=1` and running a basic MPI test program (e.g. `MPI_Init` followed by `MPI_Finalize`). It should report which provider is being used by default. If the default is `sockets`...
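A minimal sketch of that check (the program name `./init_finalize` and the `mpirun` invocation are assumptions for illustration; `MPIR_CVAR_DEBUG_SUMMARY` is the MPICH CVAR named above):

```shell
# Run a trivial MPI_Init/MPI_Finalize program with the debug summary enabled.
# The summary printed during init should name the default provider, e.g. a
# "libfabric provider: ..." line.
MPIR_CVAR_DEBUG_SUMMARY=1 mpirun -n 2 ./init_finalize
```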
> I assume this is saying we're using sockets?
>
> ```
> ==== Capability set configuration ====
> libfabric provider: sockets - 100.64.0.0/16
> ```

Yes
Tested the reproducer using the main branch on sunspot, running two processes on a single node, with some printf debugging:
```
[1] num_elements = 128000000 (max 256000000)
[0] num_elements = 128000000...
```
The second part of this issue is: "`MPI_Start` does not start communication". This stems from MPI not providing a strong progress guarantee. Strong progress means MPI...
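A hedged sketch of what weak progress implies for `MPI_Start`: the call may only post the operation, and the transfer then advances inside later MPI calls, so an application typically polls with `MPI_Test` (or blocks in `MPI_Wait`). This illustrative program is not from the issue; it assumes exactly 2 ranks and an MPI installation (`mpicc`):

```c
/* Sketch: under weak progress, MPI_Start may merely enqueue the transfer;
 * data movement happens during subsequent MPI calls. Run with 2 ranks. */
#include <mpi.h>
#include <stdio.h>

int main(void)
{
    MPI_Init(NULL, NULL);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int buf = rank;
    MPI_Request req;
    if (rank == 0)
        MPI_Send_init(&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
    else
        MPI_Recv_init(&buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &req);

    MPI_Start(&req);        /* may only post the operation */
    int done = 0;
    while (!done) {
        /* each MPI_Test call gives the library a chance to make progress */
        MPI_Test(&req, &done, MPI_STATUS_IGNORE);
        /* ... application work could overlap here ... */
    }
    MPI_Request_free(&req);

    if (rank == 1)
        printf("received %d\n", buf);
    MPI_Finalize();
    return 0;
}
```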
Inter-node communication depends on the actual path taken. Ideally it is routed to the native path -- CXI -- which performs RDMA asynchronously....
Will someone from Intel implement the functionality?
> @hzhou I am rebasing the CMA code for testing. Is the IPC cleanup commits still relevant?

The IPC cleanup is the main purpose of this PR. You can try...