ch4/ucx: refactoring AM to use GenQ for pack buffer allocation
Pull Request Description
The goal of this PR is to use GenQ for pack buffer allocation. This avoids the costly on-the-fly allocation of GPU-registered host buffers, which destroys short- and medium-size message latency. See #5171.
State of main
UCX's implementation of `MPIDI_NM_am_isend` always allocates the pack buffer using `MPL_gpu_malloc_host`
without checking whether the user buffer is on the GPU. This function affects PT2PT eager messages and all RMA
operations (regardless of data size) with "putting-like" operations (PUT, ACC, GET_ACC).
UCX's implementation of `MPIDI_NM_am_isend_reply` does check the user buffer location and only allocates a
GPU-registered pack buffer if the user buffer is on the GPU or the datatype is non-contiguous. But because the
GPU-registered pack buffer is allocated on demand, the extra cost of `cuda_malloc_host` is still paid. This function
affects PT2PT rndv and all RMA operations with "getting-like" operations (GET, GET_ACC).
Proposed solution
The proposed solution in this PR is to use GenQ to obtain preallocated pack buffers from a pool. This means adopting
a design similar to the OFI AM, with at least two modes: SHORT and PIPELINE. In both modes, a fixed-size pack
buffer is obtained from a GenQ pool. A deferred request issuing mechanism is used in case the pool is exhausted.
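The pool-plus-deferral idea can be sketched as follows. This is a minimal, self-contained C sketch; the names (`pool_get_cell`, `short_send`, `CELL_SIZE`, `num_deferred`) are hypothetical and stand in for the actual MPIDI GenQ API:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch of the SHORT-mode send path, not the actual
 * MPIDI GenQ API: a small pool of fixed-size preallocated cells,
 * with requests deferred when the pool is exhausted. */

#define NUM_CELLS 4
#define CELL_SIZE 1024          /* fixed-size pack buffer */

static char cells[NUM_CELLS][CELL_SIZE];
static int cell_free[NUM_CELLS] = {1, 1, 1, 1};
static int num_deferred = 0;

/* Pop a preallocated cell from the pool; NULL when exhausted. */
static void *pool_get_cell(void)
{
    for (int i = 0; i < NUM_CELLS; i++) {
        if (cell_free[i]) {
            cell_free[i] = 0;
            return cells[i];
        }
    }
    return NULL;
}

/* Release a cell; in real code this runs in the send-completion
 * callback. */
static void pool_put_cell(void *cell)
{
    for (int i = 0; i < NUM_CELLS; i++) {
        if (cells[i] == cell)
            cell_free[i] = 1;
    }
}

/* SHORT-mode send: pack into a pool cell if one is available,
 * otherwise defer the request for the progress loop to retry.
 * Returns 0 when issued, 1 when deferred. */
static int short_send(const void *buf, size_t len)
{
    void *pack_buf = pool_get_cell();
    if (!pack_buf) {
        num_deferred++;   /* enqueue on a deferred list in real code */
        return 1;
    }
    memcpy(pack_buf, buf, len);
    /* ... issue ucp_am_send_nb(pack_buf, len) here; the cell is
     * returned via pool_put_cell() on completion ... */
    return 0;
}
```

The key property is that an exhausted pool never blocks: the request is parked and retried later, so the fast path stays allocation-free.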
Commits in this PR
1-2: prerequisite and cleanup commits. They preserve the existing implementation and name it BULK mode (all data is packed into one large pack buffer and sent together).
3: Introduce the new SHORT mode, where user data smaller than a predefined short-message threshold is sent using a GenQ pack buffer. ucx_am_impl.h is created to host this code for clarity.
4: Refactor ucx_am.h to use the SHORT mode. In this PR, BULK mode is still used for messages larger than the short-message threshold.
5: Introduce PIPELINE mode.
6: Refactor ucx_am.h to use PIPELINE mode for large messages. At this point BULK mode is no longer used, but I recognize it could be beneficial in cases where fewer send operations are desired (e.g. large RMA messages). I am working on getting the performance numbers for this right now.
7: Minor warning fix.
Author Checklist
- [x] Provide Description
Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
- [ ] Commits Follow Good Practice
Commits are self-contained and do not do two things at once.
Commit message is of the form:
module: short description
Commit message explains what's in the commit.
- [ ] Passes All Tests
Whitespace checker. Warnings test. Additional tests via comments.
- [ ] Contribution Agreement For non-Argonne authors, check the contribution agreement. If necessary, request an explicit comment from your company's PR approval manager.
test:mpich/ch4/gpu/ucx test:mpich/ch4/ucx
@raffenet I believe the problem in #5224 is fixed as part of this PR.
I'm trying to find the part of the pipeline implementation that allocates and sends as many GenQ buffers as it can before deferring the rest of the operation, but I'm not finding it. Where should I be looking? It'll help me to understand the whole flow.
There is no explicit loop that "issues as many as possible" in this function. We rely on the progress loop to call this function (inside ucx_progress.h) to issue the remaining segments.
In fact, I have not studied the performance benefit of issuing multiple segments. I probably should have done that. My premise is that performance is mostly receiver-side bound. Also, I think we would not want multi-segment issuing when the progress engine handles the deferred request---each progress call would take longer and could delay other operations that rely on the progress loop.
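The one-segment-per-progress-call behavior described above can be sketched as follows. This is a hypothetical, self-contained C model (`pipeline_req`, `pipeline_progress`, `SEG_SIZE` are illustrative names, not the actual implementation):

```c
#include <stddef.h>

/* Hypothetical sketch of PIPELINE-mode progress: the progress loop
 * calls this function repeatedly, and each call issues at most one
 * fixed-size segment. Names are illustrative. */

#define SEG_SIZE 16384   /* pipeline chunk size, e.g. 16KB */

typedef struct {
    size_t total;        /* total payload bytes */
    size_t offset;       /* bytes issued so far */
} pipeline_req;

static int segments_sent = 0;

/* Issue at most one segment; returns 1 when the request is complete,
 * 0 when more progress calls are needed. */
static int pipeline_progress(pipeline_req *req)
{
    if (req->offset >= req->total)
        return 1;
    size_t len = req->total - req->offset;
    if (len > SEG_SIZE)
        len = SEG_SIZE;
    /* ... obtain a GenQ cell, pack [offset, offset + len), and call
     * ucp_am_send_nb() here ... */
    segments_sent++;
    req->offset += len;
    return req->offset >= req->total;
}
```

Returning after one segment keeps each progress call cheap, at the cost of relying on the progress loop's call frequency for pipeline throughput.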
looking into the failures now
I assume there will be some performance study before merging. This PR adds a lot of code, and it needs to be justified by performance improvement.
Preliminary results on JLSE with CPU buffers.
Before:
# Size Latency (us)
0 0.41
1 0.43
2 0.42
4 0.41
8 0.41
16 0.40
32 0.41
64 0.49
128 0.52
256 0.56
512 0.59
1024 0.84
2048 1.06
4096 1.47
8192 2.97
16384 5.96
32768 10.08
65536 19.59
131072 43.60
262144 189.66
524288 396.53
1048576 837.85
2097152 1737.17
4194304 2934.53
After
# Size Latency (us)
0 0.45
1 0.48
2 0.48
4 0.48
8 0.48
16 0.48
32 0.48
64 0.55
128 0.52
256 0.56
512 0.60
1024 0.86
2048 1.07
4096 1.48
8192 3.20
16384 4.53
32768 6.97
65536 11.39
131072 20.18
262144 37.11
524288 70.17
1048576 136.10
2097152 267.61
4194304 528.40
Latency variation below 16KB is mainly due to CPU frequency variations; we have lost the ability to maintain a stable CPU frequency on JLSE.
Before:
# Size Bandwidth (MB/s)
1 2.89
2 5.26
4 11.47
8 22.54
16 44.63
32 91.53
64 176.40
128 331.60
256 683.80
512 1289.53
1024 1924.66
2048 3267.87
4096 4647.92
8192 4374.52
16384 3819.45
32768 3233.99
65536 4668.79
131072 3043.34
262144 1419.56
524288 1166.48
1048576 945.37
2097152 838.92
4194304 987.11
After
# Size Bandwidth (MB/s)
1 2.78
2 5.66
4 11.57
8 22.66
16 45.14
32 92.41
64 176.85
128 353.00
256 687.25
512 1296.30
1024 1944.00
2048 3279.46
4096 4670.01
8192 4052.86
16384 4877.33
32768 5434.75
65536 6126.21
131072 7980.89
262144 8008.65
524288 7976.70
1048576 8013.85
2097152 8052.43
4194304 8071.09
The numbers here are with a CPU buffer and MPICH built without GPU support. The "before" case sends all data in one ucp_am_send_nb(), i.e. BULK mode. The "after" case sends small messages (<16KB) in one ucp_am_send_nb() and breaks a large message into 16KB chunks sent with multiple ucp_am_send_nb() calls.
The performance difference is mainly due to the receiver-side memcpy that we use to ensure alignment.
I have confirmed that if the memcpy is removed, there is no difference between BULK and PIPELINE. While removing the memcpy works for this particular test run, it cannot simply be removed because UCP does not guarantee the alignment of incoming data on the receiver side.
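The receiver-side alignment handling amounts to something like the following. This is a hypothetical C sketch (the staging buffer, `align_payload`, and `MAX_MSG_ALIGNMENT` are illustrative names), shown only to make the trade-off concrete:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch of the receiver-side alignment fix-up: since
 * UCP does not guarantee the alignment of incoming AM data, copy the
 * payload to an aligned staging buffer only when the pointer UCP
 * hands us is misaligned. */

#define MAX_MSG_ALIGNMENT 64

static _Alignas(MAX_MSG_ALIGNMENT) char staging[65536];
static int copies = 0;

/* Return an aligned view of the payload, paying a memcpy only when
 * necessary. */
static const void *align_payload(const void *data, size_t len)
{
    if (((uintptr_t) data % MAX_MSG_ALIGNMENT) == 0)
        return data;            /* already aligned: zero-copy */
    memcpy(staging, data, len); /* misaligned: pay the memcpy */
    copies++;
    return staging;
}
```

In the benchmark above, this memcpy is what separates PIPELINE from BULK at large sizes; skipping it when the data happens to arrive aligned would recover most of that cost, but the misaligned case cannot be ruled out.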
My suggestion is that we just remove BULK mode for now---the UCX netmod would only do SHORT and PIPELINE modes for AM.
What do you think? @hzhou @raffenet
test:mpich/ch4/ucx
test:mpich/ch4/gpu/ucx
The numbers are hard to deny. I think the direction is good. I'm still curious about sending more pipeline messages at a time rather than returning after each ucp_am_send_nb(), but clearly this is an improvement already in its current form.
test:mpich/ch4/gpu/ucx test:mpich/ch4/ucx
I think the conclusion is that we need to use UCX's GPU direct, potentially with a dedicated IPC AM protocol path, or wait until UCX AM adds that feature. With that path, the benefit of the GenQ pack buffer would be bypassed. Thus, I am leaning toward keeping it simple and not adopting this PR.
@hzhou It's OK to delay. I need to do some more performance studies. Making this a draft for now.
Can we do a smaller change where we get a GenQ buffer if the message fits in it and allocate one on the fly for everything else?
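That smaller change could be sketched as a hybrid allocator along these lines. This is a hypothetical, self-contained C sketch (`get_pack_buf`, pool sizes, and the malloc fallback standing in for `MPL_gpu_malloc_host` are all illustrative):

```c
#include <stdlib.h>
#include <stddef.h>

/* Hypothetical hybrid pack-buffer allocator: take a preallocated
 * pool cell when the message fits and a cell is free, otherwise fall
 * back to on-demand allocation (standing in for MPL_gpu_malloc_host). */

#define CELL_SIZE 1024
#define NUM_CELLS 2

static char pool[NUM_CELLS][CELL_SIZE];
static int pool_used[NUM_CELLS];

/* On return, *from_pool tells the caller how to release the buffer:
 * mark the cell free (1) versus free() it (0). */
static void *get_pack_buf(size_t len, int *from_pool)
{
    if (len <= CELL_SIZE) {
        for (int i = 0; i < NUM_CELLS; i++) {
            if (!pool_used[i]) {
                pool_used[i] = 1;
                *from_pool = 1;
                return pool[i];
            }
        }
    }
    *from_pool = 0;
    return malloc(len);   /* on-demand fallback */
}
```

This keeps the common small-message path allocation-free while avoiding the deferral machinery for oversized or overflow cases.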
Ultimately I think we should really consider adding rndv support to all AM types at the ch4 level, rather than making each netmod/shmmod figure it out on its own.
There was resistance to AM rndv in the initial ch4 design phase, but the more time passes, the more I think it was a mistake. There should be some pretty obvious benefits to coordinating long active-message RMA at both the origin and target with knowledge of the operation.