ch4/ucx: refactoring AM to use GenQ for pack buffer allocation
Pull Request Description
The goal of this PR is to use GenQ for pack buffer allocation. This avoids the costly on-the-fly allocation of GPU-registered host buffers, which destroys short- and medium-size message latency. See #5171.
State of main
UCX's implementation of `MPIDI_NM_am_isend` always allocates the pack buffer using `MPL_gpu_malloc_host`
without checking whether the user buffer is on the GPU. This function affects PT2PT eager messages and all RMA
operations (regardless of data size) with "putting-like" operations (PUT, ACC, GET_ACC).
UCX's implementation of `MPIDI_NM_am_isend_reply` does check the user buffer location and only allocates a
GPU-registered pack buffer if the user buffer is on the GPU or the datatype is non-contiguous. But because the
GPU-registered pack buffer is allocated on demand, the extra cost of `cuda_malloc_host` is still paid. This function
affects PT2PT rndv and all RMA operations with "getting-like" operations (GET, GET_ACC).
Proposed solution
The proposed solution in this PR is to use GenQ to obtain preallocated pack buffers from a pool. This means adopting
a design similar to the OFI AM, with at least two modes: SHORT and PIPELINE. In both modes, a fixed-size pack
buffer is obtained from a GenQ pool. A deferred request issuing mechanism is used in case the pool is exhausted.
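The pool-plus-deferral idea can be sketched as follows. This is a minimal, self-contained C sketch; the names (`pool_get_cell`, `short_send`, `CELL_SIZE`, `num_deferred`) are hypothetical and stand in for the actual MPIDI GenQ API:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch of the SHORT-mode send path, not the actual
 * MPIDI GenQ API: a small pool of fixed-size preallocated cells,
 * with requests deferred when the pool is exhausted. */

#define NUM_CELLS 4
#define CELL_SIZE 1024          /* fixed-size pack buffer */

static char cells[NUM_CELLS][CELL_SIZE];
static int cell_free[NUM_CELLS] = {1, 1, 1, 1};
static int num_deferred = 0;

/* Pop a preallocated cell from the pool; NULL when exhausted. */
static void *pool_get_cell(void)
{
    for (int i = 0; i < NUM_CELLS; i++) {
        if (cell_free[i]) {
            cell_free[i] = 0;
            return cells[i];
        }
    }
    return NULL;
}

/* Release a cell; in real code this runs in the send-completion
 * callback. */
static void pool_put_cell(void *cell)
{
    for (int i = 0; i < NUM_CELLS; i++) {
        if (cells[i] == cell)
            cell_free[i] = 1;
    }
}

/* SHORT-mode send: pack into a pool cell if one is available,
 * otherwise defer the request for the progress loop to retry.
 * Returns 0 when issued, 1 when deferred. */
static int short_send(const void *buf, size_t len)
{
    void *pack_buf = pool_get_cell();
    if (!pack_buf) {
        num_deferred++;   /* enqueue on a deferred list in real code */
        return 1;
    }
    memcpy(pack_buf, buf, len);
    /* ... issue ucp_am_send_nb(pack_buf, len) here; the cell is
     * returned via pool_put_cell() on completion ... */
    return 0;
}
```

The key property is that an exhausted pool never blocks: the request is parked and retried later, so the fast path stays allocation-free.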
Commits in this PR
1-2: prerequisite and cleanup commits. They preserve the existing implementation and name it BULK mode (all data is packed into one large pack buffer and sent together).
3: Introduce the new SHORT mode, where user data smaller than a predefined short-message threshold is sent using a GenQ pack buffer. ucx_am_impl.h is created to host this code for clarity.
4: Refactor ucx_am.h to use the SHORT mode. In this PR, BULK mode is still used for messages larger than the short-message threshold.
5: Introduce PIPELINE mode.
6: Refactor ucx_am.h to use PIPELINE mode for large messages. At this point BULK mode is no longer used, but I recognize it could be beneficial in cases where fewer send operations are desired (e.g. large RMA messages). I am working on getting the performance numbers for this right now.
7: Minor warning fix.
Author Checklist
- [x] Provide Description
Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
- [ ] Commits Follow Good Practice
Commits are self-contained and do not do two things at once.
Commit message is of the form:
module: short description
Commit message explains what's in the commit.
- [ ] Passes All Tests
Whitespace checker. Warnings test. Additional tests via comments.
- [ ] Contribution Agreement For non-Argonne authors, check the contribution agreement. If necessary, request an explicit comment from your company's PR approval manager.
test:mpich/ch4/gpu/ucx test:mpich/ch4/ucx
@raffenet I believe the problem in #5224 is fixed as part of this PR.
I'm trying to find the part of the pipeline implementation that allocates and sends as many GenQ buffers as it can before deferring the rest of the operation, but I'm not finding it. Where should I be looking? It'll help me to understand the whole flow.
There is no explicit loop that "issues as many as possible" in this function. We rely on the progress loop to call this function (inside ucx_progress.h) to issue the remaining segments.
In fact, I have not studied the performance benefit of issuing multiple segments. I probably should have done that. My premise is that performance is mostly receiver-side bound. Also, I think we would not want multi-segment issuing when the progress engine handles the deferred request---each progress call would take longer and could delay other operations that rely on the progress loop.
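The one-segment-per-progress-call behavior described above can be sketched as follows. This is a hypothetical, self-contained C model (`pipeline_req`, `pipeline_progress`, `SEG_SIZE` are illustrative names, not the actual implementation):

```c
#include <stddef.h>

/* Hypothetical sketch of PIPELINE-mode progress: the progress loop
 * calls this function repeatedly, and each call issues at most one
 * fixed-size segment. Names are illustrative. */

#define SEG_SIZE 16384   /* pipeline chunk size, e.g. 16KB */

typedef struct {
    size_t total;        /* total payload bytes */
    size_t offset;       /* bytes issued so far */
} pipeline_req;

static int segments_sent = 0;

/* Issue at most one segment; returns 1 when the request is complete,
 * 0 when more progress calls are needed. */
static int pipeline_progress(pipeline_req *req)
{
    if (req->offset >= req->total)
        return 1;
    size_t len = req->total - req->offset;
    if (len > SEG_SIZE)
        len = SEG_SIZE;
    /* ... obtain a GenQ cell, pack [offset, offset + len), and call
     * ucp_am_send_nb() here ... */
    segments_sent++;
    req->offset += len;
    return req->offset >= req->total;
}
```

Returning after one segment keeps each progress call cheap, at the cost of relying on the progress loop's call frequency for pipeline throughput.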
looking into the failures now
I assume there will be some performance study before merging. This PR adds a lot of code, and it needs to be justified by performance improvement.
Preliminary results on JLSE with CPU buffers.
Before:
# Size Latency (us)
0 0.41
1 0.43
2 0.42
4 0.41
8 0.41
16 0.40
32 0.41
64 0.49
128 0.52
256 0.56
512 0.59
1024 0.84
2048 1.06
4096 1.47
8192 2.97
16384 5.96
32768 10.08
65536 19.59
131072 43.60
262144 189.66
524288 396.53
1048576 837.85
2097152 1737.17
4194304 2934.53
After
# Size Latency (us)
0 0.45
1 0.48
2 0.48
4 0.48
8 0.48
16 0.48
32 0.48
64 0.55
128 0.52
256 0.56
512 0.60
1024 0.86
2048 1.07
4096 1.48
8192 3.20
16384 4.53
32768 6.97
65536 11.39
131072 20.18
262144 37.11
524288 70.17
1048576 136.10
2097152 267.61
4194304 528.40
Latency variation below 16KB is mainly due to CPU frequency variations; we have lost the ability to maintain a stable CPU frequency on JLSE.
Before:
# Size Bandwidth (MB/s)
1 2.89
2 5.26
4 11.47
8 22.54
16 44.63
32 91.53
64 176.40
128 331.60
256 683.80
512 1289.53
1024 1924.66
2048 3267.87
4096 4647.92
8192 4374.52
16384 3819.45
32768 3233.99
65536 4668.79
131072 3043.34
262144 1419.56
524288 1166.48
1048576 945.37
2097152 838.92
4194304 987.11
After
# Size Bandwidth (MB/s)
1 2.78
2 5.66
4 11.57
8 22.66
16 45.14
32 92.41
64 176.85
128 353.00
256 687.25
512 1296.30
1024 1944.00
2048 3279.46
4096 4670.01
8192 4052.86
16384 4877.33
32768 5434.75
65536 6126.21
131072 7980.89
262144 8008.65
524288 7976.70
1048576 8013.85
2097152 8052.43
4194304 8071.09
The numbers here are with a CPU buffer and MPICH built without GPU support. The "before" case sends all data in one ucp_am_send_nb(), i.e. BULK mode. The "after" case sends small messages (<16KB) in one ucp_am_send_nb() and breaks a large message into 16KB chunks sent with multiple ucp_am_send_nb() calls.
The performance difference is mainly due to the receiver-side memcpy that we use to ensure alignment.
I have confirmed that if the memcpy is removed, there is no difference between BULK and PIPELINE. While removing the memcpy works for this particular test run, it cannot simply be removed because UCP does not guarantee the alignment of incoming data on the receiver side.
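The receiver-side alignment handling amounts to something like the following. This is a hypothetical C sketch (the staging buffer, `align_payload`, and `MAX_MSG_ALIGNMENT` are illustrative names), shown only to make the trade-off concrete:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch of the receiver-side alignment fix-up: since
 * UCP does not guarantee the alignment of incoming AM data, copy the
 * payload to an aligned staging buffer only when the pointer UCP
 * hands us is misaligned. */

#define MAX_MSG_ALIGNMENT 64

static _Alignas(MAX_MSG_ALIGNMENT) char staging[65536];
static int copies = 0;

/* Return an aligned view of the payload, paying a memcpy only when
 * necessary. */
static const void *align_payload(const void *data, size_t len)
{
    if (((uintptr_t) data % MAX_MSG_ALIGNMENT) == 0)
        return data;            /* already aligned: zero-copy */
    memcpy(staging, data, len); /* misaligned: pay the memcpy */
    copies++;
    return staging;
}
```

In the benchmark above, this memcpy is what separates PIPELINE from BULK at large sizes; skipping it when the data happens to arrive aligned would recover most of that cost, but the misaligned case cannot be ruled out.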
My suggestion is that we just remove BULK mode for now---the UCX netmod would only do SHORT and PIPELINE modes for AM.
What do you think? @hzhou @raffenet
test:mpich/ch4/ucx
test:mpich/ch4/gpu/ucx
The numbers are hard to deny. I think the direction is good. I'm still curious about sending more pipeline messages at a time rather than returning after each ucp_am_send_nb(), but clearly this is an improvement already in its current form.
test:mpich/ch4/gpu/ucx test:mpich/ch4/ucx
I think the conclusion is that we need to use UCX's GPU direct, potentially with a dedicated IPC AM protocol path, or wait until UCX AM adds that feature. With that path, the benefit of the GenQ pack buffer would be bypassed. Thus, I am leaning toward keeping it simple and not adopting this PR.
@hzhou It's OK to delay. I need to do some more performance studies. Making this a draft for now.
Can we do a smaller change where we get a GenQ buffer if the message fits in it and allocate one on the fly for everything else?
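That smaller change could be sketched as a hybrid allocator along these lines. This is a hypothetical, self-contained C sketch (`get_pack_buf`, pool sizes, and the malloc fallback standing in for `MPL_gpu_malloc_host` are all illustrative):

```c
#include <stdlib.h>
#include <stddef.h>

/* Hypothetical hybrid pack-buffer allocator: take a preallocated
 * pool cell when the message fits and a cell is free, otherwise fall
 * back to on-demand allocation (standing in for MPL_gpu_malloc_host). */

#define CELL_SIZE 1024
#define NUM_CELLS 2

static char pool[NUM_CELLS][CELL_SIZE];
static int pool_used[NUM_CELLS];

/* On return, *from_pool tells the caller how to release the buffer:
 * mark the cell free (1) versus free() it (0). */
static void *get_pack_buf(size_t len, int *from_pool)
{
    if (len <= CELL_SIZE) {
        for (int i = 0; i < NUM_CELLS; i++) {
            if (!pool_used[i]) {
                pool_used[i] = 1;
                *from_pool = 1;
                return pool[i];
            }
        }
    }
    *from_pool = 0;
    return malloc(len);   /* on-demand fallback */
}
```

This keeps the common small-message path allocation-free while avoiding the deferral machinery for oversized or overflow cases.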
Ultimately I think we should really consider adding rndv support to all AM types at the ch4 level, rather than making each netmod/shmmod figure it out on its own.
There was resistance to AM rndv in the initial ch4 design phase, but the more time passes, the more I think it was a mistake. There should be some pretty obvious benefits to coordinating long active-message RMA at both the origin and target with knowledge of the operation.