NA OFI: support vector RMA operations
Is your feature request related to a problem? Please describe.
The na_ofi.c code already uses libfabric API calls that support vectors for local and remote buffers (fi_writemsg() and fi_readmsg()), but it hard-codes the vector lengths to 1 (https://github.com/mercury-hpc/mercury/blob/06bf87e4403dd1e487b53204569e05296e6a5061/src/na/na_ofi.c#L4171).
Describe the solution you'd like
It would be better if it translated vectors from the bulk handle into native vectors for libfabric, as is done in the na_sm.c code here: https://github.com/mercury-hpc/mercury/blob/06bf87e4403dd1e487b53204569e05296e6a5061/src/na/na_sm.c#L3733
We need to confirm whether the relevant libfabric providers actually support this capability; if not, we will need to retain a fallback path for providers that don't (and add that to the flags checked in NA_OFI_PROV_TYPES so we can quickly toggle it if those providers gain support later).
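A rough sketch of what that translation could look like on the libfabric side (the helper name is hypothetical and this is not the na_ofi.c implementation; it assumes the NA layer has already produced per-segment iovec/fi_rma_iov arrays from the bulk handle and that the provider's iov_limit/rma_iov_limit allow the given count):
#include <sys/uio.h>
#include <rdma/fabric.h>
#include <rdma/fi_endpoint.h>
#include <rdma/fi_rma.h>
/* Hypothetical helper: issue one vectored RMA write instead of looping
 * over segments one at a time. local[], remote[], desc[] and count are
 * assumed to come from the NA/bulk layer. */
static ssize_t
na_ofi_put_vec(struct fid_ep *ep, fi_addr_t dest_addr, void **desc,
    const struct iovec *local, const struct fi_rma_iov *remote,
    size_t count, void *context)
{
    struct fi_msg_rma msg = {
        .msg_iov = local,       /* local segments from the bulk handle */
        .desc = desc,           /* one MR descriptor per local segment */
        .iov_count = count,     /* instead of the hard-coded 1 */
        .addr = dest_addr,
        .rma_iov = remote,      /* remote segments (addr/len/key) */
        .rma_iov_count = count,
        .context = context,
        .data = 0,
    };
    return fi_writemsg(ep, &msg, FI_COMPLETION);
}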
Additional context
From a quick scan of libfabric it looks like gni and psm2 likely support this. Verbs may not, but we need to confirm. There are numerous FI_E2BIG return codes in the verbs provider, but I'm not sure which code paths are relevant here when using rxm.
For some more context, here is code in MPICH that is issuing vector libfabric operations when needed:
https://github.com/pmodels/mpich/blob/master/src/mpid/ch4/netmod/ofi/ofi_rma.h#L514
I don't see guards around the provider type there, though it's possible that I'm missing something.
Yes, I had looked at that in the past, but it was not supported; see https://github.com/ofiwg/libfabric/issues/3062. We should have a look at it again and maybe add a query call so we can fall back for providers that do not support it.
I didn't think to check the memory registration path; I just looked at what some of the codes are doing on the actual read/write RMA path. I guess we'll be testing both with this change :)
MPICH appears to be setting msg.desc to NULL:
https://github.com/pmodels/mpich/blob/master/src/mpid/ch4/netmod/ofi/ofi_rma.h#L480
I don't know the implications, just pointing that out.
I considered this a long time ago. Vector RMA should be supported by OFI, but providers impose limits on the local iov_count and remote rma_iov_count (fi_info::tx_attr::iov_limit / ::rma_iov_limit), and different providers have different limits, which can be queried with fi_getinfo(). Just FYI.
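For reference, a minimal sketch of querying those limits via fi_getinfo() (not from the mercury codebase; the provider string "verbs;ofi_rxm" and the API version are just examples):
#include <stdio.h>
#include <string.h>
#include <rdma/fabric.h>
/* Print the vector limits a provider advertises: the per-operation
 * iov/rma_iov limits and the memory-registration iov limit. */
static void
print_iov_limits(void)
{
    struct fi_info *hints = fi_allocinfo();
    struct fi_info *info = NULL;

    if (hints == NULL)
        return;
    hints->fabric_attr->prov_name = strdup("verbs;ofi_rxm");

    if (fi_getinfo(FI_VERSION(1, 9), NULL, NULL, 0, hints, &info) == 0 && info) {
        printf("iov_limit:     %zu\n", info->tx_attr->iov_limit);
        printf("rma_iov_limit: %zu\n", info->tx_attr->rma_iov_limit);
        printf("mr_iov_limit:  %zu\n", info->domain_attr->mr_iov_limit);
        fi_freeinfo(info);
    }
    fi_freeinfo(hints);
}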
@carns All providers set mr_iov_limit and rma_iov_limit to 1. I am still doing some more testing, and we'll support it internally now, but I would keep hopes very low at this point :\
@carns I was looking in more detail at the MPICH code that you had sent. I think it's still limited though, since they also always set iov_count and rma_iov_count to 1.
Ah, Ok. Thanks for the updates.
Supporting code was added as part of https://github.com/mercury-hpc/mercury/commit/a6bbce8514011e21e5ef1b37283a17606a0d39cb but there are issues remaining with libfabric that will need to be further investigated.
What are the "issues remaining with libfabric"? Phil pointed me to this post https://lists.openfabrics.org/pipermail/libfabric-users/2021-June/000861.html documenting the FI_VERBS_TX_IOV_LIMIT and FI_VERBS_RX_IOV_LIMIT environment variables and the ibv_devinfo -v utility. On Summit, it looks like we should set those environment variables to 30.
But no impact on the vectored transfer benchmark.
Before:
<op> <warmup_seconds> <concurrency> <threads> <xfer_size> <vector_count> <total_bytes> <seconds> <MiB/s> <xfer_memory> <align_buffer>
PULL 1 1 0 524288 1 209230757888 20.000072 9976.864320 0
PULL 1 1 0 524288 2 209364451328 20.000095 9983.227515 0
PULL 1 1 0 524288 4 210315509760 20.000066 10028.592124 0
PULL 1 1 0 524288 8 210331238400 20.000078 10029.335904 0
PULL 1 1 0 524288 16 200076689408 20.000072 9540.365654 0
PULL 1 1 0 524288 32 192866156544 20.000097 9196.530270 0
PULL 1 1 0 524288 64 107745378304 20.000236 5137.639306 0
PULL 1 1 0 524288 128 53734277120 20.000313 2562.209957 0
PULL 1 1 0 524288 256 27163885568 20.000662 1295.232153 0
After:
PULL 1 1 0 524288 2 203238670336 20.000096 9691.128674 0
PULL 1 1 0 524288 4 202780442624 20.000068 9669.291918 0
PULL 1 1 0 524288 8 202621583360 20.000085 9661.708882 0
PULL 1 1 0 524288 16 193659404288 20.000131 9234.339565 0
PULL 1 1 0 524288 32 185765724160 20.000126 8857.944351 0
PULL 1 1 0 524288 64 112774348800 20.000202 5377.445768 0
PULL 1 1 0 524288 128 56699125760 20.000352 2703.577398 0
PULL 1 1 0 524288 256 28120711168 20.000367 1340.875384 0
I would need to look at it again, but I had added support for it; you can turn it on by uncommenting this line: https://github.com/mercury-hpc/mercury/blob/master/src/na/na_ofi.c#L196. If I recall correctly, there were some issues with OFI not transferring things correctly. I don't think I have tried those environment variables though.
No change uncommenting that define:
PULL 1 1 0 524288 1 188446408704 20.000071 8985.793186 0
PULL 1 1 0 524288 2 189848879104 20.000109 9052.650574 0
PULL 1 1 0 524288 4 192025198592 20.000094 9156.432103 0
PULL 1 1 0 524288 8 192173572096 20.000107 9163.500843 0
PULL 1 1 0 524288 16 183590453248 20.000067 8754.245780 0
PULL 1 1 0 524288 32 178335514624 20.000155 8503.634210 0
PULL 1 1 0 524288 64 108015386624 20.000248 5150.511207 0
PULL 1 1 0 524288 128 54678519808 20.000225 2607.245722 0
PULL 1 1 0 524288 256 26795311104 20.000391 1277.675006 0
Here's a graph of the three cases, just to make it more clear:
a-ha! I stuck some debugging into hg_bulk_create: it tells me
==== bulk: count: 64 max_segments: 1
fi_info -v gives me this:
domain: 0x0
name: mlx5_1
threading: FI_THREAD_SAFE
control_progress: FI_PROGRESS_AUTO
data_progress: FI_PROGRESS_AUTO
resource_mgmt: FI_RM_ENABLED
av_type: FI_AV_UNSPEC
mr_mode: [ FI_MR_LOCAL, FI_MR_VIRT_ADDR, FI_MR_ALLOCATED, FI_MR_PROV_KEY ]
mr_key_size: 4
cq_data_size: 0
cq_cnt: 16777216
ep_cnt: 262144
tx_ctx_cnt: 1024
rx_ctx_cnt: 1024
max_ep_tx_ctx: 1024
max_ep_rx_ctx: 1024
max_ep_stx_ctx: 0
max_ep_srx_ctx: 8388608
cntr_cnt: 0
mr_iov_limit: 1
That mr_iov_limit is what I need to adjust; mercury consults that field:
priv->iov_max = priv->domain->fi_prov->domain_attr->mr_iov_limit;
Setting the env changes the iov_limit, but the mr_iov_limit is hard-coded to 1 (see the VERBS_MR_IOV_LIMIT define in fi_verbs.h).
Dang, I really thought I had something by changing that define in libfabric, recompiling, and trying again; however, mercury still tells me max_segments is 1. fi_info tells me mr_iov_limit is now 30... so why is mercury insisting it is 1?
Ah ok, from what I see it looks like it's because of the rxm provider. When you're running fi_info, are you specifying verbs;ofi_rxm, or only verbs?
I'm not calling the fi_info() routine. I'm calling the fi_info utility, which I think by default dumps all the providers it can find?
The rxm angle is interesting, but I don't see "1" in there. I do see
prov/rxm/src/rxm.h:#define RXM_IOV_LIMIT 4
so I'd expect a "sticky" 4, but I don't know where the 1 comes from.
I figured "what the heck" and hard-coded 30 in mercury's na_ofi_initialize:
diff --git a/src/na/na_ofi.c b/src/na/na_ofi.c
index f3e6e47d..1152f26b 100644
--- a/src/na/na_ofi.c
+++ b/src/na/na_ofi.c
@@ -3928,7 +3928,9 @@ na_ofi_initialize(na_class_t *na_class, const struct na_info *na_info,
#endif
/* Cache IOV max */
- priv->iov_max = priv->domain->fi_prov->domain_attr->mr_iov_limit;
+ //XXX: giant hack. set to 30 and see what happens
+ //priv->iov_max = priv->domain->fi_prov->domain_attr->mr_iov_limit;
+ priv->iov_max = 30;
/* Create endpoint */
ret = na_ofi_endpoint_open(priv->domain, node_ptr, src_addr, src_addrlen,
Now vector lengths of 1 are ok, and 32 and higher are ok, but vector lengths of 2-16 give me HG_INVALID_ARG.
@roblatham00 I don't think I was implying that you were calling the fi_info() routine :) There are options that you can pass to the fi_info utility, and one of them is -p, which lets you specify which provider you're querying. I was suggesting that you pass verbs\;ofi_rxm, since otherwise yes, it would just dump everything. You're also confusing the MR IOV limit and the regular IOV limit, I think. The MR IOV limit is the maximum number of buffers that we can pass to memory registration calls at once, i.e. the maximum number of buffers that may be associated with a single memory region. This is basically what happens when you make a call to HG_Bulk_create() with N buffers: we register these N buffers, if possible, so that they appear as a single virtual region. The regular IOV limit, on the other hand, defines the maximum number of buffers that can be passed at once to fi_writev() or fi_readv() operations. We currently do not have a mode in mercury that allows passing multiple buffers at once when doing a bulk transfer if they are not part of the same virtual region. I believe I would need to add another query call, plus possibly additional NA callbacks (the current NA put/get take only a single MR region), and pass an array of buffers and an array of memory regions. Let me look into this.
Thanks for the libfabric information. You are 100% correct that I am confused by libfabric domains, providers, etc.
Let me be more concrete: I'm using Phil's benchmark to measure this: https://github.com/mochi-hpc-experiments/mochi-tests/blob/main/perf-regression/margo-p2p-vector.c#L243
margo_bulk_create immediately calls HG_Bulk_create(). This is a single virtual region, right? We are just pointing to different parts of it?
Right, that code is creating a single registration to represent a vector of memory regions. That entire registration (as a whole, including all vectors) is then transmitted with a bulk transfer operation.
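For concreteness, that path boils down to roughly the following sketch (the wrapper name is illustrative, and the use of HG_BULK_READWRITE is just an example flag):
#include <mercury_bulk.h>
/* Roughly what margo_bulk_create()/HG_Bulk_create() are asked to do in the
 * benchmark: one handle covering `count` separate local buffers, which the
 * NA plugin then tries to register as a single virtual region. */
static hg_return_t
create_vector_bulk(hg_class_t *hg_class, void **bufs, hg_size_t *sizes,
    hg_uint32_t count, hg_bulk_t *handle)
{
    return HG_Bulk_create(hg_class, count, bufs, sizes, HG_BULK_READWRITE,
        handle);
}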
Going by @soumagne's explanation, that means this benchmark is reliant on the mr_iov_limit rather than the iov_limit (thanks for the explanation on that; I was wondering what the exact difference was).
Mercury doesn't have an API to pick out pieces within a contiguous memory registration, or for that matter to construct a vector from different registrations. That could be a neat capability, though. MPICH, for example, uses this to stitch pre-registered header structs onto data payloads. You could also imagine it being used for 2-phase style aggregation, where one party doesn't know anything except the aggregate size, while the other parties fill in specific portions using vector operations. Right now we couldn't use vectors for that.
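As a hedged illustration of that MPICH-style pattern at the libfabric level (names are hypothetical; it assumes the header and payload were registered separately and the remote target is one contiguous region):
#include <sys/uio.h>
#include <rdma/fabric.h>
#include <rdma/fi_endpoint.h>
#include <rdma/fi_rma.h>
/* Stitch a separately registered header onto a payload in one RMA write:
 * two local iov entries, each with its own MR descriptor, targeting a
 * single contiguous remote region. */
static ssize_t
write_header_plus_payload(struct fid_ep *ep, fi_addr_t dest,
    void *hdr, size_t hdr_len, void *hdr_desc,
    void *payload, size_t payload_len, void *payload_desc,
    uint64_t remote_addr, uint64_t remote_key, void *context)
{
    struct iovec iov[2] = {
        { .iov_base = hdr, .iov_len = hdr_len },
        { .iov_base = payload, .iov_len = payload_len },
    };
    void *desc[2] = { hdr_desc, payload_desc };
    struct fi_rma_iov rma = {
        .addr = remote_addr, .len = hdr_len + payload_len, .key = remote_key,
    };
    struct fi_msg_rma msg = {
        .msg_iov = iov, .desc = desc, .iov_count = 2, .addr = dest,
        .rma_iov = &rma, .rma_iov_count = 1, .context = context, .data = 0,
    };
    return fi_writemsg(ep, &msg, FI_COMPLETION);
}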
I think at the very least, the benchmark could still be relevant if the virtual selection made internally results in several segments being transferred at once (internally registered as separate regions). I think the first step is to enable that and add support for it within the NA layer, and then we can see about adding new vector-style APIs to HG Bulk itself.