requests: provide a way for users to query max number of requests
It seems MPICH now has a limit of approximately 2^17 (131072) requests. I know I've used a million in the past, and Open MPI supports at least that many.
Where are such limits documented? It seems to be limited by src/include/mpir_request.h (excerpt below), but I can't derive ~2^17 from it.
/* Handle Bits - 2+4+6+8+12 - Type, Kind, Pool_idx, Block_idx, Object_idx */
#define REQUEST_POOL_MASK 0x03f00000
#define REQUEST_POOL_SHIFT 20
#define REQUEST_POOL_MAX 64
#define REQUEST_BLOCK_MASK 0x000ff000
#define REQUEST_BLOCK_SHIFT 12
#define REQUEST_BLOCK_MAX 256
#define REQUEST_OBJECT_MASK 0x00000fff
#define REQUEST_OBJECT_SHIFT 0
#define REQUEST_OBJECT_MAX 4096
#define REQUEST_NUM_BLOCKS 256
#define REQUEST_NUM_INDICES 1024
#define MPIR_REQUEST_NUM_POOLS REQUEST_POOL_MAX
$ mpicc.mpich bug.c -o bug.mpich && ./bug.mpich 2>&1 | tail -n20
Isend+Irecv on 131066
Isend+Irecv on 131067
Isend+Irecv on 131068
Isend+Irecv on 131069
Isend+Irecv on 131070
Isend+Irecv on 131071
Isend+Irecv on 131072
Isend+Irecv on 131073
Isend+Irecv on 131074
Isend+Irecv on 131075
Isend+Irecv on 131076
Assertion failed in file ./src/include/mpir_request.h at line 446: req != NULL
/lib/x86_64-linux-gnu/libmpich.so.12(+0x223cdf) [0x7f6718285cdf]
/lib/x86_64-linux-gnu/libmpich.so.12(+0x3138d) [0x7f671809338d]
/lib/x86_64-linux-gnu/libmpich.so.12(MPI_Isend+0x953) [0x7f671815d693]
./bug.mpich(+0x1580) [0x55ebb9b69580]
/lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7f6717e63d90]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x7f6717e63e40]
./bug.mpich(+0x1245) [0x55ebb9b69245]
Abort(1) on node 0: Internal error
MPI_Get_library_version = MPICH Version: 4.0
MPICH Release date: Fri Jan 21 10:42:29 CST 2022
MPICH ABI: 14:0:2
MPICH Device: ch4:ofi
MPICH configure: --build=x86_64-linux-gnu --prefix=/usr --includedir=${prefix}/include --mandir=${prefix}/share/man --infodir=${prefix}/share/info --sysconfdir=/etc --localstatedir=/var --disable-option-checking --disable-silent-rules --libdir=${prefix}/lib/x86_64-linux-gnu --runstatedir=/run --disable-maintainer-mode --disable-dependency-tracking --with-libfabric=/usr --with-slurm=/usr --with-device=ch4:ofi --with-pm=hydra --with-hwloc-prefix=/usr --with-wrapper-dl-type=none --enable-shared --without-yaksa --prefix=/usr --enable-fortran=all --disable-rpath --disable-wrapper-rpath --sysconfdir=/etc/mpich --libdir=/usr/lib/x86_64-linux-gnu --includedir=/usr/include/x86_64-linux-gnu/mpich --docdir=/usr/share/doc/mpich CPPFLAGS= CFLAGS= CXXFLAGS= FFLAGS=-O2 -ffile-prefix-map=/build/mpich-0xgrG5/mpich-4.0=. -flto=auto -ffat-lto-objects -flto=auto -ffat-lto-objects -fstack-protector-strong -fallow-invalid-boz -fallow-argument-mismatch FCFLAGS=-O2 -ffile-prefix-map=/build/mpich-0xgrG5/mpich-4.0=. -flto=auto -ffat-lto-objects -flto=auto -ffat-lto-objects -fstack-protector-strong -fallow-invalid-boz -fallow-argument-mismatch BASH_SHELL=/bin/bash
MPICH CC: gcc -g -O2 -ffile-prefix-map=/build/mpich-0xgrG5/mpich-4.0=. -flto=auto -ffat-lto-objects -flto=auto -ffat-lto-objects -fstack-protector-strong -Wformat -Werror=format-security -O2
MPICH CXX: g++ -g -O2 -ffile-prefix-map=/build/mpich-0xgrG5/mpich-4.0=. -flto=auto -ffat-lto-objects -flto=auto -ffat-lto-objects -fstack-protector-strong -Wformat -Werror=format-security -O2
MPICH F77: gfortran -O2 -ffile-prefix-map=/build/mpich-0xgrG5/mpich-4.0=. -flto=auto -ffat-lto-objects -flto=auto -ffat-lto-objects -fstack-protector-strong -fallow-invalid-boz -fallow-argument-mismatch -g -O2 -ffile-prefix-map=/build/mpich-0xgrG5/mpich-4.0=. -flto=auto -ffat-lto-objects -flto=auto -ffat-lto-objects -fstack-protector-strong -cpp -fallow-invalid-boz -fallow-argument-mismatch -O2
MPICH FC: gfortran -O2 -ffile-prefix-map=/build/mpich-0xgrG5/mpich-4.0=. -flto=auto -ffat-lto-objects -flto=auto -ffat-lto-objects -fstack-protector-strong -fallow-invalid-boz -fallow-argument-mismatch -g -O2 -ffile-prefix-map=/build/mpich-0xgrG5/mpich-4.0=. -flto=auto -ffat-lto-objects -flto=auto -ffat-lto-objects -fstack-protector-strong -cpp -fallow-invalid-boz -fallow-argument-mismatch -O2
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <mpi.h>

int main(int argc, char* argv[])
{
    MPI_Init(&argc, &argv);

    int me, np;
    MPI_Comm_rank(MPI_COMM_WORLD, &me);
    MPI_Comm_size(MPI_COMM_WORLD, &np);
    printf("I am %d of %d\n", me, np);
    fflush(0);
    MPI_Barrier(MPI_COMM_WORLD);

    char version[MPI_MAX_LIBRARY_VERSION_STRING];
    int len;
    MPI_Get_library_version(version, &len);
    if (me == 0) printf("MPI_Get_library_version = %s\n", version);
    fflush(0);

    if (1)
    {
        MPI_Barrier(MPI_COMM_WORLD);
        if (me == 0) printf("Isend+Irecv+Waitall\n");
        fflush(0);

        const int n = 1000000;
        int * buffer = malloc(2 * n * sizeof(int));
        for (int i = 0; i < n; i++) {
            buffer[i]     = i;     // send
            buffer[n + i] = -1000; // recv
        }
        MPI_Request * r = malloc(2 * n * sizeof(MPI_Request));
        for (int i = 0; i < n; i++) {
            printf("Isend+Irecv on %d\n", i);
            MPI_Isend(&buffer[i],     1, MPI_INT, me, 99, MPI_COMM_WORLD, &r[i]);
            MPI_Irecv(&buffer[n + i], 1, MPI_INT, me, 99, MPI_COMM_WORLD, &r[n + i]);
        }
        for (int i = 0; i < 2 * n; i++) {
            printf("Wait on %d\n", i);
            MPI_Wait(&r[i], MPI_STATUS_IGNORE); // MPI_Wait takes a single status
        }
        //MPI_Waitall(2*n, r, MPI_STATUSES_IGNORE);
        free(buffer);
        free(r);
    }
    fflush(0);
    MPI_Barrier(MPI_COMM_WORLD);
    if (me == 0) printf("all done\n");
    return MPI_Finalize();
}
As you can see, we took 6 bits for the per-VCI request pools in order to get good MPI_THREAD_MULTIPLE performance. Let me assess the possibility of making the number of pool bits configurable.
I am less interested in configuring it and more interested in being able to figure out what the limit is, so I don't write bad tests. I know I can't do it exactly like MPI_TAG_UB but something like, perhaps in MPI_T, would work.
To explain why it is 2^17 --
#define REQUEST_NUM_BLOCKS 256
#define REQUEST_NUM_INDICES 1024
So each pool holds 256 x 1024 = 2^18 requests. Your loop posts both an Isend and an Irecv, two requests per iteration, so it runs out after roughly 2^17 iterations.
There is room to define REQUEST_NUM_INDICES as 4096, which would give about 1 million (2^20) requests per VCI. Historically we use 1024-sized blocks to prevent a heavy delay when allocating new blocks.
I am closing this issue as it is not clear what we need to do. Feel free to add a comment or re-open if necessary.
@hzhou @jeffhammond Would you please clarify what precisely the limit on requests is? Is it the cumulative number of requests, or the total number of active requests at any one time? Or something else? My program hit this assertion while using MPICH, and I would like to understand it better so that I can resolve the issue in my code.
@markmcclure It is the total number of requests in use at any one time. This may be more than the number of requests visible to the users since there are internal requests used by the library.
Thank you!
@hzhou the specific request I would make on this issue is for MPICH to provide a way to query the maximum number of live requests permitted at any one time. This could be an implementation-defined attribute on MPI_COMM_WORLD (like MPI_WTIME_IS_GLOBAL) or something from MPI_T. It is not a high priority, but as we can now see, I am not the only person to encounter this limit.
I suppose I could argue for knowing the limits on other things too, such as communicators. Many folks over the years have run into the issue of creating too many MPI_Comm handles, although most of those use cases can be accused of being bad applications 😄
To close the loop on my comment - the program had MPI_Isend and MPI_Irecv requests that were not being closed out with an MPI_Wait. The issue usually went unnoticed, but if the program ran for an unusually long time, the number of 'open' requests would grow ever larger, eventually reach this limit, and trigger the assertion. The solution was to close out the requests with MPI_Wait after they completed.
We are trying to debug a problem that showed up between MPICH 3.x and 4.x. We are seeing the assert shown in @jeffhammond's test code. We don't see a corresponding problem using Open MPI 5.x. Is there some way of querying the total number of live requests at any point in the code execution? Given the assert, it looks like the problem is that a request is being generated without being cleared by a corresponding wait, but we have been unable to identify where this is happening. Being able to keep tabs on the number of outstanding requests would help track this down.
We can instrument the code and build a custom MPICH so it can, for example, print the number of live requests and the remaining capacity; would that be useful for you? Because the request objects are performance-sensitive, we can't build that as a production feature. But if it is desirable for debugging purposes, we can build it as a configurable debugging feature.
That sounds like it would work. Could we trigger this output from a function call or would this be something that prints out every time you open or close a request?
Can't you do that by writing PMPI interposition of the functions that produce and consume requests? https://github.com/LLNL/wrap can autogenerate them. That's going to be a lot easier than modifying MPICH.
That is the tricky part. We do not desire to maintain too many non-standard MPI extensions. Thus I am tempted to implement it as a log message that one can enable disable with CVAR. Of course, there will be a flood of logs.
@jeffhammond There are internal requests that can't be captured with PMPI interface.
Just responding to loop @ajaypanyala and @edoapra into this conversation.
On a related note, can you suggest a change in going from 3.4.x to 4.x that might account for this error showing up?
From 3.4.x to 4.x, the default config changed from --with-device=ch3 to --with-device=ch4. In particular, the ch4 device enables per-VCI request pools, which reduced the total number of requests available in each pool.
What is the default number of VCIs used by the ch4 config in 4.x? If we set it to 1, can we expect the same behavior we used to get with 3.4.x? Thanks!
In order to support per-VCI request pools, we took 6 bits away from the request handle, essentially shrinking the total number of requests to 1/64 of what it was. Let me think about making that configurable. I think we can definitely make it a build option, and potentially a runtime (init-time) option. Stay tuned.
Right now we have a parameter in GA that controls the number of outstanding requests. It is set to 256 (per MPI process). Is this too large?
@bjpalmer I have seen this error less likely to occur when I set COMEX_MAX_NB_OUTSTANDING=3
https://github.com/GlobalArrays/ga/blob/develop/comex/src-mpi-pr/comex_impl.h
Does the error still occur at 3 or does it go away completely? There is also a separate parameter NB_MAX_NUM_NB_HDLS in the GA layer that can be set to lower the number of outstanding GA handles. It's located in https://github.com/GlobalArrays/ga/blob/develop/global/src/nbutil.c.
Let me repeat the test by lowering NB_MAX_NUM_NB_HDLS and keeping COMEX_MAX_NB_OUTSTANDING=3
@hzhou Could you add the aurora tag for tracking purposes?
I'll add the label to the PR.
PR #7181 doesn't directly address this issue by providing the query, but it raises the maximum number of requests substantially (to 2^25). Hopefully that is beyond any application's need.