
requests: provide a way for users to query max number of requests

Open jeffhammond opened this issue 3 years ago • 23 comments

It seems MPICH has a limit of approximately 2^17 (131072) requests now. I know I've used a million in the past, and Open-MPI supports at least that.

Where are such limits documented? It seems to be limited by src/include/mpir_request.h (excerpt below) but I can't derive ~2^17 from this.

/* Handle Bits - 2+4+6+8+12 - Type, Kind, Pool_idx, Block_idx, Object_idx */
#define REQUEST_POOL_MASK    0x03f00000
#define REQUEST_POOL_SHIFT   20
#define REQUEST_POOL_MAX     64
#define REQUEST_BLOCK_MASK   0x000ff000
#define REQUEST_BLOCK_SHIFT  12
#define REQUEST_BLOCK_MAX    256
#define REQUEST_OBJECT_MASK  0x00000fff
#define REQUEST_OBJECT_SHIFT 0
#define REQUEST_OBJECT_MAX   4096

#define REQUEST_NUM_BLOCKS   256
#define REQUEST_NUM_INDICES  1024

#define MPIR_REQUEST_NUM_POOLS REQUEST_POOL_MAX
$ mpicc.mpich  bug.c -o bug.mpich && ./bug.mpich 2>&1 | tail -n20
Isend+Irecv on 131066
Isend+Irecv on 131067
Isend+Irecv on 131068
Isend+Irecv on 131069
Isend+Irecv on 131070
Isend+Irecv on 131071
Isend+Irecv on 131072
Isend+Irecv on 131073
Isend+Irecv on 131074
Isend+Irecv on 131075
Isend+Irecv on 131076
Assertion failed in file ./src/include/mpir_request.h at line 446: req != NULL
/lib/x86_64-linux-gnu/libmpich.so.12(+0x223cdf) [0x7f6718285cdf]
/lib/x86_64-linux-gnu/libmpich.so.12(+0x3138d) [0x7f671809338d]
/lib/x86_64-linux-gnu/libmpich.so.12(MPI_Isend+0x953) [0x7f671815d693]
./bug.mpich(+0x1580) [0x55ebb9b69580]
/lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7f6717e63d90]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x7f6717e63e40]
./bug.mpich(+0x1245) [0x55ebb9b69245]
Abort(1) on node 0: Internal error
MPI_Get_library_version = MPICH Version:	4.0
MPICH Release date:	Fri Jan 21 10:42:29 CST 2022
MPICH ABI:	14:0:2
MPICH Device:	ch4:ofi
MPICH configure:	--build=x86_64-linux-gnu --prefix=/usr --includedir=${prefix}/include --mandir=${prefix}/share/man --infodir=${prefix}/share/info --sysconfdir=/etc --localstatedir=/var --disable-option-checking --disable-silent-rules --libdir=${prefix}/lib/x86_64-linux-gnu --runstatedir=/run --disable-maintainer-mode --disable-dependency-tracking --with-libfabric=/usr --with-slurm=/usr --with-device=ch4:ofi --with-pm=hydra --with-hwloc-prefix=/usr --with-wrapper-dl-type=none --enable-shared --without-yaksa --prefix=/usr --enable-fortran=all --disable-rpath --disable-wrapper-rpath --sysconfdir=/etc/mpich --libdir=/usr/lib/x86_64-linux-gnu --includedir=/usr/include/x86_64-linux-gnu/mpich --docdir=/usr/share/doc/mpich CPPFLAGS= CFLAGS= CXXFLAGS= FFLAGS=-O2 -ffile-prefix-map=/build/mpich-0xgrG5/mpich-4.0=. -flto=auto -ffat-lto-objects -flto=auto -ffat-lto-objects -fstack-protector-strong -fallow-invalid-boz -fallow-argument-mismatch FCFLAGS=-O2 -ffile-prefix-map=/build/mpich-0xgrG5/mpich-4.0=. -flto=auto -ffat-lto-objects -flto=auto -ffat-lto-objects -fstack-protector-strong -fallow-invalid-boz -fallow-argument-mismatch BASH_SHELL=/bin/bash
MPICH CC:	gcc -g -O2 -ffile-prefix-map=/build/mpich-0xgrG5/mpich-4.0=. -flto=auto -ffat-lto-objects -flto=auto -ffat-lto-objects -fstack-protector-strong -Wformat -Werror=format-security -O2
MPICH CXX:	g++ -g -O2 -ffile-prefix-map=/build/mpich-0xgrG5/mpich-4.0=. -flto=auto -ffat-lto-objects -flto=auto -ffat-lto-objects -fstack-protector-strong -Wformat -Werror=format-security -O2
MPICH F77:	gfortran -O2 -ffile-prefix-map=/build/mpich-0xgrG5/mpich-4.0=. -flto=auto -ffat-lto-objects -flto=auto -ffat-lto-objects -fstack-protector-strong -fallow-invalid-boz -fallow-argument-mismatch -g -O2 -ffile-prefix-map=/build/mpich-0xgrG5/mpich-4.0=. -flto=auto -ffat-lto-objects -flto=auto -ffat-lto-objects -fstack-protector-strong -cpp -fallow-invalid-boz -fallow-argument-mismatch -O2
MPICH FC:	gfortran -O2 -ffile-prefix-map=/build/mpich-0xgrG5/mpich-4.0=. -flto=auto -ffat-lto-objects -flto=auto -ffat-lto-objects -fstack-protector-strong -fallow-invalid-boz -fallow-argument-mismatch -g -O2 -ffile-prefix-map=/build/mpich-0xgrG5/mpich-4.0=. -flto=auto -ffat-lto-objects -flto=auto -ffat-lto-objects -fstack-protector-strong -cpp -fallow-invalid-boz -fallow-argument-mismatch -O2
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <mpi.h>

int main(int argc, char* argv[])
{
    int rc;
    rc = MPI_Init(&argc,&argv);

    int me, np;
    MPI_Comm_rank(MPI_COMM_WORLD,&me);
    MPI_Comm_size(MPI_COMM_WORLD,&np);
    printf("I am %d of %d\n", me, np);
    fflush(0);
    MPI_Barrier(MPI_COMM_WORLD);


    char version[MPI_MAX_LIBRARY_VERSION_STRING];
    int len;
    MPI_Get_library_version(version, &len);
    if (me==0) printf("MPI_Get_library_version = %s\n", version);
    fflush(0);

    if (1)
    {
        MPI_Barrier(MPI_COMM_WORLD);
        fflush(0);
        if (me==0) printf("Isend+Irecv+Waitall\n");
        fflush(0);

        const int n = 1000000;
        int * buffer = malloc(2 * n * sizeof(int));
        for (int i=0; i<n; i++) {
            buffer[i]   =  i;    // send
            buffer[n+i] = -1000; // recv
        }
        MPI_Request * r = malloc(2 * n * sizeof(MPI_Request));

        for (int i=0; i<n; i++) {
            printf("Isend+Irecv on %d\n", i);
            MPI_Isend(&buffer[i], 1, MPI_INT, me, 99, MPI_COMM_WORLD, &r[i]);
            MPI_Irecv(&buffer[n+i], 1, MPI_INT, me, 99, MPI_COMM_WORLD, &r[n+i]);
        }
        for (int i=0; i<2*n; i++) {
            printf("Wait on %d\n", i);
            MPI_Wait(&r[i], MPI_STATUS_IGNORE);
        }
        //MPI_Waitall(2*n,r,MPI_STATUSES_IGNORE);

        free(buffer);
        free(r);
    }

    fflush(0);
    MPI_Barrier(MPI_COMM_WORLD);
    if (me==0) printf("all done\n");

    rc = MPI_Finalize();

    return rc;
}

jeffhammond avatar Feb 07 '23 08:02 jeffhammond

As you can see, we took 6 bits for the multi-VCI request pools in order to get MPI_THREAD_MULTIPLE performance. Let me assess the possibility of making the number of pool bits configurable.

hzhou avatar Feb 07 '23 14:02 hzhou

I am less interested in configuring it and more interested in being able to find out what the limit is, so I don't write bad tests. I know it can't be done exactly like MPI_TAG_UB, but something along those lines, perhaps in MPI_T, would work.

jeffhammond avatar Feb 07 '23 15:02 jeffhammond
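No request-limit control variable exists in MPICH today, but the MPI_T query pattern being suggested would look roughly like this sketch, which enumerates the available cvars and greps for request-related ones (the `"REQUEST"` substring filter is just an illustrative guess at a naming convention):

```c
#include <stdio.h>
#include <string.h>
#include <mpi.h>

int main(void)
{
    int provided, num;
    MPI_T_init_thread(MPI_THREAD_SINGLE, &provided);
    MPI_T_cvar_get_num(&num);

    for (int i = 0; i < num; i++) {
        char name[256], desc[1024];
        int namelen = sizeof(name), desclen = sizeof(desc);
        int verbosity, bind, scope;
        MPI_Datatype datatype;
        MPI_T_enum enumtype;

        if (MPI_T_cvar_get_info(i, name, &namelen, &verbosity, &datatype,
                                &enumtype, desc, &desclen, &bind,
                                &scope) != MPI_SUCCESS)
            continue;

        /* Print any cvar whose name mentions requests. */
        if (strstr(name, "REQUEST") != NULL)
            printf("%s: %s\n", name, desc);
    }
    MPI_T_finalize();
    return 0;
}
```

The MPI_T tools interface can be used without MPI_Init, which is why this program never initializes MPI proper.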

To explain why it is 2^17 --

#define REQUEST_NUM_BLOCKS   256
#define REQUEST_NUM_INDICES  1024

So that is 256 × 1024 = 2^18 requests per pool. You post an Isend + Irecv pair per iteration, so you exhaust the pool after roughly 2^17 iterations.

There is room to raise REQUEST_NUM_INDICES to 4096, which would give roughly 1 million requests per VCI. Historically we have used 1024-entry blocks to avoid a heavy delay when allocating new blocks.

hzhou avatar Mar 03 '23 03:03 hzhou

I am closing this issue as it is not clear what we need to do. Feel free to add a comment or re-open if necessary.

hzhou avatar Mar 03 '23 03:03 hzhou

@hzhou @jeffhammond Would you please clarify what precisely the limit on requests is? Is it the cumulative number of requests, the total number of active requests at any one time, or something else? My program encountered this assertion while using MPICH, and I would like to understand it better so that I can resolve the issue in my code.

markmcclure avatar Mar 22 '24 03:03 markmcclure

@markmcclure It is the total number of requests in use at any one time. This may be more than the number of requests visible to the users since there are internal requests used by the library.

hzhou avatar Mar 22 '24 03:03 hzhou

Thank you!

markmcclure avatar Mar 22 '24 03:03 markmcclure

@hzhou the specific request I would make on this issue is for MPICH to provide a way to query the maximum number of live requests permitted at any one time. This could be an implementation-defined attribute on MPI_COMM_WORLD (like MPI_WTIME_IS_GLOBAL) or something from MPI_T. It is not a high priority, but as we can now see, I am not the only person to encounter this limit.

I suppose I could argue for knowing the limits on other things too, such as communicators. Many folks over the years have run into the issue of creating too many MPI_Comm handles, although most of those use cases can be accused of being bad applications 😄

jeffhammond avatar Mar 22 '24 07:03 jeffhammond

To close the loop on my comment - the program had MPI_Isend and MPI_Irecv requests that were not being closed out with an MPI_Wait. This issue was usually not noticed, but if the program ran for an unusually long time, then the number of 'open' requests would grow ever larger and eventually reach this limit and trigger the assertion. The solution was to close out the requests with MPI_Wait after they were completed.

markmcclure avatar Mar 22 '24 20:03 markmcclure

We are trying to debug a problem that showed up between MPICH 3.x and 4.x. We are seeing the assert shown in @jeffhammond's test code. We don't see a corresponding problem using Open MPI 5.x. Is there some way of querying the total number of live requests at any point in the code execution? Given the assert, it looks like the problem is that a request is being generated without being cleared by a corresponding wait, but we have been unable to identify where this is happening. Being able to keep tabs on the number of outstanding requests would help track this down.

bjpalmer avatar Oct 02 '24 19:10 bjpalmer

We can instrument the code and build a custom MPICH so that it, for example, prints the number of live requests and the remaining capacity. Would that be useful for you? Because the request objects are performance-sensitive, we can't build that as a production feature, but if it is desirable for debugging purposes, we can build it as a configurable debugging feature.

hzhou avatar Oct 04 '24 02:10 hzhou

That sounds like it would work. Could we trigger this output from a function call or would this be something that prints out every time you open or close a request?

bjpalmer avatar Oct 04 '24 17:10 bjpalmer

Can't you do that by writing PMPI interpositions of the functions that produce and consume requests? https://github.com/LLNL/wrap can autogenerate them. That's going to be a lot easier than modifying MPICH.

jeffhammond avatar Oct 04 '24 17:10 jeffhammond
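A PMPI interposer along those lines might look like the following minimal sketch. It only intercepts Isend/Irecv/Wait/Waitall, so other request-producing calls (MPI_Issend, MPI_Ibcast, persistent requests, ...) and, as noted below, MPICH-internal requests are not counted; it is also not thread-safe as written:

```c
/* pmpi_count.c: compile and link ahead of the application objects, e.g.
 *   mpicc app.c pmpi_count.c -o app
 * to keep a running count of user-visible live requests. */
#include <stdio.h>
#include <mpi.h>

static long live_requests = 0;

static void report(const char *where)
{
    fprintf(stderr, "[reqcount] %s: %ld live requests\n", where, live_requests);
}

int MPI_Isend(const void *buf, int count, MPI_Datatype datatype, int dest,
              int tag, MPI_Comm comm, MPI_Request *request)
{
    int rc = PMPI_Isend(buf, count, datatype, dest, tag, comm, request);
    if (rc == MPI_SUCCESS) { live_requests++; report("Isend"); }
    return rc;
}

int MPI_Irecv(void *buf, int count, MPI_Datatype datatype, int source,
              int tag, MPI_Comm comm, MPI_Request *request)
{
    int rc = PMPI_Irecv(buf, count, datatype, source, tag, comm, request);
    if (rc == MPI_SUCCESS) { live_requests++; report("Irecv"); }
    return rc;
}

int MPI_Wait(MPI_Request *request, MPI_Status *status)
{
    int rc = PMPI_Wait(request, status);
    if (rc == MPI_SUCCESS) { live_requests--; report("Wait"); }
    return rc;
}

int MPI_Waitall(int count, MPI_Request requests[], MPI_Status statuses[])
{
    int rc = PMPI_Waitall(count, requests, statuses);
    if (rc == MPI_SUCCESS) { live_requests -= count; report("Waitall"); }
    return rc;
}
```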

That is the tricky part. We prefer not to maintain too many non-standard MPI extensions, so I am tempted to implement it as a log message that one can enable/disable with a CVAR. Of course, there will be a flood of logs.

@jeffhammond There are internal requests that can't be captured with PMPI interface.

hzhou avatar Oct 04 '24 17:10 hzhou

Just responding to loop @ajaypanyala and @edoapra into this conversation.

bjpalmer avatar Oct 04 '24 17:10 bjpalmer

On a related note, can you suggest a change in going from 3.4.x to 4.x that might account for this error showing up?

bjpalmer avatar Oct 08 '24 16:10 bjpalmer

On a related note, can you suggest a change in going from 3.4.x to 4.x that might account for this error showing up?

From 3.4.x to 4.x, the default config changed from --with-device=ch3 to --with-device=ch4. In particular, the ch4 device enables per-VCI request pools, which reduced the total number of requests available in each pool.

hzhou avatar Oct 08 '24 21:10 hzhou

What are the default number of VCIs used by the ch4 config in 4.x ? If we set it to 1, can we expect the same behavior we used to get with 3.4.x ? Thanks!

ajaypanyala avatar Oct 09 '24 17:10 ajaypanyala

In order to support per-VCI request pools, we took 6 bits away from the request handle, which shrinks the total number of requests per pool to 1/64 of what it was. Let me think about making that configurable. I think we can definitely make it a build option, and potentially a runtime (init-time) option. Stay tuned.

hzhou avatar Oct 09 '24 17:10 hzhou

Right now we have a parameter in GA that controls the number of outstanding requests. It is set to 256 (per MPI process). Is this too large?

bjpalmer avatar Oct 09 '24 17:10 bjpalmer

@bjpalmer I have seen this error less likely to occur when I set COMEX_MAX_NB_OUTSTANDING=3 https://github.com/GlobalArrays/ga/blob/develop/comex/src-mpi-pr/comex_impl.h

edoapra avatar Oct 09 '24 18:10 edoapra

Does the error still occur at 3 or does it go away completely? There is also a separate parameter NB_MAX_NUM_NB_HDLS in the GA layer that can be set to lower the number of outstanding GA handles. It's located in https://github.com/GlobalArrays/ga/blob/develop/global/src/nbutil.c.

bjpalmer avatar Oct 09 '24 18:10 bjpalmer

Does the error still occur at 3 or does it go away completely? There is also a separate parameter NB_MAX_NUM_NB_HDLS in the GA layer that can be set to lower the number of outstanding GA handles. It's located in https://github.com/GlobalArrays/ga/blob/develop/global/src/nbutil.c.

Let me repeat the test by lowering NB_MAX_NUM_NB_HDLS and keeping COMEX_MAX_NB_OUTSTANDING=3

edoapra avatar Oct 09 '24 18:10 edoapra

@hzhou Could you add aurora tag for tracking reasons

abagusetty avatar Oct 22 '24 15:10 abagusetty

@hzhou Could you add aurora tag for tracking reasons

I'll add the label to the PR.

hzhou avatar Oct 22 '24 15:10 hzhou

PR #7181 doesn't directly address this issue in the sense of providing the query, but it allows a much larger maximum number of requests (up to 2^25). Hopefully that is beyond any application's need.

hzhou avatar Oct 22 '24 15:10 hzhou