SST metadata aggregation does not scale above 2GB (PIConGPU: more than 7k nodes on Frontier)

Open franzpoeschel opened this issue 1 year ago • 13 comments

Describe the bug

CP_consolidateDataToRankZero() in source/adios2/toolkit/sst/cp/cp_common.c collects the metadata to rank 0 upon EndStep. In PIConGPU, a single rank's contribution is ~38948 bytes.

On 7000 Frontier nodes with 8 GPUs per node: 38948 B * 7000 * 8 ≈ 2080 MB

Looking into CP_consolidateDataToRankZero():

    if (Stream->Rank == 0)
    {
        int TotalLen = 0;
        Displs = malloc(Stream->CohortSize * sizeof(*Displs));

        Displs[0] = 0;
        TotalLen = (RecvCounts[0] + 7) & ~7;

        for (int i = 1; i < Stream->CohortSize; i++)
        {
            int RoundUp = (RecvCounts[i] + 7) & ~7;
            Displs[i] = TotalLen;
            TotalLen += RoundUp;
        }

        RecvBuffer = malloc(TotalLen * sizeof(char));
    }

    /*
     * Now we have the receive buffer, counts, and displacements, and
     * can gather the data
     */

    SMPI_Gatherv(Buffer, DataSize, SMPI_CHAR, RecvBuffer, RecvCounts, Displs, SMPI_CHAR, 0,
                 Stream->mpiComm);

Since Displs is a vector of int (as is the TotalLen it accumulates), the maximum supported destination buffer size for this method is 2 GB.
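
For reference, here is a standalone sketch (not ADIOS2 code) that plugs the numbers above into the same 8-byte round-up used by CP_consolidateDataToRankZero() and shows the aggregate crossing INT_MAX, which is exactly where the int-typed TotalLen and Displs overflow:

    /* Standalone illustration of the overflow, using the figures from this report:
     * ~38948 bytes of metadata per rank, 7000 Frontier nodes with 8 ranks each. */
    #include <limits.h>
    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        const int64_t PerRank = 38948;              /* bytes of metadata per rank */
        const int64_t Ranks = 7000LL * 8;           /* 7000 nodes, 8 GPUs per node */
        const int64_t Aligned = (PerRank + 7) & ~7; /* same round-up as cp_common.c */
        const int64_t Total = Aligned * Ranks;

        printf("aggregate metadata: %lld bytes\n", (long long)Total);
        if (Total > INT_MAX)
        {
            /* In CP_consolidateDataToRankZero(), TotalLen and Displs[] are int,
             * so the displacements of the last ranks wrap around here. */
            printf("exceeds INT_MAX (%d): 32-bit displacements overflow\n", INT_MAX);
        }
        return 0;
    }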

To Reproduce

-- no reproducer --

Expected behavior

Some method to handle SST metadata aggregation at large scale.

Desktop (please complete the following information):

  • Frontier, using PR #3588

Additional context

I'm setting MarshalMethod = bp5 in SST.

franzpoeschel avatar Oct 17 '23 11:10 franzpoeschel

No worries. Likely we just need to replicate BP5-file-engine-style techniques in SST.

eisenhauer avatar Oct 17 '23 11:10 eisenhauer

No worries. Likely we just need to replicate BP5-file-engine-style techniques in SST.

Hey Greg, thank you for the fast reply. Is this something that can already be tested today by setting some hidden flag? Otherwise, I might split the offending SMPI_Gatherv call into multiple smaller calls as a workaround for now.
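
For illustration, here is one possible shape of such a split, sketched in plain MPI (this is not the actual change, and it ignores the 8-byte alignment that CP_consolidateDataToRankZero() applies): gather the ranks in fixed-size windows, with ranks outside the current window contributing zero bytes, and let rank 0 append each window into a single size_t-indexed buffer.

    /* Sketch of a windowed gather: collect MyLen bytes from every rank to rank 0
     * without letting any single MPI_Gatherv exceed 32-bit counts/displacements.
     * On rank 0 the result is a malloc'd buffer with all contributions in rank
     * order; Offsets[] (CohortSize entries, rank 0 only) records where each rank's
     * data starts.  Non-root ranks may pass Offsets = NULL and get NULL back. */
    #include <mpi.h>
    #include <stdlib.h>
    #include <string.h>

    char *GatherLargeToRoot(const char *MyData, int MyLen, size_t *Offsets, MPI_Comm Comm)
    {
        int Rank, CohortSize;
        MPI_Comm_rank(Comm, &Rank);
        MPI_Comm_size(Comm, &CohortSize);

        const int Window = 4096; /* ranks gathered per round; tune as needed */
        int *AllLens = NULL, *Counts = NULL, *Displs = NULL;
        char *Big = NULL;
        size_t BigLen = 0;

        if (Rank == 0)
        {
            AllLens = malloc(CohortSize * sizeof(int));
            Counts = malloc(CohortSize * sizeof(int));
            Displs = malloc(CohortSize * sizeof(int));
        }

        /* Step 1: rank 0 learns every rank's length (ints are fine, lengths are small). */
        MPI_Gather(&MyLen, 1, MPI_INT, AllLens, 1, MPI_INT, 0, Comm);

        if (Rank == 0)
        {
            for (int i = 0; i < CohortSize; i++)
            {
                Offsets[i] = BigLen;
                BigLen += (size_t)AllLens[i];
            }
            Big = malloc(BigLen);
        }

        /* Step 2: gather in windows of ranks; ranks outside the window send 0 bytes,
         * so each Gatherv's receive buffer stays far below the 2 GB limit. */
        for (int Start = 0; Start < CohortSize; Start += Window)
        {
            int End = (Start + Window < CohortSize) ? Start + Window : CohortSize;
            int SendLen = (Rank >= Start && Rank < End) ? MyLen : 0;
            int RoundTotal = 0;
            char *RoundBuf = NULL;

            if (Rank == 0)
            {
                for (int i = 0; i < CohortSize; i++)
                {
                    int InWindow = (i >= Start && i < End);
                    Counts[i] = InWindow ? AllLens[i] : 0;
                    Displs[i] = InWindow ? RoundTotal : 0;
                    if (InWindow)
                        RoundTotal += AllLens[i];
                }
                RoundBuf = malloc(RoundTotal > 0 ? RoundTotal : 1);
            }

            MPI_Gatherv(MyData, SendLen, MPI_CHAR, RoundBuf, Counts, Displs, MPI_CHAR, 0, Comm);

            if (Rank == 0)
            {
                /* The window is a contiguous run of ranks, so it lands contiguously in Big. */
                memcpy(Big + Offsets[Start], RoundBuf, (size_t)RoundTotal);
                free(RoundBuf);
            }
        }

        if (Rank == 0)
        {
            free(AllLens);
            free(Counts);
            free(Displs);
        }
        return Big; /* NULL on non-root ranks */
    }

Each individual MPI_Gatherv then only ever sees one window's worth of data, so its int counts and displacements stay far below 2 GB, at the cost of roughly CohortSize/Window collective calls.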

franzpoeschel avatar Oct 17 '23 11:10 franzpoeschel

Also, does it make a difference that I'm using branch #3588 on Frontier? (I need that branch for a scalability fix of the MPI DP)

franzpoeschel avatar Oct 17 '23 11:10 franzpoeschel

No worries. Likely we just need to replicate BP5-file-engine-style techniques in SST.

Hey Greg, thank you for the fast reply. Is this something that can already be tested today by setting some hidden flag? Otherwise, I might split the offending SMPI_Gatherv call into multiple smaller calls as a workaround for now.

Unfortunately no, not yet. In BP5Writer.cpp there's code that starts with the comment "Two-step metadata aggregation" that implements this for BP5, but it hasn't been done yet for SST. Here we're exploiting some characteristics of BP5 metadata: many times, multiple ranks have identical meta-metadata, and we can discard the duplicates, keeping only one unique copy. This reduces overall metadata size dramatically, at the cost of having to do aggregation in multiple stages. Norbert implemented a fix for this in the BP5 writer, but it should probably be reworked so that it can be shared between engines that use BP5 serialization. Doing that right (so that we use a simple approach at small scale and only go to more complex measures when necessary) isn't wildly hard, but it's non-trivial (and something I probably can't get to this week).
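
To make the deduplication idea concrete, here is a rough sketch (an illustration only, not the code in BP5Writer.cpp): gather a small fingerprint of every rank's meta-metadata first, and fetch the full block only from one representative rank per distinct fingerprint.

    /* Rough sketch of the deduplication idea (an illustration, not BP5Writer.cpp):
     * gather only an 8-byte fingerprint of each rank's meta-metadata, then keep one
     * representative rank per distinct fingerprint instead of every identical copy. */
    #include <mpi.h>
    #include <stdint.h>
    #include <stdlib.h>

    /* Any stable hash works; FNV-1a is used here purely for illustration. */
    static uint64_t Fingerprint(const char *Data, size_t Len)
    {
        uint64_t Hash = 14695981039346656037ULL;
        for (size_t i = 0; i < Len; i++)
        {
            Hash ^= (unsigned char)Data[i];
            Hash *= 1099511628211ULL;
        }
        return Hash;
    }

    void AggregateMetaMeta(const char *MetaMeta, size_t MetaMetaLen, MPI_Comm Comm)
    {
        int Rank, CohortSize;
        MPI_Comm_rank(Comm, &Rank);
        MPI_Comm_size(Comm, &CohortSize);

        /* Step 1: gather fixed-size fingerprints instead of the blocks themselves. */
        uint64_t Mine = Fingerprint(MetaMeta, MetaMetaLen);
        uint64_t *All = (Rank == 0) ? malloc(CohortSize * sizeof(uint64_t)) : NULL;
        MPI_Gather(&Mine, 1, MPI_UINT64_T, All, 1, MPI_UINT64_T, 0, Comm);

        if (Rank == 0)
        {
            /* Step 2: find one representative rank per distinct fingerprint.  If all
             * ranks produced identical meta-metadata (the common case), only a single
             * full block needs to travel to rank 0. */
            for (int i = 0; i < CohortSize; i++)
            {
                int Seen = 0;
                for (int j = 0; j < i; j++)
                    if (All[j] == All[i])
                        Seen = 1;
                if (!Seen)
                {
                    /* Fetch the full meta-metadata block from rank i here, e.g. with a
                     * point-to-point receive or a second, much smaller gather. */
                }
            }
            free(All);
        }
    }

In the common case where every rank produces identical meta-metadata, only one full copy travels to rank 0; the much smaller per-rank metadata can still be gathered in a second step.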

eisenhauer avatar Oct 17 '23 11:10 eisenhauer

Also, does it make a difference that I'm using branch #3588 on Frontier? (I need that branch for a scalability fix of the MPI DP)

No, this should be independent of those changes.

eisenhauer avatar Oct 17 '23 11:10 eisenhauer

In the meantime, I'll try whether this commit (https://github.com/ornladios/ADIOS2/commit/853ff0d1c92dc984fd571071282896f04d2c4844) helps as a workaround. This should fix the Gatherv call at the cost of slightly higher latency, but I don't know whether there is any 32-bit indexing going on later on that will break things again.

franzpoeschel avatar Oct 17 '23 13:10 franzpoeschel

In the meantime, I'll try whether this commit helps as a workaround

I'd think that would function as a workaround. As far as I know there's no other 32-bit indexing, only the limits of MPI. Longer term I'd like to implement something smarter, but if this gets you through, let me know.
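
As an aside (an assumption, not something discussed in this thread): MPI 4.0's large-count collectives lift that limit at the MPI level, e.g. MPI_Gatherv_c takes MPI_Count counts and MPI_Aint displacements, but using it would require an MPI 4.0 library and a matching large-count SMPI wrapper in ADIOS2.

    /* Sketch only: requires an MPI >= 4.0 implementation.  MPI_Gatherv_c is the
     * large-count form of MPI_Gatherv, taking MPI_Count counts and MPI_Aint
     * displacements, so the receive buffer is no longer capped at 2 GB. */
    #include <mpi.h>

    void GatherMetadataLarge(const char *Buffer, MPI_Count DataSize, char *RecvBuffer,
                             const MPI_Count *RecvCounts, const MPI_Aint *Displs,
                             MPI_Comm Comm)
    {
        /* Same call pattern as the SMPI_Gatherv in cp_common.c, but 64-bit capable. */
        MPI_Gatherv_c(Buffer, DataSize, MPI_CHAR, RecvBuffer, RecvCounts, Displs,
                      MPI_CHAR, 0, Comm);
    }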

eisenhauer avatar Oct 17 '23 13:10 eisenhauer

I can't look at this right now, but note that the two-level aggregation did not help with the attributes, only with meta-metadata. That is, if an attribute is defined on all processes, it blows up the aggregation size. If that is the reason you reach the limit, two-level aggregation does not decrease it.

pnorbert avatar Oct 17 '23 17:10 pnorbert

I can't look at this right now, but note that the two-level aggregation did not help with the attributes, only with meta-metadata. That is, if an attribute is defined on all processes, it blows up the aggregation size. If that is the reason you reach the limit, two-level aggregation does not decrease it.

That is absolutely true...

eisenhauer avatar Oct 17 '23 17:10 eisenhauer

but if this gets you through, let me know.

The job now ran through without crashing at 7168 nodes. I'll now try going full scale.

I can't look at this right now, but note that the two-level aggregation did not help with the attributes, only with meta-metadata. That is, if an attribute is defined on all processes, it blows up the aggregation size. If that is the reason you reach the limit, two-level aggregation does not decrease it.

We were at some point thinking about optimizing parallel attribute writes, e.g. by just disabling them on any rank but rank 0. It looks like we should do this. (Even though that would only push the 2 GB limit out a bit further; the workaround that I'm now using avoids it entirely.)
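
A minimal sketch of that rank-0 guard (an illustration only, not existing PIConGPU or ADIOS2 code; adios2_define_attribute is from the ADIOS2 C bindings, the attribute name and value are made up, and the same guard applies around IO::DefineAttribute in the C++ API):

    /* Sketch of defining attributes only on rank 0.  Readers install duplicate
     * attributes as no-ops anyway, so defining them once avoids shipping identical
     * copies from every rank through metadata aggregation. */
    #include <adios2_c.h>
    #include <mpi.h>

    void DefineGlobalAttributes(adios2_io *io, MPI_Comm comm)
    {
        int rank;
        MPI_Comm_rank(comm, &rank);
        if (rank != 0)
        {
            return; /* all other ranks skip the (identical) attribute definitions */
        }
        const double cell_size = 0.5; /* hypothetical simulation parameter */
        adios2_define_attribute(io, "cell_size", adios2_type_double, &cell_size);
    }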

franzpoeschel avatar Oct 18 '23 08:10 franzpoeschel

Update: I've successfully run SST full-scale for the first time on Frontier with this (9126 nodes, i.e. quasi full-scale)

franzpoeschel avatar Oct 18 '23 14:10 franzpoeschel

Update: I've successfully run SST full-scale for the first time on Frontier with this (9126 nodes, i.e. quasi full-scale)

Excellent... Adding issue #3852 to address these things across the board.

eisenhauer avatar Oct 18 '23 15:10 eisenhauer

but if this gets you through, let me know.

The job now ran through without crashing at 7168 nodes. I'll now try going full scale.

Great.

We were at some point thinking about optimizing parallel attribute writes, e.g. by just disabling them on any rank but rank 0. It looks like we should do this. (Even though that would only push the 2 GB limit out a bit further; the workaround that I'm now using avoids it entirely.)

Yes, you should absolutely do this. At least currently, all attributes from all ranks are stored and installed by the reader, with duplicates doing nothing. Setting the same attributes on all nodes just adds overhead.

eisenhauer avatar Oct 18 '23 20:10 eisenhauer