Hit assert in ABTI_mem_pool_alloc()

Open NiuYawei opened this issue 3 years ago • 13 comments

I've seen a couple of CI failures with this assertion failure, but do not have a good reproducer. This is running on Azure, so under docker on a shared VM, and I expect there to be extreme CPU and memory pressure in these cases.

ERROR: daos_engine:0 daos_engine: ../src/include/abti_mem_pool.h:123: ABTI_mem_pool_alloc: Assertion `num_headers_in_cur_bucket >= 1' failed.
ERROR: daos_engine:0 *** Process 43149 received signal 6 ***
Associated errno: Success (0)
/lib64/libpthread.so.0(+0x12b20)[0x7f660fdc8b20]
/lib64/libc.so.6(gsignal+0x10f)[0x7f660f1767ff]
/lib64/libc.so.6(abort+0x127)[0x7f660f160c35]
/lib64/libc.so.6(+0x21b09)[0x7f660f160b09]
/lib64/libc.so.6(+0x2fde6)[0x7f660f16ede6]
/opt/daos/bin/../prereq/release/argobots/lib/libabt.so.1(+0x10bf2)[0x7f660fb9ebf2]
/opt/daos/bin/../prereq/release/argobots/lib/libabt.so.1(ABT_thread_create+0x92)[0x7f660fb9eda2]
/opt/daos/bin/daos_engine[0x44b49f]
/opt/daos/bin/daos_engine[0x44af6e]
/opt/daos/bin/daos_engine(dss_ult_create+0x45)[0x44ada5]
/opt/daos/bin/daos_engine[0x417e20]
/opt/daos/bin/daos_engine[0x417a2b]
/opt/daos/bin/daos_engine[0x4174f5]
/opt/daos/bin/daos_engine[0x417105]
/opt/daos/bin/daos_engine(drpc_progress+0x27e)[0x4165ee]
/opt/daos/bin/daos_engine[0x415622]
/opt/daos/bin/../prereq/release/argobots/lib/libabt.so.1(+0x17dba)[0x7f660fba5dba]
/opt/daos/bin/../prereq/release/argobots/lib/libabt.so.1(+0x17f51)[0x7f660fba5f51]
DEBUG 21:05:28.522056 procmon.go:246: Cleaning Pool f04361ee-06fe-4c34-8ecc-8f1dd3a55c49 failed:pool evict failed: rpc error: code = Unknown desc = failed to send 92B message: dRPC recv: EOF
instance 0, pid 43149, rank 0 exited with status: /opt/daos/bin/daos_engine exited: signal: aborted (core dumped)

NiuYawei avatar May 17 '21 03:05 NiuYawei

Thank you for reporting an issue!

The pool structure looks broken. This should not happen if the algorithm works correctly. This pool uses somewhat complicated logic (#183), but our CI (covering numerous OSs, compilers, and CPU architectures) has never encountered this issue so far (see https://www.argobots.org/tests/ for the tested combinations). I haven't tested the combination of Azure/docker/VM, though. Judging from the line number of the assert(), I believe you are using the latest stable Argobots 1.1.

  1. Could you tell me the configure options you used to build Argobots?
  2. Could you tell me the architecture (x86/64?) and the compiler (icc?)?
  3. I have received several issues regarding stack overflow (see #274). Does it happen even if you use a larger stack size?
    • ABT_THREAD_STACKSIZE=XXX or ABT_ENV_THREAD_STACKSIZE=XXX changes the default stack size.
    • It is not included in Argobots 1.1, but the current main branch also includes active stack smash detection (see #327).

I would really appreciate a reproducer to investigate this issue, even if the reproducing code is not small.

shintaro-iwasaki avatar May 17 '21 03:05 shintaro-iwasaki

Thanks for looking into this.

  1. The build steps look like this: './autogen.sh', './configure --prefix=$ARGOBOTS_PREFIX CC=gcc --enable-valgrind --enable-stack-unwind', 'make $JOBS_OPT', 'make $JOBS_OPT install'.

  2. The arch is x86/64 and the compiler is gcc.

  3. We use the default stack size for most ULTs and a larger stack size only for a few particular ULTs that require it, so we never changed the default stack size through the env var. This particular issue happened on a ULT with the default stack size. I'm not sure whether enlarging the stack size would solve the problem, since we can't reliably reproduce it, but I'll ask engineers to try it out.

Many thanks, I'll keep you informed if there are any new findings.

NiuYawei avatar May 17 '21 06:05 NiuYawei

This was actually hit in our test suite. We're trying to see what we can achieve in GitHub Actions, and this is one of the failures we saw there. An example run is here:

https://github.com/ashleypittman/daos/runs/2567827884

Running under GitHub Actions generally hasn't been that stable for us; we've found a few issues that all seem to relate to resource starvation or timeouts, which is not entirely unexpected given the constraints.

We've since trimmed back the PR in question to a core set of functionality and landed it, but I can expand it again to see if I can hit upon a more reliable reproducer. Argobots is built from your v1.1 tag.

I'll create another PR to reproduce the settings I was using before to see if I can trigger this again - it was regularly occurring for a couple of days for me last week.

ashleypittman avatar May 17 '21 08:05 ashleypittman

Thank you for your replies.

The arch is x86/64 and compiler is gcc.

I thought Argobots might encounter a bug in 128-bit atomic CAS, which is used for this memory pool algorithm, but a widely used compiler (e.g., GCC) + x86/64 should not cause an issue. This feature is checked in autogen.sh and also has a fallback implementation.

https://github.com/pmodels/argobots/blob/main/src/include/asm/abtd_asm_int128_cas.h#L20-L36
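To illustrate why a wide CAS matters for this kind of pool, here is a minimal sketch of a tagged LIFO free list. It is not the Argobots implementation: it packs a 32-bit slot index and a 32-bit version tag into one 64-bit word so that plain C11 atomics suffice, whereas the real pool uses a 128-bit CAS. The names `tagged_lifo_t`, `lifo_push`, and `lifo_pop` are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define POOL_CAPACITY 8u
#define NIL UINT32_MAX

/* Hypothetical free list: elements are array slots linked by index. */
typedef struct {
    _Atomic uint32_t next[POOL_CAPACITY]; /* slot below slot i, or NIL */
    _Atomic uint64_t head; /* low 32 bits: top slot, high 32 bits: tag */
} tagged_lifo_t;

static uint64_t pack(uint32_t index, uint32_t tag)
{
    return ((uint64_t)tag << 32) | index;
}

void lifo_init(tagged_lifo_t *l)
{
    atomic_init(&l->head, pack(NIL, 0));
}

void lifo_push(tagged_lifo_t *l, uint32_t slot)
{
    assert(slot < POOL_CAPACITY);
    uint64_t old = atomic_load(&l->head);
    uint64_t new_head;
    do {
        /* Link the new slot to the current top, then try to publish it. */
        atomic_store(&l->next[slot], (uint32_t)old);
        new_head = pack(slot, (uint32_t)(old >> 32) + 1);
    } while (!atomic_compare_exchange_weak(&l->head, &old, new_head));
}

uint32_t lifo_pop(tagged_lifo_t *l)
{
    uint64_t old = atomic_load(&l->head);
    uint64_t new_head;
    do {
        uint32_t top = (uint32_t)old;
        if (top == NIL)
            return NIL; /* empty */
        /* The tag changes on every update, so a stale `old` cannot match
         * even if the same slot has returned to the top (ABA). */
        new_head = pack(atomic_load(&l->next[top]),
                        (uint32_t)(old >> 32) + 1);
    } while (!atomic_compare_exchange_weak(&l->head, &old, new_head));
    return (uint32_t)old;
}
```

The point of the sketch is only that the head and its tag must change together in one atomic step; if that step were not truly atomic on some platform, the list could be corrupted in a way that resembles the failing assertion.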

https://github.com/ashleypittman/daos/runs/2567827884

Thank you. It is very helpful! We will investigate this issue, but as the program is large, please do not expect that I can find a bug very soon.

Regarding resource management, Argobots 1.1 fixed error handling paths, so Argobots itself should properly return resource allocation errors (e.g., memory allocation failure in this memory pool) to the user application unless the error is catastrophic. Those paths should be well tested (#309).

I'll create another PR to reproduce the settings I was using before to see if I can trigger this again - it was regularly occurring for a couple of days for me last week.

Thanks! Tag v1.1 of Argobots has not been updated since March 31, so it would be helpful to know which commits directly reveal this issue (that potentially existed in Argobots).

shintaro-iwasaki avatar May 17 '21 13:05 shintaro-iwasaki

I could not reproduce this issue in 4-5 attempts: https://github.com/shintaro-iwasaki/daos-copy/pull/1

I will write a heavily threaded program and check this memory pool implementation in Argobots, but at this point I would suspect either a ULT stack overflow or a bug (e.g., illegal memory access) in DAOS.

shintaro-iwasaki avatar May 24 '21 21:05 shintaro-iwasaki

I am hitting this assert on Summit. This is running with ASAN, which is reporting no errors ahead of the assert failure.

dspaces_server: ../src/include/abti_mem_pool.h:123: ABTI_mem_pool_alloc: Assertion `num_headers_in_cur_bucket >= 1' failed.
[h35n03:06804] *** Process received signal ***
[h35n03:06804] Signal: Aborted (6)
[h35n03:06804] Signal code: (-6)
[h35n03:06804] [ 0] /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/gcc-9.1.0/spectrum-mpi-10.3.1.2-20200121-jd4wr7r4th5gtr4qndday6gkbvqziasp/container/../lib/libopen-pal.so.3(+0x9222c)[0x20000194222c]
[h35n03:06804] [ 1] [0x2000000504d8]
[h35n03:06804] [ 2] /lib64/libc.so.6(abort+0x2b4)[0x2000010b2094]
[h35n03:06804] [ 3] /lib64/libc.so.6(+0x356d4)[0x2000010a56d4]
[h35n03:06804] [ 4] /lib64/libc.so.6(__assert_fail+0x64)[0x2000010a57c4]
[h35n03:06804] [ 5] /autofs/nccs-svm1_home1/pdavis/spack-new/opt/spack/linux-rhel7-power9le/gcc-9.1.0/argobots-1.1-dmjmhggingfmiycftyfakqbqdqxkh7s4/lib/libabt.so.1(+0x14b78)[0x200000cf4b78]
[h35n03:06804] [ 6] /autofs/nccs-svm1_home1/pdavis/spack-new/opt/spack/linux-rhel7-power9le/gcc-9.1.0/argobots-1.1-dmjmhggingfmiycftyfakqbqdqxkh7s4/lib/libabt.so.1(ABT_thread_create+0xa8)[0x200000cf4d18]
[h35n03:06804] [ 7] /gpfs/alpine/scratch/pdavis/csc143/dspaces/build.3/lib64/libdspaces-server.so.2(_handler_for_ss_rpc+0xf4)[0x200000b6b038]
[h35n03:06804] [ 8] /autofs/nccs-svm1_home1/pdavis/spack-new/opt/spack/linux-rhel7-power9le/gcc-9.1.0/mercury-2.0.1-jg6qcunksry7jx5uqgpyien6vw4f2tsx/lib/libmercury.so.2(+0x6350)[0x200000c16350]
[h35n03:06804] [ 9] /autofs/nccs-svm1_home1/pdavis/spack-new/opt/spack/linux-rhel7-power9le/gcc-9.1.0/mercury-2.0.1-jg6qcunksry7jx5uqgpyien6vw4f2tsx/lib/libmercury.so.2(+0x15080)[0x200000c25080]
[h35n03:06804] [10] /autofs/nccs-svm1_home1/pdavis/spack-new/opt/spack/linux-rhel7-power9le/gcc-9.1.0/mercury-2.0.1-jg6qcunksry7jx5uqgpyien6vw4f2tsx/lib/libmercury.so.2(HG_Core_trigger+0x24)[0x200000c2cfb4]
[h35n03:06804] [11] /autofs/nccs-svm1_home1/pdavis/spack-new/opt/spack/linux-rhel7-power9le/gcc-9.1.0/mercury-2.0.1-jg6qcunksry7jx5uqgpyien6vw4f2tsx/lib/libmercury.so.2(HG_Trigger+0x28)[0x200000c1a078]
[h35n03:06804] [12] /autofs/nccs-svm1_home1/pdavis/spack-new/opt/spack/linux-rhel7-power9le/gcc-9.1.0/mochi-margo-0.9.4-iq7m5zqw6jlxm32sqp3pu6dbdhmuzux2/lib/libmargo.so.0(__margo_hg_progress_fn+0x74)[0x200000baa4f4]
[h35n03:06804] [13] /autofs/nccs-svm1_home1/pdavis/spack-new/opt/spack/linux-rhel7-power9le/gcc-9.1.0/argobots-1.1-dmjmhggingfmiycftyfakqbqdqxkh7s4/lib/libabt.so.1(+0x1da18)[0x200000cfda18]
[h35n03:06804] [14] /autofs/nccs-svm1_home1/pdavis/spack-new/opt/spack/linux-rhel7-power9le/gcc-9.1.0/argobots-1.1-dmjmhggingfmiycftyfakqbqdqxkh7s4/lib/libabt.so.1(+0x1dff4)[0x200000cfdff4]
[h35n03:06804] *** End of error message ***

philip-davis avatar Jun 25 '21 14:06 philip-davis

@philip-davis Thank you very much. The error seems very similar to what @NiuYawei reported. I will check this issue again.

shintaro-iwasaki avatar Jun 25 '21 15:06 shintaro-iwasaki

I tested Argobots' memory-pool operations on a Summit-like POWER9 machine at Argonne, but I could not reproduce this issue.

What I did

Argobots v1.1 + POWER9 + GCC 9.3, Spack-default configuration. The test creates 10 million ULTs (Fibonacci(34) with no cutoff) and schedules them with random work stealing. I repeated this test with various numbers of ESs, 500 times in total (which took a few hours).

## Environment
$ gcc --version
gcc (Spack GCC) 9.3.0
$ cat /proc/cpuinfo
...
cpu             : POWER9, altivec supported
...

## Configure Argobots (the same as the default "spack install argobots")
$ git checkout v1.1
$ sh autogen.sh
$ ./configure --prefix=$(pwd)/install --enable-perf-opt

## Build and run modified fibonacci
$ gcc fib.c -labt -L install/lib -I install/include/ -Wl,-rpath=$(pwd)/install/lib -o fib.out
$ cat test.sh
for repeat in $(seq 5); do
  for es in $(seq 100); do
    date
    echo "./fib.out -n 35 -e $es"
    ./fib.out -n 35 -e $es
  done
done
$ sh test.sh
Code

#include <stdlib.h>
#include <stdio.h>
#include <assert.h>
#include <unistd.h>
#include <stdarg.h>
#include <abt.h>

#define DEFAULT_NUM_XSTREAMS 4
#define DEFAULT_N 10

ABT_pool *pools;

typedef struct {
    int n;
    int ret;
} fibonacci_arg_t;

void fibonacci(void *arg)
{
    int n = ((fibonacci_arg_t *)arg)->n;
    int *p_ret = &((fibonacci_arg_t *)arg)->ret;

    if (n <= 1) {
        *p_ret = 1;
    } else {
        fibonacci_arg_t child1_arg = { n - 1, 0 };
        fibonacci_arg_t child2_arg = { n - 2, 0 };
        int rank;
        ABT_xstream_self_rank(&rank);
        ABT_pool target_pool = pools[rank];
        ABT_thread child1;
        /* Calculate fib(n - 1). */
        ABT_thread_create(target_pool, fibonacci, &child1_arg,
                          ABT_THREAD_ATTR_NULL, &child1);
        /* Calculate fib(n - 2).  We do not create another ULT. */
        fibonacci(&child2_arg);
        ABT_thread_free(&child1);
        *p_ret = child1_arg.ret + child2_arg.ret;
    }
}

int fibonacci_seq(int n)
{
    if (n <= 1) {
        return 1;
    } else {
        int i;
        int fib_i1 = 1; /* Value of fib(i - 1) */
        int fib_i2 = 1; /* Value of fib(i - 2) */
        for (i = 3; i <= n; i++) {
            int tmp = fib_i1;
            fib_i1 = fib_i1 + fib_i2;
            fib_i2 = tmp;
        }
        return fib_i1 + fib_i2;
    }
}

int main(int argc, char **argv)
{
    int i, j;
    /* Read arguments. */
    int num_xstreams = DEFAULT_NUM_XSTREAMS;
    int n = DEFAULT_N;
    while (1) {
        int opt = getopt(argc, argv, "he:n:");
        if (opt == -1)
            break;
        switch (opt) {
            case 'e':
                num_xstreams = atoi(optarg);
                break;
            case 'n':
                n = atoi(optarg);
                break;
            case 'h':
            default:
                printf("Usage: ./fibonacci [-e NUM_XSTREAMS] [-n N]\n");
                return -1;
        }
    }

    /* Allocate memory. */
    ABT_xstream *xstreams =
        (ABT_xstream *)malloc(sizeof(ABT_xstream) * num_xstreams);
    pools = (ABT_pool *)malloc(sizeof(ABT_pool) * num_xstreams);
    ABT_sched *scheds = (ABT_sched *)malloc(sizeof(ABT_sched) * num_xstreams);

    /* Initialize Argobots. */
    ABT_init(argc, argv);

    /* Create pools. */
    for (i = 0; i < num_xstreams; i++) {
        ABT_pool_create_basic(ABT_POOL_FIFO, ABT_POOL_ACCESS_MPMC, ABT_TRUE,
                              &pools[i]);
    }

    /* Create schedulers. */
    for (i = 0; i < num_xstreams; i++) {
        ABT_pool *tmp = (ABT_pool *)malloc(sizeof(ABT_pool) * num_xstreams);
        for (j = 0; j < num_xstreams; j++) {
            tmp[j] = pools[(i + j) % num_xstreams];
        }
        ABT_sched_create_basic(ABT_SCHED_DEFAULT, num_xstreams, tmp,
                               ABT_SCHED_CONFIG_NULL, &scheds[i]);
        free(tmp);
    }

    /* Set up a primary execution stream. */
    ABT_xstream_self(&xstreams[0]);
    ABT_xstream_set_main_sched(xstreams[0], scheds[0]);

    /* Create secondary execution streams. */
    for (i = 1; i < num_xstreams; i++) {
        ABT_xstream_create(scheds[i], &xstreams[i]);
    }

    for (i = 2; i <= n; i++) {
        fibonacci_arg_t arg = { i, 0 };
        fibonacci(&arg);
        int ret = arg.ret;
        int ans = fibonacci_seq(i);
        /* Check the results. */
        printf("Fibonacci(%d) = %d (ans: %d)\n", i, ret, ans);
    }

    /* Join secondary execution streams. */
    for (i = 1; i < num_xstreams; i++) {
        ABT_xstream_join(xstreams[i]);
        ABT_xstream_free(&xstreams[i]);
    }

    /* Finalize Argobots. */
    ABT_finalize();

    /* Free allocated memory. */
    free(xstreams);
    free(pools);
    free(scheds);

    return 0;
}

Although I have not confirmed the cause, I would first suggest setting ABT_THREAD_STACKSIZE=XXX, where XXX is sufficiently large (~4096~ [EDIT] 16384 by default).

ABT_THREAD_STACKSIZE=256000 ./your_app.out

Explanation

If a ULT runs out of its function stack, it can overwrite num_headers and cause this assert failure.
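The mechanism can be simulated in plain C without Argobots. The sketch below assumes a hypothetical layout (`fake_slot_t`) in which a bookkeeping field sits directly below a downward-growing stack region; the real Argobots layout may differ. Stack usage is modeled by zero-filling bytes from the top of the region downward.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define STACK_BYTES 64

/* Hypothetical layout: metadata sits directly below the stack region. */
typedef struct {
    size_t num_headers;               /* stand-in for pool bookkeeping */
    unsigned char stack[STACK_BYTES]; /* grows down, toward num_headers */
} fake_slot_t;

/* Simulate a ULT that consumes `used` bytes of stack by zero-filling from
 * the top of the stack region downward.  Returns num_headers afterwards. */
size_t simulate_stack_use(size_t used)
{
    /* Work on a raw byte copy so that overrunning the stack member is
     * well-defined in this simulation. */
    unsigned char raw[sizeof(fake_slot_t)];
    fake_slot_t init = { .num_headers = 4 };
    memcpy(raw, &init, sizeof(init));

    size_t stack_top = offsetof(fake_slot_t, stack) + STACK_BYTES;
    assert(used <= stack_top);
    memset(raw + stack_top - used, 0, used);

    fake_slot_t after;
    memcpy(&after, raw, sizeof(after));
    return after.num_headers;
}
```

With this layout, `simulate_stack_use(STACK_BYTES)` leaves the field at 4, while `simulate_stack_use(STACK_BYTES + sizeof(size_t))` returns 0, which is the kind of value that would trip an assertion such as `num_headers_in_cur_bucket >= 1`.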

image

[EDIT] 4KB is wrong. 16KB is correct.

I'm not sure what Margo's default stack size is, but if Margo does not explicitly set it, the program may have overflowed the stack, considering the depth of the call stack @philip-davis reported. By default it is ~4KB~ ([EDIT] 16KB). I am not fully sure this is the cause since I cannot reproduce the issue; the scenario above assumes that num_headers is overwritten with 0, which need not always be the case.

To examine this, the latest Argobots (main) supports mprotect-based dynamic detection (see #327): this feature lets the user detect a stack smash at the moment it happens. Argobots 1.1 (the latest stable version) supports stack-canary-based lazy detection (see #293). @mdorier: I would welcome any suggestions you may have regarding this issue.
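The difference between the two detection styles can be sketched with POSIX calls. This is an illustration only, not how Argobots implements either feature: `guarded_stack_t` and its functions are hypothetical names, and the sketch assumes a downward-growing stack on a system that provides `mmap` with `MAP_ANONYMOUS`. A PROT_NONE guard page below the stack makes an overrun fault immediately, while a canary is only noticed when someone checks it later.

```c
#define _DEFAULT_SOURCE /* for MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define CANARY_BYTES 16
#define CANARY_VALUE 0xA5

typedef struct {
    unsigned char *base;  /* start of the mapping (the guard page) */
    unsigned char *stack; /* lowest usable stack address */
    size_t stack_size;    /* usable bytes, rounded up to whole pages */
    size_t map_size;      /* guard page + stack */
} guarded_stack_t;

/* Returns 0 on success, -1 on failure. */
int guarded_stack_create(guarded_stack_t *s, size_t stack_size)
{
    assert(stack_size > 0);
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    stack_size = (stack_size + page - 1) / page * page;
    size_t map_size = stack_size + page;

    void *p = mmap(NULL, map_size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;
    /* Immediate detection: any access to the lowest page now faults. */
    if (mprotect(p, page, PROT_NONE) != 0) {
        munmap(p, map_size);
        return -1;
    }
    s->base = p;
    s->stack = (unsigned char *)p + page;
    s->stack_size = stack_size;
    s->map_size = map_size;
    /* Lazy detection: a known pattern at the bottom of the stack. */
    memset(s->stack, CANARY_VALUE, CANARY_BYTES);
    return 0;
}

/* Returns 1 if the canary is untouched, 0 if something overwrote it. */
int guarded_stack_canary_intact(const guarded_stack_t *s)
{
    for (size_t i = 0; i < CANARY_BYTES; i++) {
        if (s->stack[i] != CANARY_VALUE)
            return 0;
    }
    return 1;
}

void guarded_stack_free(guarded_stack_t *s)
{
    munmap(s->base, s->map_size);
    s->base = s->stack = NULL;
}
```

The guard page costs one extra page and a system call per stack but pinpoints the faulting instruction; the canary is cheap but reports the damage only after the fact, which matches the "dynamic" versus "lazy" wording above.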

shintaro-iwasaki avatar Jun 25 '21 20:06 shintaro-iwasaki

Margo sets ABT_THREAD_STACKSIZE to 2097152 by default, so I doubt that's the issue, but I could be wrong.

mdorier avatar Jun 25 '21 23:06 mdorier

@mdorier Thank you. I will check the memory pool implementation again.

shintaro-iwasaki avatar Jun 25 '21 23:06 shintaro-iwasaki

Speaking specifically to my issue: I am initializing Argobots outside of Margo, so Margo doesn't have the opportunity to change the value of ABT_THREAD_STACKSIZE. When I increase the stack size as suggested, the error (and a number of other hard-to-track-down errors) disappears. This appears to have been the problem. Thank you.
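For applications that initialize Argobots themselves, one option is to set the variable programmatically before calling ABT_init(). The helper below is a hypothetical sketch (the name `set_default_ult_stacksize` is not part of any library); it assumes Argobots reads ABT_THREAD_STACKSIZE from the environment during initialization and leaves any value the user already exported untouched.

```c
#define _POSIX_C_SOURCE 200112L /* for setenv */
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: export ABT_THREAD_STACKSIZE=<bytes> unless the
 * variable is already set.  Returns 0 on success, -1 on failure.
 * Intended to be called before ABT_init(). */
int set_default_ult_stacksize(size_t bytes)
{
    char buf[32];
    assert(bytes > 0);
    snprintf(buf, sizeof(buf), "%zu", bytes);
    /* overwrite = 0: an existing user setting wins. */
    return setenv("ABT_THREAD_STACKSIZE", buf, 0);
}
```

For example, calling `set_default_ult_stacksize(2097152)` at the top of main() would mirror the default that Margo is reported to use in this thread.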

philip-davis avatar Jun 25 '21 23:06 philip-davis

Thank you very much for the update. Based on your comments, I updated to the tip of Argobots (https://github.com/pmodels/argobots/commit/2202510f2bd4ba732a2ba2215171c0820320f58d), built with --enable-debug=most, and set ABT_STACK_OVERFLOW_CHECK=mprotect. This converted the general instability I was seeing into a constant, reproducible segfault, which let me identify at least two areas of our code that require attention.

ashleypittman avatar Jun 28 '21 21:06 ashleypittman

I can confirm that once we ran a build with the ABT_STACK_OVERFLOW_CHECK=mprotect feature and fixed the two issues that caused segfaults with it enabled, we have not seen this assertion again; in fact, the system has been remarkably stable since. I think we can confirm that the problems we were seeing were the result of stack overflow, and I'd be happy to close this bug report now.

ashleypittman avatar Jul 19 '21 13:07 ashleypittman