
Problem with MPI-IO's atomic file I/O

Open clmendes opened this issue 5 years ago • 6 comments

I observed errors in the results for one of the ROMIO tests, atomicity.c, running under Unify. The original example is posted at

https://github.com/pmodels/mpich/blob/master/src/mpi/romio/test/atomicity.c

This test checks "whether atomicity semantics are satisfied for overlapping accesses in atomic mode."

The original program has two phases. I divided it into two programs, one for the first phase (which works fine) and one for the second phase (where I observe errors). This second phase is reproduced in the example below.

In the example, rank 0 first writes "0" to the entire file. Next, rank 0 writes all elements as "10" while the other ranks read the file and check the values read: each element is compared to the first element read. The example uses noncontiguous data, created with a new datatype and a corresponding file "view". Note that this example sets a non-default MPI_Info, which controls the sizes of ROMIO's internal read and write operations. Thus, although the main piece of the code has only one MPI_File_write (by rank 0) and one MPI_File_read (by every other rank), there are in fact many underlying write/read operations.
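For intuition, here is a minimal standalone sketch (my own illustration, not part of the test) of the file layout implied by the filetype built with MPI_Type_vector(BUFSIZE, 1, 2, MPI_INT, &newtype): buffer element i lands at byte offset 2*i*sizeof(int), leaving a one-int hole between consecutive elements.

#include <stdio.h>

int main(void)
{
    /* Illustration only: map the first few buffer elements to the file
     * offsets produced by a filetype of MPI_Type_vector(n, 1, 2, MPI_INT). */
    for (int i = 0; i < 5; i++)
        printf("writebuf[%d] -> file byte offset %zu\n",
               i, (size_t)(2 * i) * sizeof(int));
    return 0;
}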

Running this on 8 processors of the same node of Quartz@LLNL, I obtain this output:

Process 3: readbuf[139] is 0, should be 10
[3] At Exit, i=139
[0] At Exit, i=10000
Process 1: readbuf[139] is 0, should be 10
[1] At Exit, i=139
Process 2: readbuf[139] is 0, should be 10
[2] At Exit, i=139
Process 5: readbuf[9972] is 0, should be 10
[5] At Exit, i=9972
Process 4: readbuf[9972] is 0, should be 10
[4] At Exit, i=9972
Process 6: readbuf[9972] is 0, should be 10
[6] At Exit, i=9972
Process 7: readbuf[9972] is 0, should be 10
[7] At Exit, i=9972
Found 7 errors

Thus, all seven reader ranks detected an error at some point. Running the same program multiple times yields errors at different positions.

This is the test code:

#include "mpi.h"
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <unifyfs.h>

/* tests whether atomicity semantics are satisfied for overlapping accesses
   in atomic mode. The probability of detecting errors is higher if you run
   it on 8 or more processes. */

static void handle_error(int errcode, const char *str)
{
    char msg[MPI_MAX_ERROR_STRING];
    int resultlen;
    MPI_Error_string(errcode, msg, &resultlen);
    fprintf(stderr, "%s: %s\n", str, msg);
    MPI_Abort(MPI_COMM_WORLD, 1);
}

#define MPI_CHECK(fn) { int errcode; errcode = (fn); if (errcode != MPI_SUCCESS) handle_error(errcode, #fn); }


#define BUFSIZE 10000   /* no. of integers */
#define VERBOSE 0
int main(int argc, char **argv)
{
    int *writebuf, *readbuf, i, mynod, nprocs, len, err;
    int errs = 0, toterrs;
    MPI_Datatype newtype;
    MPI_File fh;
    MPI_Status status;
    MPI_Info info;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &mynod);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    err = unifyfs_mount("/unifyfs", mynod, nprocs, 0);
    if (err) {
        printf("[%d] unifyfs_mount failed (return = %d)\n", mynod, err);
        exit(-1);
    }

    writebuf = (int *) malloc(BUFSIZE * sizeof(int));
    readbuf = (int *) malloc(BUFSIZE * sizeof(int));

/* repeat the same test with a noncontiguous filetype */

    MPI_Type_vector(BUFSIZE, 1, 2, MPI_INT, &newtype);
    MPI_Type_commit(&newtype);

    MPI_Info_create(&info);
    /* I am setting these info values for testing purposes only. It is
     * better to use the default values in practice. */
    MPI_Info_set(info, "ind_rd_buffer_size", "1209");
    MPI_Info_set(info, "ind_wr_buffer_size", "1107");

    if (!mynod) {
        MPI_CHECK(MPI_File_open(MPI_COMM_SELF, "ufs:/unifyfs/ofile",
                                MPI_MODE_CREATE | MPI_MODE_RDWR, info, &fh));
        for (i = 0; i < BUFSIZE; i++)
            writebuf[i] = 0;
        MPI_CHECK(MPI_File_set_view(fh, 0, MPI_INT, newtype, "native", info));
        MPI_CHECK(MPI_File_write(fh, writebuf, BUFSIZE, MPI_INT, &status));
        MPI_File_close(&fh);
#if VERBOSE
        fprintf(stderr, "\ntesting noncontiguous accesses\n");
#endif
    }
    MPI_Barrier(MPI_COMM_WORLD);

    for (i = 0; i < BUFSIZE; i++)
        writebuf[i] = 10;
    for (i = 0; i < BUFSIZE; i++)
        readbuf[i] = 20;

    MPI_CHECK(MPI_File_open(MPI_COMM_WORLD, "ufs:/unifyfs/ofile", MPI_MODE_CREATE | MPI_MODE_RDWR, info, &fh));
    MPI_CHECK(MPI_File_set_atomicity(fh, 1));
    MPI_CHECK(MPI_File_set_view(fh, 0, MPI_INT, newtype, "native", info));
    MPI_Barrier(MPI_COMM_WORLD);

    if (!mynod) {
        MPI_CHECK(MPI_File_write(fh, writebuf, BUFSIZE, MPI_INT, &status));
    } else {
        err = MPI_File_read(fh, readbuf, BUFSIZE, MPI_INT, &status);
        if (err == MPI_SUCCESS) {
            if (readbuf[0] == 0) {
                for (i = 1; i < BUFSIZE; i++)
                    if (readbuf[i] != 0) {
                        errs++;
                        fprintf(stderr, "Process %d: readbuf[%d] is %d, should be 0\n", mynod, i,
                                readbuf[i]);
                        goto fn_exit;
                    }
            } else if (readbuf[0] == 10) {
                for (i = 1; i < BUFSIZE; i++)
                    if (readbuf[i] != 10) {
                        errs++;
                        fprintf(stderr, "Process %d: readbuf[%d] is %d, should be 10\n", mynod, i,
                                readbuf[i]);
                        goto fn_exit;
                    }
            } else {
                errs++;
                fprintf(stderr, "Process %d: readbuf[0] is %d, should be either 0 or 10\n", mynod,
                        readbuf[0]);
            }
        }
    }

    MPI_Type_free(&newtype);
    MPI_Info_free(&info);

  fn_exit:
    fprintf(stderr,"[%d] At Exit, i=%d\n",mynod,i);
    MPI_File_close(&fh);

    MPI_Barrier(MPI_COMM_WORLD);

    MPI_Allreduce(&errs, &toterrs, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    if (mynod == 0) {
        if (toterrs > 0) {
            fprintf(stderr, "Found %d errors\n", toterrs);
        } else {
            fprintf(stdout, " No Errors\n");
        }
    }
    free(writebuf);
    free(readbuf);

    unifyfs_unmount();
    MPI_Finalize();
    return 0;
}

clmendes avatar Jul 28 '20 03:07 clmendes

Just a quick note to observe that today I tested this program again after REMOVING the two MPI_Info_set calls, i.e. the info structure is no longer set, so each write/read operation covers the full buffer extent. In this situation the program does work correctly with Unify.

clmendes avatar Jul 29 '20 22:07 clmendes

I retested the code listed above today, with the current Unify version, keeping the MPI_Info settings as originally written in the code. Running it on 8 processors, I get no errors, both with WRITE_SYNC=1 and WRITE_SYNC=0. This is the output observed:

[4] At Exit, i=10000
[5] At Exit, i=10000
[6] At Exit, i=10000
[1] At Exit, i=10000
[3] At Exit, i=10000
[2] At Exit, i=10000
[7] At Exit, i=10000
[0] At Exit, i=10000

I also noted in the client log that most write operations have size 1107 bytes:

@ unifyfs_fid_logio_write() [unifyfs-fixed.c:326] fid=0 pos=69803 - successful logio_write() @ log offset=8912896 (1107 bytes)

Similarly, read requests have length 1209 bytes:

@ unifyfs_gfid_read_reqs() [unifyfs.c:1097] read: offset:4840, len:1209

These values are consistent with the MPI_Info settings in the source code. Thus, I believe the atomic mode is working correctly even when the reads/writes are sliced according to the MPI_Info hints.
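As a rough sanity check (back-of-the-envelope arithmetic of my own, not output from UnifyFS or ROMIO), the single 10000-int write spans a strided file region of (2*10000 - 1)*4 = 79996 bytes; walking that extent in hint-sized slices predicts on the order of 73 writes of 1107 bytes and 67 reads of 1209 bytes, which matches the pattern seen in the client log.

#include <stdio.h>

int main(void)
{
    /* Estimate only: assumes ROMIO's data sieving walks the full
     * strided extent in slices of the hinted buffer sizes. */
    long extent = (2L * 10000 - 1) * sizeof(int);  /* 79996-byte file span */
    long wr = 1107, rd = 1209;                     /* the MPI_Info hints   */

    printf("extent = %ld bytes\n", extent);
    printf("~%ld sieved writes of %ld bytes\n", (extent + wr - 1) / wr, wr);
    printf("~%ld sieved reads  of %ld bytes\n", (extent + rd - 1) / rd, rd);
    return 0;
}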

clmendes avatar Dec 07 '20 05:12 clmendes

And for the sake of completeness, I also reran the original ROMIO test (atomicity.c) under Unify, using 8 processors on 2 Catalyst nodes (i.e. 4 processors/node), and it ran fine. No errors were found, again with the same MPI_Info settings as before.

clmendes avatar Dec 07 '20 05:12 clmendes

Linking this related code segment: https://github.com/LLNL/UnifyFS/issues/560#issuecomment-723272035

adammoody avatar Dec 07 '20 19:12 adammoody

Today I ran the test code listed above on Lassen (IBM), using eight processors on a single node and a smaller dataset (BUFSIZE=100), with the MPI_Info settings commented out, and the execution failed! This is part of what is seen on stderr:

[0] At Exit, i=100
Process 7: readbuf[1] is 10, should be 0
[7] At Exit, i=1
[5] At Exit, i=100
[6] At Exit, i=100
Process 2: readbuf[1] is 10, should be 0
[2] At Exit, i=1
[3] At Exit, i=100
Process 4: readbuf[1] is 10, should be 0
[4] At Exit, i=1
[1] At Exit, i=100
Found 3 errors

For this test, I used a Unify version based on PR#619 and the non-optimized Argobots build. The files for this execution are under directory /usr/workspace/scr/mendes3/LASSEN/ROMIO-TESTS/UNIFY/Atomicity

Thus, the behavior on Lassen is worse than it was on Catalyst, where at least the program ran correctly when no MPI_Info settings were applied.

clmendes avatar Sep 02 '21 03:09 clmendes

Thanks, @clmendes . I believe ROMIO may require file locking with fcntl to support atomicity. Since UnifyFS doesn't yet support file locking with fcntl, the ROMIO atomicity support is probably not valid. In theory, one could implement some other synchronization method in ROMIO, but that would be a lot of work. It would be better to add some basic file locking with fcntl in UnifyFS.
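For reference, this is a minimal sketch of the kind of POSIX byte-range locking ROMIO relies on for atomic mode. It is illustrative only; UnifyFS does not currently implement this.

#include <fcntl.h>
#include <unistd.h>

/* Acquire or release a byte-range lock, blocking until granted.
 * type is F_RDLCK, F_WRLCK, or F_UNLCK. */
static int lock_range(int fd, short type, off_t start, off_t len)
{
    struct flock fl;
    fl.l_type = type;
    fl.l_whence = SEEK_SET;
    fl.l_start = start;
    fl.l_len = len;
    return fcntl(fd, F_SETLKW, &fl);
}

/* Typical atomic-mode pattern:
 *   lock_range(fd, F_WRLCK, offset, nbytes);
 *   ... read-modify-write the region ...
 *   lock_range(fd, F_UNLCK, offset, nbytes);
 */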

For the short term, it might be nice to at least have wrappers that just return an error to signal that fcntl is not working. That would help us identify apps that require it, assuming they are checking and flagging errors. We could then have an option that lets someone opt to keep going so fcntl returns success even though it's broken.
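A hypothetical sketch of that short-term wrapper (the function name and environment variable below are made up for illustration, not actual UnifyFS API): fail lock commands with ENOTSUP by default so apps that check errors notice, with an opt-in to pretend success.

#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>

/* Hypothetical wrapper: intercept fcntl() lock commands on UnifyFS files. */
int unifyfs_wrap_fcntl(int fd, int cmd, struct flock *fl)
{
    if (cmd == F_GETLK || cmd == F_SETLK || cmd == F_SETLKW) {
        if (getenv("UNIFYFS_FAKE_LOCKS") != NULL)
            return 0;       /* opt-in: report success even though no lock is taken */
        errno = ENOTSUP;    /* default: make it obvious locking is unsupported */
        return -1;
    }
    /* Simplified: non-locking commands pass through to the real fcntl. */
    return fcntl(fd, cmd, fl);
}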

adammoody avatar Sep 13 '21 20:09 adammoody