
Problem with MPI-IO's shared file pointer

Open clmendes opened this issue 5 years ago • 8 comments

@kathrynmohror and @adammoody , I've observed a problem when testing another of the ROMIO tests. This test addresses the use of shared file pointers. The original program, shared_fp.c, is posted at

https://github.com/pmodels/mpich/blob/master/src/mpi/romio/test/shared_fp.c

In this program, a shared file pointer is used, and each MPI rank writes a buffer of 4,096 bytes (1,024 ints) to a common file using the function MPI_File_write_shared. With that function, the order in which the writes are serialized is not deterministic.

The integer data buffer written to the file is set with the following scheme, where COUNT is 1,024:

for (i = 0; i < COUNT; i++) buf[i] = COUNT * myrank + i;

I created a reproducer of the problem with the program posted below, which can be used with plain MPI or with UNIFY. With plain MPI, running with 2 processors generates a file of size 8,192 bytes, and running with 4 processors generates a file of size 16,384 bytes. The contents of the file are consistent with the generation scheme above, although the order inside the file varies. Running with Unify, I always get a file of size 4,096 bytes, regardless of the number of processors used. So, I suspect that Unify is having trouble handling the shared file pointer correctly.

Test code:

#include "mpi.h"
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

#ifdef UNIFY
#include <unifyfs.h>
char *filename = "ufs:/unifyfs/datafile-u";
int ret;
#else
char *filename="datafile-m";
#endif

#define COUNT 1024

/* tests shared file pointer functions */

int main(int argc, char **argv)
{
    int *buf, i, myrank, nranks, len, sum, global_sum;
    int errs = 0, toterrs, errcode;
    MPI_File fh;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);
#ifdef UNIFY
    ret = unifyfs_mount("/unifyfs", myrank, nranks, 0);
    if (ret) {
        printf("[%d] unifyfs_mount failed (return = %d)\n", myrank, ret);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
#endif

    buf = (int *) malloc(COUNT * sizeof(int));

    for (i = 0; i < COUNT; i++) buf[i] = COUNT * myrank + i;

    errcode = MPI_File_open(MPI_COMM_WORLD, filename,
                            MPI_MODE_CREATE | MPI_MODE_RDWR, MPI_INFO_NULL, &fh);
    if (errcode != MPI_SUCCESS) {
        fprintf(stderr, "[%d] Error on MPI_File_open: %d\n",myrank,errcode);
        errs++;
    }

    errcode = MPI_File_write_shared(fh, buf, COUNT, MPI_INT, &status);
    if (errcode != MPI_SUCCESS) {
        fprintf(stderr, "[%d] Error on MPI_File_write_shared: %d\n",myrank,errcode);
        errs++;
    }

    for (i = 0; i < COUNT; i++) buf[i] = 0;
    MPI_Barrier(MPI_COMM_WORLD);

    errcode = MPI_File_seek_shared(fh, 0, MPI_SEEK_SET);
    if (errcode != MPI_SUCCESS) {
        fprintf(stderr, "[%d] Error on MPI_File_seek_shared: %d\n",myrank,errcode);
        errs++;
    }

    errcode = MPI_File_read_shared(fh, buf, COUNT, MPI_INT, &status);
    if (errcode != MPI_SUCCESS) {
        fprintf(stderr, "[%d] Error on MPI_File_read_shared: %d\n",myrank,errcode);
        errs++;
    }

    MPI_File_close(&fh);

    sum = 0;
    for (i = 0; i < COUNT; i++) sum += buf[i];

    MPI_Allreduce(&sum, &global_sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    if (global_sum != (((COUNT * nranks - 1) * (COUNT * nranks)) / 2)) {
        errs++;
        fprintf(stderr, "[%d] Error: sum %d, global_sum %d, %d\n", myrank,
                sum, global_sum, (((COUNT * nranks - 1) * (COUNT * nranks)) / 2));
    }
    free(buf);

    MPI_Allreduce(&errs, &toterrs, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    if (myrank == 0) {
        if (toterrs > 0) {
            fprintf(stderr, "Found %d errors\n", toterrs);
        } else {
            fprintf(stdout, " No Errors\n");
        }
    }
#ifdef UNIFY
    unifyfs_unmount();
#endif
    MPI_Finalize();
    return 0;
}

clmendes avatar Aug 03 '20 01:08 clmendes

I retested the original ROMIO example (shared_fp.c) today with the new Unify, and observed the same error.

Even using WRITE_SYNC=1, the resulting file has only 4096 bytes, and in the logs of the clients I only see one write of 4096 bytes:

unifyfs_fid_logio_write() [unifyfs-fixed.c:326] fid=0 pos=0 - successful logio_write() @ log offset=65536 (4096 bytes)

Meanwhile, in the original non-Unify execution, which generates a file of 16,384 bytes, I used Recorder and got in the traces the following Write records:

Rank=0:
30.039525 30.039527 0 lseek ( 15 4096 0 )
30.040384 30.040396 0 write ( 15 0x1ebea60 4096 )

Rank=1:
0.038211 0.038224 0 write ( 13 0x77cb30 4096 )

Rank=2:
90.060523 90.060527 0 lseek ( 15 12288 0 )
90.061027 90.061039 0 write ( 15 0x1ec4b70 4096 )

Rank=3:
60.060279 60.060281 0 lseek ( 13 8192 0 )
60.060905 60.060920 0 write ( 13 0x7768a0 4096 )

Thus, the resulting layout of the data in the produced file (ofile) for this non-Unify execution is:

offset 0: Rank=1
offset 4096: Rank=0
offset 8192: Rank=3
offset 12288: Rank=2

The code writes to the file using MPI_File_write_shared(), for which the documentation says: "The order of serialization is not deterministic". Hence, the non-Unify execution seems to be valid.

With Unify, for some reason, only one of the ranks issues the Write.

The code has this sequence of calls: MPI_File_open(...), MPI_File_write_shared(...), MPI_Barrier(...).

In the trace of Rank=0 from Recorder, I see this sequence:

0.010607 0.013768 0 MPI_File_open ( MPI_COMM_WORLD /g/g12/mendes3/MPI/ROMIO-TESTS/MPI/Shared_fp/ofile 9 0x7fffffffb0b8 0x7fffffffb928 )
0.011217 0.011222 0 __lxstat ( 1 ofile 0x7fffffffaed0 )
0.011765 0.011790 -1 open ( /etc/romio-hints 0 )
0.011808 0.011811 0 read ( -1 0x1ec2130 4096 )
0.012019 0.012022 0 umask ( 18 )
0.012023 0.012024 0 umask ( 63 )
0.012025 0.012897 15 open ( /g/g12/mendes3/MPI/ROMIO-TESTS/MPI/Shared_fp/ofile 66 393 )
0.013797 30.042158 0 MPI_File_write_shared ( 0x1ebe8e0 0x1ebea60 1024 MPI_INT [0_0_0] )
0.013859 0.013859 0 umask ( 18 )
0.013861 0.013862 0 umask ( 63 )
0.013864 0.014792 16 open ( /g/g12/mendes3/MPI/ROMIO-TESTS/MPI/Shared_fp/.ofile.shfp.128836.1779196920 66 393 )
0.014827 30.035925 0 fcntl ( 16 7 1 )
30.035933 30.037377 0 read ( 16 0x7fffffffb018 8 )
30.037384 30.037389 0 lseek ( 16 0 0 )
30.037390 30.037397 0 write ( 16 0x7fffffffaf80 8 )
30.037400 30.039516 0 fcntl ( 16 6 2 )
30.039525 30.039527 0 lseek ( 15 4096 0 )
30.039535 30.040376 0 fcntl ( 15 7 1 )
30.040384 30.040396 0 write ( 15 0x1ebea60 4096 )
30.040400 30.042131 0 fcntl ( 15 6 2 )
30.042166 90.062487 0 MPI_Barrier ( MPI_COMM_WORLD )

The operations with fid=15 are the ones related to the datafile ofile. Rank=0 does write its 4096 bytes at position 4096 (according to the prior lseek ( 15 4096 0 )). The behavior for the other ranks is similar, with different lseek arguments as expected (e.g. Rank=1 seeks to pos=0).

The operations above with fid=16 seem to implement some kind of shared-file-pointer bookkeeping in a hidden helper file. In the Unify client logs I do see things like

unifyfs_fid_open() [unifyfs.c:1918] Creating a new entry for /unifyfs/.ofile.shfp.133385.1273923192
. . .
@ unifyfs_fid_open() [unifyfs.c:1939] Failed to populate the global meta entry for /unifyfs/.ofile.shfp.133385.1273923192 (fid:1)

Hence, at this point it seems that Unify causes an error with this "shared" file.

clmendes avatar Dec 06 '20 23:12 clmendes

Nice investigation work. I haven't checked the code path, but I'm guessing those fcntl calls from the recorder trace are trying to lock/unlock access to the files. Since we don't have file locking, we'd not be able to support shared file pointers.

We should also dig into why we're seeing the "failed to populate entry" error. That could be a separate issue from file locking.

adammoody avatar Dec 07 '20 19:12 adammoody

Tracing the code a bit for the "failed to populate entry" error, I can see we're returning errno=EEXIST when trying to insert the inode into the tree. It looks to me like ROMIO opens this file using MPI_COMM_SELF instead of the communicator. That means each process is calling open to create the file. I think we are then exposing a race condition in UnifyFS when multiple procs try to create the same file. Working on a patch...

adammoody avatar Dec 07 '20 23:12 adammoody

@clmendes , here is the PR with a fix for the "failed to populate entry" error: https://github.com/LLNL/UnifyFS/pull/587

We still won't support shared file pointers in general, since those require file locking. However, this fix should get us past that other error. It's worth trying the test again in case it turns up something else.

adammoody avatar Dec 08 '20 00:12 adammoody

For what it may be worth, these are the full traces that I obtained with Recorder in the non-Unify execution with 4 processors on 2 nodes of Catalyst (i.e. 2 processors/node).

Rank=0:

0.010503 0.010504 0 MPI_Comm_rank ( MPI_COMM_WORLD 0 )
0.010515 0.010557 0 MPI_Bcast ( 0x7fffffffb924 1 MPI_INT 0 MPI_COMM_WORLD )
0.010581 0.010591 0 MPI_Bcast ( 0x1ebe6c0 15 MPI_CHAR 0 MPI_COMM_WORLD )
0.010597 0.010597 0 MPI_Comm_rank ( MPI_COMM_WORLD 0 )
0.010601 0.010601 0 MPI_Comm_size ( MPI_COMM_WORLD 4 )
0.010607 0.013768 0 MPI_File_open ( MPI_COMM_WORLD /g/g12/mendes3/MPI/ROMIO-TESTS/MPI/Shared_fp/ofile 9 0x7fffffffb0b8 0x7fffffffb928 )
0.011217 0.011222 0 __lxstat ( 1 ofile 0x7fffffffaed0 )
0.011765 0.011790 -1 open ( /etc/romio-hints 0 )
0.011808 0.011811 0 read ( -1 0x1ec2130 4096 )
0.012019 0.012022 0 umask ( 18 )
0.012023 0.012024 0 umask ( 63 )
0.012025 0.012897 15 open ( /g/g12/mendes3/MPI/ROMIO-TESTS/MPI/Shared_fp/ofile 66 393 )
0.013797 30.042158 0 MPI_File_write_shared ( 0x1ebe8e0 0x1ebea60 1024 MPI_INT [0_0_0] )
0.013859 0.013859 0 umask ( 18 )
0.013861 0.013862 0 umask ( 63 )
0.013864 0.014792 16 open ( /g/g12/mendes3/MPI/ROMIO-TESTS/MPI/Shared_fp/.ofile.shfp.128836.1779196920 66 393 )
0.014827 30.035925 0 fcntl ( 16 7 1 )
30.035933 30.037377 0 read ( 16 0x7fffffffb018 8 )
30.037384 30.037389 0 lseek ( 16 0 0 )
30.037390 30.037397 0 write ( 16 0x7fffffffaf80 8 )
30.037400 30.039516 0 fcntl ( 16 6 2 )
30.039525 30.039527 0 lseek ( 15 4096 0 )
30.039535 30.040376 0 fcntl ( 15 7 1 )
30.040384 30.040396 0 write ( 15 0x1ebea60 4096 )
30.040400 30.042131 0 fcntl ( 15 6 2 )
30.042166 90.062487 0 MPI_Barrier ( MPI_COMM_WORLD )
90.062510 90.065028 0 MPI_File_seek_shared ( 0x1ebe8e0 0 MPI_SEEK_SET )
90.062541 90.063935 0 fcntl ( 16 7 1 )
90.063945 90.063946 0 lseek ( 16 0 0 )
90.063949 90.063956 0 write ( 16 0x7fffffffafe8 8 )
90.063958 90.065006 0 fcntl ( 16 6 2 )
90.065037 120.070289 0 MPI_File_read_shared ( 0x1ebe8e0 0x1ebea60 1024 MPI_INT [0_0_0] )
90.065040 120.065819 0 fcntl ( 16 7 1 )
120.065826 120.065829 0 lseek ( 16 0 0 )
120.065830 120.066949 0 read ( 16 0x7fffffffb020 8 )
120.066956 120.066958 0 lseek ( 16 0 0 )
120.066961 120.066968 0 write ( 16 0x7fffffffaf70 8 )
120.066970 120.068078 0 fcntl ( 16 6 2 )
120.068090 120.068094 0 lseek ( 15 4096 0 )
120.068096 120.068630 0 fcntl ( 15 7 0 )
120.068637 120.069761 0 read ( 15 0x1ebea60 4096 )
120.069768 120.070281 0 fcntl ( 15 6 2 )
120.070302 180.074466 0 MPI_File_close ( 0x7fffffffb928 )
180.071178 180.071185 0 close ( 16 )
180.071199 180.072232 0 unlink ( .ofile.shfp.128836.1779196920 )
180.074440 180.074445 0 close ( 15 )
180.074477 180.075012 0 MPI_Allreduce ( 0x7fffffffb940 0x7fffffffb944 1 MPI_INT 1476395011 MPI_COMM_WORLD )
180.075021 180.075037 0 MPI_Allreduce ( 0x7fffffffb948 0x7fffffffb938 1 MPI_INT 1476395011 MPI_COMM_WORLD )

Rank=1:

0.003702 0.003702 0 MPI_Comm_rank ( MPI_COMM_WORLD 1 )
0.003716 0.010525 0 MPI_Bcast ( 0x7fffffffb924 1 MPI_INT 0 MPI_COMM_WORLD )
0.010549 0.010560 0 MPI_Bcast ( 0x77c940 15 MPI_CHAR 0 MPI_COMM_WORLD )
0.010567 0.010567 0 MPI_Comm_rank ( MPI_COMM_WORLD 1 )
0.010571 0.010571 0 MPI_Comm_size ( MPI_COMM_WORLD 4 )
0.010576 0.013734 0 MPI_File_open ( MPI_COMM_WORLD /g/g12/mendes3/MPI/ROMIO-TESTS/MPI/Shared_fp/ofile 9 0x7fffffffb0b8 0x7fffffffb928 )
0.011166 0.011171 0 __lxstat ( 1 ofile 0x7fffffffaed0 )
0.011985 0.011987 0 umask ( 18 )
0.011989 0.011990 0 umask ( 63 )
0.011991 0.013669 13 open ( /g/g12/mendes3/MPI/ROMIO-TESTS/MPI/Shared_fp/ofile 66 393 )
0.013758 0.039563 0 MPI_File_write_shared ( 0x77eff0 0x77cb30 1024 MPI_INT [0_0_0] )
0.013828 0.013830 0 umask ( 18 )
0.013831 0.013832 0 umask ( 63 )
0.013833 0.015742 14 open ( /g/g12/mendes3/MPI/ROMIO-TESTS/MPI/Shared_fp/.ofile.shfp.128836.1779196920 66 393 )
0.015777 0.035765 0 fcntl ( 14 7 1 )
0.035773 0.036170 0 read ( 14 0x7fffffffb018 8 )
0.036182 0.036185 0 lseek ( 14 0 0 )
0.036188 0.036199 0 write ( 14 0x7fffffffaf80 8 )
0.036201 0.037567 0 fcntl ( 14 6 2 )
0.037580 0.038204 0 fcntl ( 13 7 1 )
0.038211 0.038224 0 write ( 13 0x77cb30 4096 )
0.038227 0.039545 0 fcntl ( 13 6 2 )
0.039569 90.062464 0 MPI_Barrier ( MPI_COMM_WORLD )
90.062484 90.064990 0 MPI_File_seek_shared ( 0x77eff0 0 MPI_SEEK_SET )
90.065001 90.070792 0 MPI_File_read_shared ( 0x77eff0 0x77cb30 1024 MPI_INT [0_0_0] )
90.065003 90.065457 0 fcntl ( 14 7 1 )
90.065462 90.065464 0 lseek ( 14 0 0 )
90.065466 90.066667 0 read ( 14 0x7fffffffb020 8 )
90.066695 90.066697 0 lseek ( 14 0 0 )
90.066699 90.066717 0 write ( 14 0x7fffffffaf70 8 )
90.066720 90.068186 0 fcntl ( 14 6 2 )
90.068195 90.068197 0 lseek ( 13 0 0 )
90.068200 90.068794 0 fcntl ( 13 7 0 )
90.068798 90.070194 0 read ( 13 0x77cb30 4096 )
90.070198 90.070771 0 fcntl ( 13 6 2 )
90.070801 180.074944 0 MPI_File_close ( 0x7fffffffb928 )
180.071155 180.071431 0 close ( 14 )
180.071447 180.072677 0 unlink ( .ofile.shfp.128836.1779196920 )
180.074521 180.074914 0 close ( 13 )
180.074955 180.074981 0 MPI_Allreduce ( 0x7fffffffb940 0x7fffffffb944 1 MPI_INT 1476395011 MPI_COMM_WORLD )
180.074989 180.075003 0 MPI_Allreduce ( 0x7fffffffb948 0x7fffffffb938 1 MPI_INT 1476395011 MPI_COMM_WORLD )

Rank=2:

0.006280 0.006280 0 MPI_Comm_rank ( MPI_COMM_WORLD 2 )
0.006293 0.010655 0 MPI_Bcast ( 0x7fffffffb924 1 MPI_INT 0 MPI_COMM_WORLD )
0.010680 0.010698 0 MPI_Bcast ( 0x1ec4980 15 MPI_CHAR 0 MPI_COMM_WORLD )
0.010704 0.010704 0 MPI_Comm_rank ( MPI_COMM_WORLD 2 )
0.010708 0.010708 0 MPI_Comm_size ( MPI_COMM_WORLD 4 )
0.010714 0.013859 0 MPI_File_open ( MPI_COMM_WORLD /g/g12/mendes3/MPI/ROMIO-TESTS/MPI/Shared_fp/ofile 9 0x7fffffffb0b8 0x7fffffffb928 )
0.011317 0.011321 0 __lxstat ( 1 ofile 0x7fffffffaed0 )
0.012111 0.012114 0 umask ( 18 )
0.012115 0.012116 0 umask ( 63 )
0.012118 0.013691 15 open ( /g/g12/mendes3/MPI/ROMIO-TESTS/MPI/Shared_fp/ofile 66 393 )
0.013885 90.062527 0 MPI_File_write_shared ( 0x1ec4a30 0x1ec4b70 1024 MPI_INT [0_0_0] )
0.013972 0.013972 0 umask ( 18 )
0.013974 0.013976 0 umask ( 63 )
0.013977 0.015599 16 open ( /g/g12/mendes3/MPI/ROMIO-TESTS/MPI/Shared_fp/.ofile.shfp.128836.1779196920 66 393 )
0.015635 90.057967 0 fcntl ( 16 7 1 )
90.057976 90.059138 0 read ( 16 0x7fffffffb018 8 )
90.059144 90.059148 0 lseek ( 16 0 0 )
90.059149 90.059168 0 write ( 16 0x7fffffffaf80 8 )
90.059170 90.060515 0 fcntl ( 16 6 2 )
90.060523 90.060527 0 lseek ( 15 12288 0 )
90.060534 90.061019 0 fcntl ( 15 7 1 )
90.061027 90.061039 0 write ( 15 0x1ec4b70 4096 )
90.061043 90.062510 0 fcntl ( 15 6 2 )
90.062535 90.062555 0 MPI_Barrier ( MPI_COMM_WORLD )
90.062560 90.065120 0 MPI_File_seek_shared ( 0x1ec4a30 0 MPI_SEEK_SET )
90.065131 180.071212 0 MPI_File_read_shared ( 0x1ec4a30 0x1ec4b70 1024 MPI_INT [0_0_0] )
90.065134 180.067224 0 fcntl ( 16 7 1 )
180.067230 180.067234 0 lseek ( 16 0 0 )
180.067234 180.068296 0 read ( 16 0x7fffffffb020 8 )
180.068303 180.068305 0 lseek ( 16 0 0 )
180.068307 180.068315 0 write ( 16 0x7fffffffaf70 8 )
180.068319 180.069389 0 fcntl ( 16 6 2 )
180.069401 180.069403 0 lseek ( 15 12288 0 )
180.069406 180.069813 0 fcntl ( 15 7 0 )
180.069820 180.070810 0 read ( 15 0x1ec4b70 4096 )
180.070816 180.071203 0 fcntl ( 15 6 2 )
180.071220 180.074255 0 MPI_File_close ( 0x7fffffffb928 )
180.071244 180.071253 0 close ( 16 )
180.071256 180.072582 0 unlink ( .ofile.shfp.128836.1779196920 )
180.074227 180.074235 0 close ( 15 )
180.074264 180.075110 0 MPI_Allreduce ( 0x7fffffffb940 0x7fffffffb944 1 MPI_INT 1476395011 MPI_COMM_WORLD )
180.075119 180.075130 0 MPI_Allreduce ( 0x7fffffffb948 0x7fffffffb938 1 MPI_INT 1476395011 MPI_COMM_WORLD )

Rank=3:

0.005211 0.005212 0 MPI_Comm_rank ( MPI_COMM_WORLD 3 )
0.005223 0.010636 0 MPI_Bcast ( 0x7fffffffb924 1 MPI_INT 0 MPI_COMM_WORLD )
0.010660 0.010679 0 MPI_Bcast ( 0x7766b0 15 MPI_CHAR 0 MPI_COMM_WORLD )
0.010686 0.010686 0 MPI_Comm_rank ( MPI_COMM_WORLD 3 )
0.010689 0.010690 0 MPI_Comm_size ( MPI_COMM_WORLD 4 )
0.010695 0.013837 0 MPI_File_open ( MPI_COMM_WORLD /g/g12/mendes3/MPI/ROMIO-TESTS/MPI/Shared_fp/ofile 9 0x7fffffffb0b8 0x7fffffffb928 )
0.011298 0.011301 0 __lxstat ( 1 ofile 0x7fffffffaed0 )
0.012089 0.012091 0 umask ( 18 )
0.012093 0.012094 0 umask ( 63 )
0.012095 0.012994 13 open ( /g/g12/mendes3/MPI/ROMIO-TESTS/MPI/Shared_fp/ofile 66 393 )
0.013863 60.062520 0 MPI_File_write_shared ( 0x778d60 0x7768a0 1024 MPI_INT [0_0_0] )
0.013944 0.013945 0 umask ( 18 )
0.013947 0.013948 0 umask ( 63 )
0.013949 0.014994 14 open ( /g/g12/mendes3/MPI/ROMIO-TESTS/MPI/Shared_fp/.ofile.shfp.128836.1779196920 66 393 )
0.015333 60.057618 0 fcntl ( 14 7 1 )
60.057628 60.058731 0 read ( 14 0x7fffffffb018 8 )
60.058743 60.058746 0 lseek ( 14 0 0 )
60.058749 60.058761 0 write ( 14 0x7fffffffaf80 8 )
60.058764 60.060272 0 fcntl ( 14 6 2 )
60.060279 60.060281 0 lseek ( 13 8192 0 )
60.060290 60.060898 0 fcntl ( 13 7 1 )
60.060905 60.060920 0 write ( 13 0x7768a0 4096 )
60.060923 60.062502 0 fcntl ( 13 6 2 )
60.062525 90.062551 0 MPI_Barrier ( MPI_COMM_WORLD )
90.062573 90.065096 0 MPI_File_seek_shared ( 0x778d60 0 MPI_SEEK_SET )
90.065107 150.072921 0 MPI_File_read_shared ( 0x778d60 0x7768a0 1024 MPI_INT [0_0_0] )
90.065109 150.066992 0 fcntl ( 14 7 1 )
150.067000 150.067003 0 lseek ( 14 0 0 )
150.067009 150.068421 0 read ( 14 0x7fffffffb020 8 )
150.068427 150.068430 0 lseek ( 14 0 0 )
150.068432 150.068451 0 write ( 14 0x7fffffffaf70 8 )
150.068454 150.070326 0 fcntl ( 14 6 2 )
150.070338 150.070341 0 lseek ( 13 8192 0 )
150.070344 150.070958 0 fcntl ( 13 7 0 )
150.070965 150.072272 0 read ( 13 0x7768a0 4096 )
150.072279 150.072897 0 fcntl ( 13 6 2 )
150.072934 180.074133 0 MPI_File_close ( 0x7fffffffb928 )
180.071248 180.071257 0 close ( 14 )
180.071269 180.072233 0 unlink ( .ofile.shfp.128836.1779196920 )
180.074107 180.074114 0 close ( 13 )
180.074151 180.075092 0 MPI_Allreduce ( 0x7fffffffb940 0x7fffffffb944 1 MPI_INT 1476395011 MPI_COMM_WORLD )
180.075101 180.075110 0 MPI_Allreduce ( 0x7fffffffb948 0x7fffffffb938 1 MPI_INT 1476395011 MPI_COMM_WORLD )

clmendes avatar Dec 08 '20 00:12 clmendes

@adammoody , I just tested this example again using UnifyFS based on PR#587. The behavior improved, but it's not correct yet. Now the resulting file has 8 KB (i.e. more than the 4 KB it had before, but less than the 16 KB that the non-Unify execution produces for a similar run with 4 processors).

I see now in the Unify client logs the following four records:

tid=34844 @ unifyfs_fid_logio_write() [unifyfs-fixed.c:326] fid=0 pos=0 - successful logio_write() @ log offset=65536 (4096 bytes)
tid=34843 @ unifyfs_fid_logio_write() [unifyfs-fixed.c:326] fid=0 pos=0 - successful logio_write() @ log offset=65536 (4096 bytes)
tid=133253 @ unifyfs_fid_logio_write() [unifyfs-fixed.c:326] fid=1 pos=0 - successful logio_write() @ log offset=65536 (4096 bytes)
tid=133252 @ unifyfs_fid_logio_write() [unifyfs-fixed.c:326] fid=0 pos=4096 - successful logio_write() @ log offset=65536 (4096 bytes)

Thus, three ranks try to write at position=0, while one writes at position=4096, so the file ends up with 8 KB. The execution does run through to the end, though.

When reading the data back, the four ranks try to read at the same position=0:

tid=34844 @ process_read_data() [unifyfs.c:676] processing data response from server: [0] (gfid=1182513714, offset=0, length=4096, errcode=0)
tid=133253 @ process_read_data() [unifyfs.c:676] processing data response from server: [0] (gfid=1182513714, offset=0, length=4096, errcode=0)
tid=34843 @ process_read_data() [unifyfs.c:676] processing data response from server: [0] (gfid=1182513714, offset=0, length=4096, errcode=0)
tid=133252 @ process_read_data() [unifyfs.c:676] processing data response from server: [0] (gfid=1182513714, offset=0, length=4096, errcode=0)

clmendes avatar Dec 08 '20 01:12 clmendes

Thanks, @clmendes . Yes, there were at least two different bugs that your test case was exposing.

The fix I added should help with a race condition we have when multiple procs open a file with O_CREAT at the same time. Based on what you're seeing now, I think we're past that one. I'd like to rewrite our open code at some point, though, so we may revisit this.

The remaining problem, which this doesn't address, is adding proper file locking.

adammoody avatar Dec 08 '20 05:12 adammoody

@adammoody , after discussing this with @CamStan earlier today, I reran this reproducer code on Lassen and got similar behavior, although with a slight difference in the observed error.

The test was done on Lassen under /usr/workspace/scr/mendes3/LASSEN/ROMIO-TESTS/UNIFY/Shared_fp/ISSUE459/ and the resulting data file (datafile-u) has 8 KB instead of the expected 16 KB. However, looking at the client logs from Unify, now two of the ranks wrote at offset zero and two wrote at offset 4K:

2021-06-09T17:09:59 tid=23124 @ unifyfs_fid_logio_write() [unifyfs-fixed.c:326] fid=0 pos=0 - successful logio_write() @ log offset=65536 (4096 bytes)
2021-06-09T17:09:59 tid=23125 @ unifyfs_fid_logio_write() [unifyfs-fixed.c:326] fid=0 pos=0 - successful logio_write() @ log offset=65536 (4096 bytes)
2021-06-09T17:09:59 tid=66137 @ unifyfs_fid_logio_write() [unifyfs-fixed.c:326] fid=1 pos=4096 - successful logio_write() @ log offset=65536 (4096 bytes)
2021-06-09T17:09:59 tid=66138 @ unifyfs_fid_logio_write() [unifyfs-fixed.c:326] fid=1 pos=4096 - successful logio_write() @ log offset=65536 (4096 bytes)

Thus, shared file pointers are still a problem with the current Unify version, as expected (given the lack of locking mechanisms), but the behavior is slightly different: in the past, three of the ranks would write at offset zero.

For reading, the behavior on Lassen was the same as in the past: all four ranks under Unify tried to read from offset zero in the file:

2021-06-09T17:09:59 tid=23124 @ debug_print_read_req() [client_read.c:25] read_req[0x7fffffff9fa0] gfid=1953131809, file offset=0, length=4096, buf=0x13ee85a0 - nread=4096, errcode=115 (Operation now in progress), byte coverage=[0,4095]
2021-06-09T17:09:59 tid=23125 @ debug_print_read_req() [client_read.c:25] read_req[0x7fffffff9fa0] gfid=1953131809, file offset=0, length=4096, buf=0x13ee8120 - nread=4096, errcode=115 (Operation now in progress), byte coverage=[0,4095]
2021-06-09T17:09:59 tid=66138 @ debug_print_read_req() [client_read.c:25] read_req[0x7fffffff9fa0] gfid=1953131809, file offset=0, length=4096, buf=0x13ee7ee0 - nread=4096, errcode=115 (Operation now in progress), byte coverage=[0,4095]
2021-06-09T17:09:59 tid=66137 @ debug_print_read_req() [client_read.c:25] read_req[0x7fffffff9fa0] gfid=1953131809, file offset=0, length=4096, buf=0x13ee7df0 - nread=4096, errcode=115 (Operation now in progress), byte coverage=[0,4095]

clmendes avatar Jun 10 '21 00:06 clmendes