Problem with MPI_File_preallocate
As requested by @adammoody, I created a test to document a problem with MPI_File_preallocate under Unify. This function is used in one of the ROMIO examples, misc.c, available at
https://github.com/pmodels/mpich/blob/master/src/mpi/romio/test/misc.c.in
However, to make the problem more explicit, I created a reproducer program, shown below. This program follows this sequence of MPI-IO calls (only major functions shown):
MPI_File_open
MPI_File_preallocate (size=8KB)
MPI_File_write_at (total=8KB)
MPI_File_close
After each of the first three calls, each rank checks the file size with 'MPI_File_get_size'. There are barriers between the various phases just to make sure the ranks are always in sync within the phases.
Running the program with 4 processors on a SINGLE node of Quartz under Unify, I obtain these sizes:
[3] After File_open: SIZE=0
[2] After File_open: SIZE=0
[0] After File_open: SIZE=0
[1] After File_open: SIZE=0
[3] After preallocate(8KB): SIZE=0
[2] After preallocate(8KB): SIZE=0
[1] After preallocate(8KB): SIZE=0
[0] After preallocate(8KB): SIZE=8192
[0] After 8KB write_at: SIZE=4096
[1] After 8KB write_at: SIZE=6144
[2] After 8KB write_at: SIZE=8192
[3] After 8KB write_at: SIZE=8192
Thus, it seems that only rank 0 sees the proper file size after the preallocate; ranks 1, 2 and 3 still report 0. After all ranks write to the file, the ranks also disagree on the resulting size: ranks 0 and 1 report 4096 and 6144, while only ranks 2 and 3 report the expected 8192.
Meanwhile, running WITHOUT Unify, also with 4 processors, I get the expected values, as follows:
[0] After File_open: SIZE=0
[1] After File_open: SIZE=0
[3] After File_open: SIZE=0
[2] After File_open: SIZE=0
[2] After preallocate(8KB): SIZE=8192
[0] After preallocate(8KB): SIZE=8192
[1] After preallocate(8KB): SIZE=8192
[3] After preallocate(8KB): SIZE=8192
[0] After 8KB write_at: SIZE=8192
[1] After 8KB write_at: SIZE=8192
[3] After 8KB write_at: SIZE=8192
[2] After 8KB write_at: SIZE=8192
The contents of the 8KB files produced with and without Unify are the same.
This is the reproducer test code:
#include "mpi.h"
#include <stdlib.h>
#include <stdio.h>
#include <unifyfs.h>

#define BUFLEN 2048

int main(int argc, char **argv)
{
    int buf[BUFLEN], myrank, i, nints, nranks, errcode=0;
    MPI_File fh;
    MPI_Status status;
    MPI_Offset offset,size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    errcode = unifyfs_mount("/unifyfs", myrank, nranks, 0);
    if (errcode) {
        printf("[%d] unifyfs_mount failed (return = %d)\n", myrank, errcode);
        exit(-1);
    }

    /* Fill the buffer with this rank's id */
    for (i=0; i<BUFLEN; i++) buf[i]=myrank;

    /* Open file */
    errcode = MPI_File_open(MPI_COMM_WORLD, "ufs:/unifyfs/ofile",
                            MPI_MODE_CREATE | MPI_MODE_RDWR, MPI_INFO_NULL, &fh);
    if (errcode != MPI_SUCCESS)
        fprintf(stderr,"[%d] MPI_File_open error: %d\n",myrank,errcode);
    MPI_File_get_size(fh,&size);
    /* MPI_Offset is usually wider than int, so cast and print with %lld */
    fprintf(stderr,"[%d] After File_open: SIZE=%lld\n",myrank,(long long)size);

    /* Sync ranks */
    MPI_Barrier(MPI_COMM_WORLD);

    /* Do preallocate of 8KB */
    errcode = MPI_File_preallocate(fh, 8192);
    if (errcode != MPI_SUCCESS)
        fprintf(stderr,"[%d] MPI_File_preallocate error: %d\n",myrank,errcode);

    /* Sync ranks */
    MPI_Barrier(MPI_COMM_WORLD);
    MPI_File_get_size(fh,&size);
    fprintf(stderr,"[%d] After preallocate(8KB): SIZE=%lld\n",myrank,(long long)size);

    /* Sync ranks */
    MPI_Barrier(MPI_COMM_WORLD);

    /* Do a write_at to the file, 8K bytes total */
    nints = BUFLEN/nranks;
    offset = myrank * BUFLEN;
    errcode = MPI_File_write_at(fh, offset, buf, nints, MPI_INT, &status);
    if (errcode != MPI_SUCCESS)
        fprintf(stderr,"[%d]: MPI_File_write_at error: %d\n",myrank,errcode);

    /* Sync ranks */
    MPI_Barrier(MPI_COMM_WORLD);
    MPI_File_get_size(fh,&size);
    fprintf(stderr,"[%d] After 8KB write_at: SIZE=%lld\n",myrank,(long long)size);

    /* Sync ranks */
    MPI_Barrier(MPI_COMM_WORLD);

    /* Close file */
    MPI_File_close(&fh);
    unifyfs_unmount();
    MPI_Finalize();
    return 0;
}
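As a sanity check on the write phase, the sketch below redoes the offset arithmetic of the reproducer in plain C, without MPI. The helper names (write_start, write_bytes, expected_size) are hypothetical and only mirror the expressions `nints = BUFLEN/nranks` and `offset = myrank * BUFLEN`. It assumes the default file view, in which the offset passed to MPI_File_write_at is counted in bytes.

```c
#include <assert.h>
#include <stddef.h>

#define BUFLEN 2048

/* Byte offset at which a rank starts writing (offset = myrank * BUFLEN). */
static long long write_start(int myrank)
{
    return (long long)myrank * BUFLEN;
}

/* Bytes written by each rank: BUFLEN/nranks ints (nints in the reproducer). */
static long long write_bytes(int nranks)
{
    assert(nranks > 0);
    return (long long)(BUFLEN / nranks) * (long long)sizeof(int);
}

/* File size implied by the writes: end of the highest rank's byte range. */
static long long expected_size(int nranks)
{
    return write_start(nranks - 1) + write_bytes(nranks);
}
```

With 4 ranks and a 4-byte int, each rank writes 2048 bytes starting at myrank * 2048, so the four ranges tile the file exactly and expected_size(4) is 8192, which is the value the non-Unify run reports. With other rank counts the ranges overlap or leave holes, so the reproducer is only meaningful when run with 4 ranks.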
@kathrynmohror, @adammoody: Today I reran the reproducer above under UnifyFS, using my Unify build based on PR#581. This time, with Unify, I obtained the following output:
[1] After File_open: SIZE=0
[0] After File_open: SIZE=0
[2] After File_open: SIZE=0
[3] After File_open: SIZE=0
[1] After preallocate(8KB): SIZE=8192
[0] After preallocate(8KB): SIZE=8192
[3] After preallocate(8KB): SIZE=8192
[2] After preallocate(8KB): SIZE=8192
[1] After 8KB write_at: SIZE=8192
[0] After 8KB write_at: SIZE=8192
[2] After 8KB write_at: SIZE=8192
[3] After 8KB write_at: SIZE=8192
Thus, it now produces the same output that the non-Unify execution had produced originally!
Also, I reran the original ROMIO example that was the root of this entire issue (misc.c). I ran it with and without Unify, using 4 processors in 2 nodes, and the resulting files now match precisely. Thus, I believe this issue has been resolved with the current Unify version!
Thanks @clmendes. Was this correct run done with WRITE_SYNC=1?
If so, what do you get with WRITE_SYNC=0?
And if WRITE_SYNC=0 gives the wrong result, let's also try a test with WRITE_SYNC=0 but where we also add MPI_File_sync calls. I think we'd need one after MPI_File_preallocate and before the barrier, and then another after MPI_File_write_at and before the barrier.
@adammoody, for the reproducer code posted here it makes no difference whether WRITE_SYNC is set or not: both versions produce the same output file.
However, for the original ROMIO test (misc.c), it does make a difference: with WRITE_SYNC=1 we obtain the same datafile as in the non-Unify execution, whereas without WRITE_SYNC the program goes to the end of execution but the produced datafile has wrong contents.
I reran this today, on LLNL's Lassen (IBM system), under Unify v.0.9.3, i.e. using a Unify build based on PR#619. The original MPI test code misc.c works correctly, produces the expected output, and the produced datafile is identical to the one produced without Unify. Also, the reproducer above works correctly as well, when I used WRITE_SYNC=1. Without that, the output was wrong.
The original code was run on Lassen at /usr/workspace/scr/mendes3/LASSEN/ROMIO-TESTS/UNIFY/Misc/ and the reproducer was run at /usr/workspace/scr/mendes3/LASSEN/ROMIO-TESTS/UNIFY/Misc/PREALLOC/
If I set WRITE_SYNC=0 but add the two MPI_File_sync calls as @adammoody suggested in his comment above, then the code works fine and produces the expected behavior, i.e. the output is:
[2] After File_open: SIZE=0
[3] After File_open: SIZE=0
[3] After preallocate(8KB): SIZE=8192
[2] After preallocate(8KB): SIZE=8192
[2] After 8KB write_at: SIZE=8192
[3] After 8KB write_at: SIZE=8192
[0] After File_open: SIZE=0
[1] After File_open: SIZE=0
[1] After preallocate(8KB): SIZE=8192
[0] After preallocate(8KB): SIZE=8192
[1] After 8KB write_at: SIZE=8192
[0] After 8KB write_at: SIZE=8192
This new version, with the MPI_File_sync calls, run with WRITE_SYNC=0, is under /usr/workspace/scr/mendes3/LASSEN/ROMIO-TESTS/UNIFY/Misc/PREALLOC/ADAM/