Problem with strided writes by multiple ranks
As promised to @kathrynmohror during the Unify call yesterday, I'm opening this issue to report a problem that I observed in one of the ROMIO tests. Opening this issue also allows @adammoody to check and comment on the problem.
I first observed the problem when porting one of ROMIO's tests (noncontig.c) to Unify. The test is posted at
https://github.com/pmodels/mpich/blob/master/src/mpi/romio/test/noncontig.c
That program actually contains three tests. The second test works correctly with Unify, but the first and third produce wrong results. Those two tests write to the file in an interleaved fashion, so I created a reproducer based on the third test, which I'm posting here. It must be run with exactly two processes. The test creates a strided datatype, and then both ranks write to the same file so that their data is interleaved in the file. This is how each rank defines the data to be written:
buf[i] = i + myrank * 5000 ;
Running WITHOUT Unify (i.e. plain MPI), the resulting file looks like this:
(0) 0 (1) 5000 (2) 1 (3) 5001 (4) 2 (5) 5002 . . .
However, running under Unify, I obtain a resulting file with contents like this:
(0) 0 (1) 5000 (2) -1 (3) 5001 (4) -1 (5) 5002 (6) -1 (7) 5003 . . .
If I change the code so that only one of the processes writes to the file, that process writes its part correctly (either rank=0 or rank=1). Thus, a strided write by a single process seems to work fine. Having the two ranks write, even when forcing them to write in a particular order, does NOT change the wrong result shown above. So the problem seems to arise when both ranks write to the file in the interleaved fashion.
This is the test code:
#include "mpi.h"
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#ifdef UNIFY
#include <unifyfs.h>
char *filename="ufs:/unifyfs/datafile-u" ;
#else
char *filename="datafile-m";
#endif
/* tests noncontiguous reads/writes using independent I/O */
#define SIZE 5000
#define VERBOSE 1
int main(int argc, char **argv)
{
int *buf, i, myrank, nranks, b[3], ret;
MPI_Aint d[3];
MPI_File fh;
MPI_Status status;
MPI_Datatype typevec, newtype, t[3];
MPI_Info info;
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &nranks);
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
#ifdef UNIFY
ret = unifyfs_mount("/unifyfs", myrank, nranks, 0);
if (ret) {
printf("[%d] unifyfs_mount failed (return = %d)\n", myrank, ret);
MPI_Abort(MPI_COMM_WORLD, 1);
}
#endif
if (nranks != 2) {
fprintf(stderr, "Run this program on two processes\n");
MPI_Abort(MPI_COMM_WORLD, 1);
}
buf = (int *) malloc(SIZE * sizeof(int));
MPI_Type_vector(SIZE / 2, 1, 2, MPI_INT, &typevec);
/* create a struct type with explicitly set LB and UB; displacements
* of typevec are such that the types for the two processes won't
* overlap.
*/
b[0] = b[1] = b[2] = 1;
d[0] = 0;
d[1] = myrank * sizeof(int);
d[2] = SIZE * sizeof(int);
t[0] = MPI_LB;
t[1] = typevec;
t[2] = MPI_UB;
/* keep the struct, ditch the vector */
MPI_Type_struct(3, b, d, t, &newtype);
MPI_Type_commit(&newtype);
MPI_Type_free(&typevec);
MPI_Info_create(&info);
/* I am setting these info values for testing purposes only. It is
* better to use the default values in practice. */
#if 0
MPI_Info_set(info, "ind_rd_buffer_size", "1209");
MPI_Info_set(info, "ind_wr_buffer_size", "1107");
#endif
#if VERBOSE
fprintf(stderr,
"\ntesting contiguous in memory, noncontiguous in file using independent I/O\n");
#endif
MPI_Barrier(MPI_COMM_WORLD);
MPI_File_open(MPI_COMM_WORLD, filename, MPI_MODE_CREATE | MPI_MODE_RDWR, info, &fh);
/* set the file view so that we have interleaved access again */
MPI_File_set_view(fh, 0, MPI_INT, newtype, "native", info);
/* this time write a contiguous buffer */
for (i = 0; i < SIZE; i++) buf[i] = i + myrank * SIZE ;
MPI_File_write(fh, buf, SIZE, MPI_INT, &status);
MPI_Barrier(MPI_COMM_WORLD);
sleep(2);
/* fill buffer with -99's; this time they will all be overwritten */
for (i = 0; i < SIZE; i++) buf[i] = -99;
MPI_File_read_at(fh, 0, buf, SIZE, MPI_INT, &status);
fprintf(stderr,"[%d] BUF READ: %d %d %d %d %d %d\n",myrank,buf[0],buf[1],buf[2],buf[3],buf[4],buf[5]);
MPI_File_close(&fh);
MPI_Barrier(MPI_COMM_WORLD);
MPI_Type_free(&newtype);
MPI_Info_free(&info);
free(buf);
#ifdef UNIFY
unifyfs_unmount();
#endif
MPI_Finalize();
return 0;
}
Thanks for the reproducer @clmendes . I took a quick look with totalview under a debug build of MPI. I'm just tracing some code paths without UnifyFS to see what MPI is doing. Again, we have the challenge that ROMIO will select one of many implementations based on the file system. I modified your non-UnifyFS filename above to prepend ufs: to emulate the path it'd take when using UnifyFS.
One thing I'm seeing right away is that the MPI_File_write call is implemented as a read/write operation combined with file locking at various points using fcntl. This is in ADIOI_GEN_WriteStrided in ad_write_str.c. I think what might be happening is that ROMIO is trying to lock a region of the file for exclusive access. It then reads in a block of the file, updates the portions of the buffer that MPI_File_write is changing, writes that modified block of data back to the file, and releases the lock. It's a bit hard to know for sure, because this part of the code is macro-heavy, which you can't easily step through in a debugger.
Anyway, because UnifyFS doesn't catch fcntl calls or support file locking, we'll have race conditions that show up here. In particular, both processes may try to read/write the same region of the file at the same time. I don't know if that explains what you're seeing, but that's at least one problem that will show up.
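To make that pattern concrete, here is a minimal, hypothetical sketch of the read-modify-write sequence described above. It is not ROMIO's code (the function name and sizes are made up for illustration), but it shows the lock/read/patch/write/unlock order that data sieving relies on:

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical data-sieving write: lock the enclosing byte range, read the
 * old block, patch in the new bytes, write the block back, release the lock.
 * ROMIO's real implementation is ADIOI_GEN_WriteStrided in ad_write_str.c. */
static int sieved_write(int fd, off_t block_off, size_t block_len,
                        const char *newdata, off_t data_off, size_t data_len)
{
    char block[65536];                  /* assume block_len <= 64 KB here */
    struct flock lk = { .l_type = F_WRLCK, .l_whence = SEEK_SET,
                        .l_start = block_off, .l_len = (off_t) block_len };

    fcntl(fd, F_SETLKW, &lk);                                   /* exclusive lock    */
    pread(fd, block, block_len, block_off);                     /* read old contents */
    memcpy(block + (data_off - block_off), newdata, data_len);  /* patch             */
    pwrite(fd, block, block_len, block_off);                    /* write block back  */

    lk.l_type = F_UNLCK;                                        /* release the lock  */
    fcntl(fd, F_SETLKW, &lk);
    return 0;
}

Without working byte-range locks (the UnifyFS case), two processes can interleave between the pread and the pwrite, and each one writes back a stale copy of the other's bytes.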
We might want to disable data sieving. Can you try a test with the following MPI_Info hints set to disable it?
https://pdfs.semanticscholar.org/898f/8f661cd3a1f816bad41b17f5d39e9686fdfb.pdf
romio_ds_read=disable
romio_ds_write=disable
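For reference, a minimal way to pass those hints in the reproducer above would be something like the following, added right after the existing MPI_Info_create(&info) call (only the two MPI_Info_set lines are new; this is a sketch, not the only way to set ROMIO hints):

MPI_Info_create(&info);
MPI_Info_set(info, "romio_ds_read",  "disable");   /* turn off data sieving for reads  */
MPI_Info_set(info, "romio_ds_write", "disable");   /* turn off data sieving for writes */
/* 'info' is then passed to MPI_File_open and MPI_File_set_view as before */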
@clmendes , another question. Do you have any UnifyFS optimizations enabled here like CLIENT_LOCAL_EXTENTS or SERVER_LOCAL_EXTENTS?
@adammoody , many thanks for looking into this. In regard to your comments:
a) I'm unclear about the fcntl calls you found: didn't Rob Latham tell us that by using "ufs:" the ROMIO code would refrain from using fcntl? Did you still see fcntl being invoked even after adding ufs:?
b) Yes, I can give it a try with data sieving disabled. I'll try it and let you know how it goes.
c) Yes, I usually run with "UNIFYFS_CLIENT_LOCAL_EXTENTS=1". I started using that a while ago, when I was running tests meant to reflect performance.
It looked to be calling fcntl to me even when I added "ufs:" in there. I was rushing though, so I may have missed something.
If MPI is using block-based read-modify-write, then it won't be valid for us to use our client local extents optimization. Let's also try a run with UNIFYFS_CLIENT_LOCAL_EXTENTS=0 to see whether disabling that optimization helps.
Now that I think about it, UnifyFS probably should always run with data sieving off. In a lot of cases, it probably only adds overhead for us.
https://www.mcs.anl.gov/~thakur/papers/romio-coll.pdf
@adammoody , I did some more tests today with the example code above. I started with the original code and set UNIFYFS_CLIENT_LOCAL_EXTENTS=0 in my job script, but that did not make any difference in the output: the data produced in the file by one of the ranks was still wrong (i.e. values = -1). The total file size was correct (i.e. 40000 bytes) but, as before, the values did not match the file produced by plain MPI (without Unify). With Unify, I saw in the client logs only two records corresponding to "write":
Rank=0: unifyfs_fid_logio_write() [unifyfs-fixed.c:286] fid=1 pos=0 - successful logio_write() @ log offset=0 (39996 bytes)
Rank=1: unifyfs_fid_logio_write() [unifyfs-fixed.c:286] fid=1 pos=4 - successful logio_write() @ log offset=0 (39996 bytes)
Prior to those write records, I see the following "read" records in the log:
Rank=0: unifyfs_gfid_read_reqs() [unifyfs.c:1090] read: offset:0, len:39996
Rank=1: unifyfs_gfid_read_reqs() [unifyfs.c:1090] read: offset:4, len:39996
And this was the datafile produced:
-rw------- 1 mendes3 mendes3 40000 Sep 29 17:20 datafile-u
That size is correct, because the code has SIZE=5000, and each element is an "int", thus each MPI rank owns 20000 bytes. Each rank will write those elements in a strided fashion (every other element), hence the 39996 length in the log records. Thus, it looks like each rank will first read the file, adjust the data, then write to the file.
Next, I changed the code to set "romio_ds_read=disable" and "romio_ds_write=disable" via MPI_Info_set calls. This did make a difference, although there were some problems initially. Most importantly, the values in the produced file are now correct! The client log now had records like these:
Rank=0:
unifyfs_fid_logio_write() [unifyfs-fixed.c:286] fid=1 pos=0 - successful logio_write() @ log offset=0 (4 bytes)
unifyfs_fid_logio_write() [unifyfs-fixed.c:286] fid=1 pos=8 - successful logio_write() @ log offset=32768 (4 bytes)
. . .
Rank=1:
unifyfs_fid_logio_write() [unifyfs-fixed.c:286] fid=1 pos=4 - successful logio_write() @ log offset=0 (4 bytes)
unifyfs_fid_logio_write() [unifyfs-fixed.c:286] fid=1 pos=12 - successful logio_write() @ log offset=32768 (4 bytes)
. . .
However, the resulting file was "shorter" than expected:
-rw------- 1 mendes3 mendes3 16376 Sep 29 19:59 datafile-u
So the program seems to have stopped writing earlier than expected. I was using these settings in my script:
export UNIFYFS_LOGIO_CHUNK_SIZE=$(expr 1 * 32768)
export UNIFYFS_LOGIO_SHMEM_SIZE=$(expr 64 * 1048576)
I tried changing these values a bit but didn't get lucky; the only setting where the code worked was CHUNK_SIZE=65536, but then the produced file was even shorter!
Next, I changed, in the source, the value of SIZE: instead of 5000, I used 500. I ran the code with and without Unify, and this time everything worked fine: the file is produced correctly, and the values match the non-Unify execution. The produced file has the correct size:
-rw------- 1 mendes3 mendes3 4000 Sep 29 20:24 datafile-u
Hence, running with data sieving off does seem to solve the problem. This, of course, causes the number of writes in the execution to increase (because each write will now send just 4 bytes to the file), but the values written to the file are correct!
Thanks, @clmendes . That output from the log also helps confirm the difference with and without data sieving. Good thinking to check that.
I'm curious if forcing some fsync's in there will help, which I think we can probably get by calling MPI_File_sync. Does it help with the file size if you modify the code around the barrier to be something like:
MPI_File_sync(fh);
MPI_Barrier(MPI_COMM_WORLD);
MPI_File_sync(fh);
I have two syncs in there on purpose straddling the barrier, since that's often what you see in MPI code, but it might be that just one is enough.
Oh, wait. Your idea to check CHUNK_SIZE also provides a good clue.
Our log-based storage logic uses up a full chunk for every write operation, so each of those 4-byte writes actually consumes a full (32KB) chunk of space. Try reducing the chunk size to something small like 1024 and see if that helps. In theory, you could reduce it to something as small as 4 for this particular problem. However, each chunk also requires some internal bookkeeping, and we'll hit a limit on the number of chunks if we make the chunk size too small.
With the SHMEM size at 64MB and a chunk size of 32KB, we'd have 64MB/32KB = 2048 chunks. So we'd exhaust shared memory storage when the process writes its 2048th integer of the 5000. After that we'd drop into spillover. What are your spillover settings?
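Here is that accounting spelled out as a tiny standalone calculation (a sketch of the reasoning above; the exact bookkeeping inside UnifyFS may differ):

#include <stdio.h>

int main(void)
{
    long shmem_size = 64L * 1048576;            /* UNIFYFS_LOGIO_SHMEM_SIZE = 64 MB */
    long chunk_size = 32768;                    /* UNIFYFS_LOGIO_CHUNK_SIZE = 32 KB */
    long num_chunks = shmem_size / chunk_size;  /* 2048 chunks                      */

    /* with data sieving disabled, each 4-byte int is written on its own,
     * and each such small write consumes a full chunk in the log */
    long ints_before_spill = num_chunks;        /* 2048 of the 5000 ints            */

    printf("chunks available: %ld, ints writable before spillover: %ld\n",
           num_chunks, ints_before_spill);
    return 0;
}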
We might have a bug in spillover or maybe we've also exhausted spillover, but then why don't we throw/catch an out-of-space error?
Although we get to the correct data by disabling the data sieving, I'm still confused about where the -1 values are coming from. I'd like to get to the bottom of that. I would expect UnifyFS to return all 0 for data that it doesn't know about (an assumed hole). Since we're getting something other than 0, there is a second problem hidden in there we'll need to figure out.
@adammoody , yes, I can certainly try the MPI_File_sync as you suggest, I'll let you know.
Just to answer quickly regarding spillover: I had set it to zero. That might be the reason for the code not running!
Without spillover, it likely would have run out of chunks. If each process wrote 2048 ints, that gives 2 * 2048 * 4 = 16,384 bytes, which is just over your observed file size of 16,376. Perhaps one chunk was already used up on each process for something else, or maybe the real limit is 2048-1. Anyway, that looks likely.
We should figure out whether UnifyFS reported an out-of-space error, and if so, why the MPI test didn't report it.
Just to list things in one place...
Observations:
- We should recommend that users disable data sieving. That will only add overhead in UnifyFS, since we'll read/write much more than necessary. It also leads to cases where more than one proc writes to a given file offset, even when that's not the case at the application level.
- ROMIO is still using fcntl to lock a file region here, which we don't support. The locking would be required for correctness if data sieving is enabled. So long as data sieving is disabled, I don't think it hurts that we don't actually implement locking in this particular case. However, we should keep an eye out for other places where file locking or region locking might actually be needed.
Open questions:
- Where do the -1 values come from?
- This is an extreme case of small writes, which is not space-efficient with our current logio support. Can we modify our logio to fully pack chunks before allocating new chunks?
- Is UnifyFS actually reporting ENOSPC on the first write after chunks have been exhausted? If not, we need to fix that in UnifyFS. If so, why doesn't MPI detect and report that error condition?
@adammoody , for the record, to address one of the points suggested here: MPI_File_sync did not help in the original example (i.e. with the default data sieving enabled). I had put it around the barrier following the MPI_File_write:
MPI_File_write(fh, buf, SIZE, MPI_INT, &status);
MPI_File_sync(fh);
MPI_Barrier(MPI_COMM_WORLD);
MPI_File_sync(fh);
The resulting data file still had -1 at the locations from one of the ranks. And, in the logs, I indeed saw the sync calls from the two ranks after both write records:
unifyfs_fid_logio_write() [unifyfs-fixed.c:319] fid=1 pos=0 - successful logio_write() @ log offset=0 (3996 bytes)
unifyfs_fid_logio_write() [unifyfs-fixed.c:319] fid=1 pos=4 - successful logio_write() @ log offset=0 (3996 bytes)
invoke_client_sync_rpc() [margo_client.c:558] invoking the sync rpc function in client
invoke_client_sync_rpc() [margo_client.c:558] invoking the sync rpc function in client
So, at that sync point, I believe the (wrong) data had already been sent to the file.
I was able to produce a correct file when I changed the source code to the following (still with data sieving enabled):
if (myrank==0) MPI_File_write(fh, buf, SIZE, MPI_INT, &status);
MPI_File_sync(fh);
MPI_Barrier(MPI_COMM_WORLD);
MPI_File_sync(fh);
if (myrank==1) MPI_File_write(fh, buf, SIZE, MPI_INT, &status);
This produces correct values in the file, but of course it changes the program considerably, and in some sense it defeats the purpose of matching the original behavior (i.e. how the program behaves without Unify).
@adammoody , and going back to the chunk-size issue: when I disable data sieving (to produce correct values) and move SIZE back to 5000 as originally, but change my chunk settings to
export UNIFYFS_LOGIO_CHUNK_SIZE=$(expr 1 * 16384)
export UNIFYFS_LOGIO_SHMEM_SIZE=$(expr 64 * 1048576)
export UNIFYFS_LOGIO_SPILL_SIZE=$(expr 0 * 1048576)
the program fails, as I had reported (it had worked with CHUNK_SIZE=32768). The client logs contain 8192 records like these:
unifyfs_fid_logio_write() [unifyfs-fixed.c:319] fid=1 pos=32756 - successful logio_write() @ log offset=67076096 (4 bytes)
but then there's a record with this:
unifyfs_fid_logio_write() [unifyfs-fixed.c:302] logio_alloc(4) failed
Also, in my job's output file, I see the following message:
unifyfsd: unifyfs_request_manager.c:557: rm_pack_chunk_requests: Assertion `req_cnt < (4 * 1024)' failed.
srun: error: quartz3: task 0: Aborted (core dumped)
Because the server crashed, it was not possible to extract the datafile back to persistent storage.
I suspect the logio_alloc(4) failed error above is coming up when a process exhausts its chunks.
@MichaelBrim , can you tell us how to increase the number of chunks we can use?
@clmendes , can you report the return code from the MPI_File_write call? I'd like to see whether that error is working its way back through the MPI stack. I believe UnifyFS should have returned an ENOSPC to the internal write/pwrite call that ROMIO is making.
And the rm_pack_chunk_requests assertion failure is probably coming from the read path, where we're trying to issue 5000 read requests but it looks like we have an internal limit of 4096.
To get more chunks, just set LOGIO_CHUNK_SIZE even smaller, like 64. How small you can go depends on the system page size, because we use almost a full page for the chunk storage bitmap. For example, if you had 4KB pages and 64B chunks, your max shmem or spill size is around (4000 * 8) * 64 bytes.
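Working that through with the numbers in that example (a rough sketch based only on the approximation just described, not on the actual UnifyFS data structures):

#include <stdio.h>

int main(void)
{
    long bitmap_bytes = 4000;                    /* "almost a full page" of 4 KB   */
    long chunk_size   = 64;                      /* LOGIO_CHUNK_SIZE under test    */
    long max_chunks   = bitmap_bytes * 8;        /* one bit per chunk -> ~32000    */
    long max_bytes    = max_chunks * chunk_size; /* (4000 * 8) * 64 ~= 2 MB        */

    printf("max chunks ~%ld, max shmem/spill size ~%ld bytes\n",
           max_chunks, max_bytes);
    return 0;
}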
Oh, right. Thanks, @MichaelBrim . I read @clmendes's settings too quickly. @clmendes , with a 16KB chunk size and 64MB of total shmem space, you'd get a maximum of 64MB/16KB = 4096 chunks. That matches where your write records ran out of chunks: 8192 records (4096 per each of the 2 client procs) before the logio_alloc(4) failed error.
The rm_pack_chunk_requests assertion is here:
https://github.com/LLNL/UnifyFS/blob/03920169fe04404187b2db45a56f6511d96d5004/server/src/unifyfs_request_manager.c#L557
We're exceeding the value of MAX_META_PER_SEND, which is defined here (we should probably drop the KB in this definition since it's not a byte count):
https://github.com/LLNL/UnifyFS/blob/03920169fe04404187b2db45a56f6511d96d5004/common/src/unifyfs_const.h#L54
Short term, @clmendes , you could increase this limit and recompile to see if this test case exposes yet other problems.
Long term, we'll probably want to add the logic needed to be able to stream a large set of read requests through.
I mentioned on the HDF call we should be able to support data sieving, but thinking about that more, we'd also need to support file locking for that. ROMIO relies on file locking to serialize access to blocks. Without proper locking, we'd have simultaneous read/write operations from different processes to the same block.
I see I had already realized that in a comment I posted above two days ago. It's fun to argue with yourself, and even more fun when you lose that argument. :-/
Found the source of the -1 values...
This shows up because both ranks attempt a read-modify-write of the same block at the same time. Since we don't serialize on the fcntl lock, both processes read the block simultaneously, and each sees the file as having size 0 (the read indicates EOF). Thus each process starts with an initial write buffer that ROMIO has initialized to all -1, for example here in the implementation of ADIOI_GEN_WriteStrided() at line 348 of mvapich2-2.3.4/src/mpi/romio/adio/common/ad_write_str.c:
memset(writebuf, -1, max_bufsize);
Each process fills in the regions of that buffer corresponding to its own write operation, leaving the "holes" with value -1, and writes that buffer back to the file.
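A self-contained illustration of that failure mode (not ROMIO's code; the file name, sizes, and stride here are made up): the pread on a file that still looks empty returns 0 bytes, so nothing replaces the -1 fill except the elements this process patches in itself:

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define NVALS 8   /* small stand-in for SIZE */

int main(void)
{
    int fd = open("sieve-demo", O_CREAT | O_RDWR, 0600);
    int writebuf[2 * NVALS];
    memset(writebuf, -1, sizeof(writebuf));         /* ROMIO's initial fill */

    ssize_t n = pread(fd, writebuf, sizeof(writebuf), 0);
    (void) n;   /* n == 0 on an empty file, so the -1 fill survives */

    for (int i = 0; i < NVALS; i++)                 /* patch in only my own */
        writebuf[2 * i] = i;                        /* strided elements     */

    pwrite(fd, writebuf, sizeof(writebuf), 0);      /* the gaps go out as -1 */
    close(fd);
    return 0;
}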
I'm a bit puzzled why the write buffer is initialized with -1 instead of 0, since that goes against the POSIX definition for holes. I need to refer to the MPI standard to see how it defines holes.
So MPI_File_set_size has this text, which describes something like a POSIX hole at the end of a file that one could create with ftruncate:
If size is larger than the current file size, the file size becomes size. Regions of the file that have been previously written are unaffected. The values of data in the new regions in the file (those locations with displacements between old file size and size) are undefined.
That says MPI is allowed to do whatever it wants with the data within holes, at least holes at the end of a file.
@adammoody , thanks a lot for finding the source of the "-1" for this issue!
Also, as you had suggested, I ran this example without Unify and employed the Recorder tool. Unfortunately, it does not seem able to show the fcntl calls. This is what I get for the execution on two processes:
Rank-0:
0.011289 0.011289 0 MPI_Comm_size ( MPI_COMM_WORLD 0x7fffffffb9c4 )
0.011298 0.011298 0 MPI_Comm_rank ( MPI_COMM_WORLD 0x7fffffffb9d8 )
0.011337 0.011353 0 MPI_Type_commit ( 0x7fffffffb9e0 )
0.011375 0.011457 0 MPI_Barrier ( MPI_COMM_WORLD )
0.011461 0.014225 0 MPI_File_open ( MPI_COMM_WORLD /g/g12/mendes3/MPI/MPIIO/ISSUE547/datafile-m 9 0x7fffffffb938 0x7fffffffb9d0 )
0.012279 0.012315 -1 open ( /etc/romio-hints 0 )
0.012368 0.012372 0 read ( -1 0x1181bd0 4096 )
0.012631 0.012633 0 umask ( 18 )
0.012635 0.012635 0 umask ( 63 )
0.012636 0.013648 7 open ( /g/g12/mendes3/MPI/MPIIO/ISSUE547/datafile-m 66 393 )
0.014247 0.014379 0 MPI_File_set_view ( 0x89aee0 0 MPI_INT MPI_TYPE_UNKNOWN 0x41a0e8 0x7fffffffb940 )
0.014417 0.021323 0 MPI_File_write ( 0x89aee0 0x117bac0 5000 MPI_INT [-18020_32767_0] )
0.016772 0.016773 0 lseek ( 7 0 0 )
0.016777 0.017903 0 read ( 7 0x11afcc0 39996 )
0.018049 0.018051 0 lseek ( 7 0 0 )
0.018725 0.018762 0 write ( 7 0x11afcc0 39996 )
0.021329 30.021329 0 MPI_Barrier ( MPI_COMM_WORLD )
32.021440 32.025090 0 MPI_File_read_at ( 0x89aee0 0 0x117bac0 5000 MPI_INT [-18020_32767_0] )
32.021571 32.021574 0 lseek ( 7 0 0 )
32.022327 32.024056 0 read ( 7 0x11afcc0 39996 )
32.025148 32.025166 0 vfprintf ( 2 27 )
32.025171 32.025331 0 MPI_File_close ( 0x7fffffffb9d0 )
32.025226 32.025235 0 close ( 7 )
32.025334 32.025339 0 MPI_Barrier ( MPI_COMM_WORLD )
Rank-1:
0.008721 0.008722 0 MPI_Comm_size ( MPI_COMM_WORLD 0x7fffffffb9c4 )
0.008738 0.008738 0 MPI_Comm_rank ( MPI_COMM_WORLD 0x7fffffffb9d8 )
0.008803 0.008834 0 MPI_Type_commit ( 0x7fffffffb9e0 )
0.008898 0.013362 0 MPI_Barrier ( MPI_COMM_WORLD )
0.013365 0.016142 0 MPI_File_open ( MPI_COMM_WORLD /g/g12/mendes3/MPI/MPIIO/ISSUE547/datafile-m 9 0x7fffffffb938 0x7fffffffb9d0 )
0.014552 0.014554 0 umask ( 18 )
0.014556 0.014557 0 umask ( 63 )
0.014558 0.016067 6 open ( /g/g12/mendes3/MPI/MPIIO/ISSUE547/datafile-m 66 393 )
0.016165 0.016295 0 MPI_File_set_view ( 0x728760 0 MPI_INT MPI_TYPE_UNKNOWN 0x41a0e8 0x7fffffffb940 )
0.016331 30.023209 0 MPI_File_write ( 0x728760 0x723450 5000 MPI_INT [-18020_32767_0] )
30.018681 30.018685 0 lseek ( 6 4 0 )
30.018696 30.020307 0 read ( 6 0x2aaaaac320a0 39996 )
30.020407 30.020410 0 lseek ( 6 4 0 )
30.020949 30.020984 0 write ( 6 0x2aaaaac320a0 39996 )
30.023217 30.023229 0 MPI_Barrier ( MPI_COMM_WORLD )
32.023292 32.026878 0 MPI_File_read_at ( 0x728760 0 0x723450 5000 MPI_INT [-18020_32767_0] )
32.023411 32.023413 0 lseek ( 6 4 0 )
32.024255 32.025994 0 read ( 6 0x2aaab52f80a0 39996 )
32.026890 32.026905 0 vfprintf ( 2 45 )
32.026908 32.027178 0 MPI_File_close ( 0x7fffffffb9d0 )
32.027125 32.027150 0 close ( 6 )
32.027180 32.027255 0 MPI_Barrier ( MPI_COMM_WORLD )
And this is the job's output:
testing contiguous in memory, noncontiguous in file using independent I/O
testing contiguous in memory, noncontiguous in file using independent I/O
[1] BUF READ: 5000 5001 5002 5003 5004 5005
[0] BUF READ: 0 1 2 3 4 5
[Recorder] elapsed time on rank 0: 32.02
Thanks, @clmendes . Yes, it looks like fcntl is probably not yet supported in Recorder. Perhaps it's not too much trouble to add. It seems like locking is a pattern that would be helpful to know about. We should check on that.
@adammoody , I'm adding @kathrynmohror to this thread, since she had asked us to provide feedback about Recorder, such that she could send to the student.
Kathryn: I told Adam that I ran this original code (source code above) without Unify, but with Recorder, so that I could see the I/O traces. I did not set anything about data sieving, so it should be enabled by default. Adam had debugged this a while ago, and saw that MPI-IO would use 'fcntl' calls to control correct access to the file by the two ranks. However, Recorder did not capture those 'fcntl' calls. I think it's just not configured to do so, but since it would be useful for us to see them, this might be reasonable feedback to send to the tool's author.
@kathrynmohror , to expand on that request for Recorder, knowing whether a code calls fcntl at all is useful. Though as a second level of tracing, fcntl can be used for lots of different things. When used for file locking, it would be nice to know what type of lock (acquire/release, read/write) and the specific byte range that is being locked.
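For context, a generic byte-range lock request looks like the sketch below (not ROMIO's exact call), so the interesting fields to trace would be l_type (F_RDLCK, F_WRLCK, or F_UNLCK), l_start, and l_len:

#include <fcntl.h>
#include <unistd.h>

/* Acquire (or release) a byte-range lock on [offset, offset+len) of fd.
 * type is F_RDLCK, F_WRLCK, or F_UNLCK; F_SETLKW blocks until granted. */
static int lock_range(int fd, short type, off_t offset, off_t len)
{
    struct flock lk;
    lk.l_type   = type;        /* read lock, write lock, or unlock   */
    lk.l_whence = SEEK_SET;
    lk.l_start  = offset;      /* first byte of the locked range     */
    lk.l_len    = len;         /* number of bytes; 0 means "to EOF"  */
    return fcntl(fd, F_SETLKW, &lk);
}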
@adammoody , I retested this today, after making the following change that you had suggested:
In file UnifyFS/common/src/unifyfs_const.h I changed
MAX_META_PER_SEND (4 * KIB) to MAX_META_PER_SEND (16 * KIB)
Then I rebuilt my Unify, and now the noncontig.c program works correctly under Unify! I still needed to disable data sieving, but now I can run the program with the original dataset size and no longer get a crash! The resulting datafile matches the file from the non-Unify execution.
These are the parameters that I used in my job for Unify:
export UNIFYFS_LOGIO_CHUNK_SIZE=$(expr 1 * 16384)
export UNIFYFS_LOGIO_SHMEM_SIZE=$(expr 256 * 1048576)
export UNIFYFS_LOGIO_SPILL_SIZE=$(expr 0 * 1048576)   (NO SPILL!)
I tested this reproducer code today on Lassen (IBM system), and although it failed with the original form, it did work fine when I changed #define SIZE 5000 so that SIZE would be 500. Thus, the code seems to work fine now, and I guess the problem for a larger value of SIZE is that I did not change MAX_META_PER_SEND in my Unify source on Lassen. The successful test on Lassen (with SIZE=500) is under directory /usr/workspace/scr/mendes3/LASSEN/ROMIO-TESTS/UNIFY/Noncontig/ISSUE457
Thanks @clmendes . I think we have a good handle on the cause of this failure and its workarounds now.
To summarize:
- due to lack of support for fcntl, UnifyFS will not currently support proper data sieving in ROMIO, which relies on file byte-range locking as ranks simultaneously read-modify-write regions of a file
- one should disable data sieving in ROMIO with romio_ds_read=disable and romio_ds_write=disable
- disabling sieving will turn a few large writes into many small writes for a heavily-interleaved file
- for many small writes, one must set UNIFYFS_LOGIO_CHUNK_SIZE to something sufficiently small so as to not exhaust storage chunks
- to support a large read over a region that was produced using many small writes, one may need to increase MAX_META_PER_SEND as noted in https://github.com/LLNL/UnifyFS/issues/547#issuecomment-701692846