--synchronous not handling the last block correctly
Running mpifileutils 0.10 on Ubuntu 5.3.0-24-generic
Here is a simple ext4 -> ext4 single file copy with O_DIRECT:
$ mpirun -np 1 dcp --synchronous -p -v /raid/datasets/train-val-recordio-passthrough/val.rec /raid/scratch/val.rec
[2020-06-16T17:06:57] Using synchronous read/write (O_DIRECT)
[2020-06-16T17:06:57] Preserving file attributes.
[2020-06-16T17:06:57] Warning: `/raid/scratch/val.rec' does not exist
[2020-06-16T17:06:57] Walking /raid/datasets/train-val-recordio-passthrough/val.rec
[2020-06-16T17:06:57] Walked 1 items in 0.000216 secs (4629.865410 items/sec) ...
[2020-06-16T17:06:57] Walked 1 items in 0.000255 seconds (3926.310995 items/sec)
[2020-06-16T17:06:57] Copying to /raid/scratch/val.rec
[2020-06-16T17:06:57] Items: 1
[2020-06-16T17:06:57] Directories: 0
[2020-06-16T17:06:57] Files: 1
[2020-06-16T17:06:57] Links: 0
[2020-06-16T17:06:57] Data: 6.248 GB (6.248 GB per file)
[2020-06-16T17:06:57] Creating directories.
[2020-06-16T17:06:57] level=4 min=0 max=0 sum=0 rate=0.000000/sec secs=0.000001
[2020-06-16T17:06:57] Created 0 directories in 0.000005 seconds (0.000000 items/sec)
[2020-06-16T17:06:57] Creating files.
[2020-06-16T17:06:57] level=4 min=1 max=1 sum=1 rate=26474.637297 secs=0.000038
[2020-06-16T17:06:57] Created 1 items in 0.000070 seconds (14252.526260 items/sec)
[2020-06-16T17:06:57] Copying data.
ABORT: rank X on HOST: Failed to read file /raid/datasets/train-val-recordio-passthrough/val.rec errno=22 (Invalid argument) @ /usr/local/src/mpifileutils/src/common/mfu_io.c:371
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode -1.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
Under strace:
[pid 39414] read(16, "\360\233^e\246\323K\\\341\222\342lv\331\16\252\267\275\307v\326>\317Q\354\377\0\203\330\241\327s"..., 1048576) = 1048576
[pid 39414] write(17, "\360\233^e\246\323K\\\341\222\342lv\331\16\252\267\275\307v\326>\317Q\354\377\0\203\330\241\327s"..., 1048576) = 1048576
[pid 39414] read(16, "\355[fUl\313\374\331\314*\246\256\237\16\246\331\225\370\356#>\\\240\224\365\215x>q-\247\253"..., 1048576) = 1048576
[pid 39414] write(17, "\355[fUl\313\374\331\314*\246\256\237\16\246\331\225\370\356#>\\\240\224\365\215x>q-\247\253"..., 1048576) = 1048576
[pid 39414] read(16, 0x563575900000, 494700) = -1 EINVAL (Invalid argument)
[pid 39414] read(16, 0x563575900000, 494700) = -1 EINVAL (Invalid argument)
[pid 39414] read(16, 0x563575900000, 494700) = -1 EINVAL (Invalid argument)
[pid 39414] read(16, 0x563575900000, 494700) = -1 EINVAL (Invalid argument)
[pid 39414] read(16, 0x563575900000, 494700) = -1 EINVAL (Invalid argument)
[pid 39414] write(2, "ABORT: rank X on HOST: ", 23ABORT: rank X on HOST: ) = 23
The last read is not properly aligned, from man 2 read:
EINVAL [...] or the file was opened with the O_DIRECT flag, and either the address specified in buf, the value specified in count, or the file offset is not suitably aligned.
Testing with dd(1): the last block is still read with the block size, and the truncation is reported by the syscall. This new size is used for the subsequent write(2), which accepts this non-aligned length just fine:
$ strace -f -e read,write dd bs=1M iflag=direct oflag=direct if=/raid/datasets/train-val-recordio-passthrough/val.rec of=/raid/scratch/val.rec
[...]
read(0, "\355[fUl\313\374\331\314*\246\256\237\16\246\331\225\370\356#>\\\240\224\365\215x>q-\247\253"..., 1048576) = 1048576
write(1, "\355[fUl\313\374\331\314*\246\256\237\16\246\331\225\370\356#>\\\240\224\365\215x>q-\247\253"..., 1048576) = 1048576
read(0, "\264\245IU#\17\306\322j\256\205\320\216\391\235\222)\344U\215\2667\269\\\216\276\325Qb"..., 1048576) = 494700
write(1, "\264\245IU#\17\306\322j\256\205\320\216\391\235\222)\344U\215\2667\269\\\216\276\325Qb"..., 494700) = 494700
Thanks for the report and the helpful debugging information, @flx42 . Yes, as you suggest, I believe the problem is that the last read should still be using a count=1MB even though it knows there are only 494700 bytes remaining in the file.
On some file systems, the I think the final write must also use a count that is aligned.
man 2 write EINVAL fd is attached to an object which is unsuitable for writing; or the file was opened with the O_DIRECT flag, and either the address specified in buf, the value specified in count, or the current file offset is not suitably aligned.
Based on your strace of the dd test, it doesn't look like your ext4 has a problem with that. We should fix the last read and double check the code for the final write.
I think the code to deal with the count on the final write is good. I haven't had a chance to test it, but this patch might solve the read size case:
diff --git a/src/common/mfu_flist_copy.c b/src/common/mfu_flist_copy.c
index b4039c9..b697b04 100644
--- a/src/common/mfu_flist_copy.c
+++ b/src/common/mfu_flist_copy.c
@@ -1591,6 +1591,10 @@ static int mfu_copy_file_normal(
if(left_to_read > buf_size) {
left_to_read = buf_size;
}
+ if(mfu_copy_opts->synchronous) {
+ /* O_DIRECT requires particular read sizes */
+ left_to_read = buf_size;
+ }
/* read data from source file */
ssize_t num_of_bytes_read = mfu_read(src, in_fd, buf, left_to_read);
It doesn't seem to work:
[pid 32019] read(16, "\355[fUl\313\374\331\314*\246\256\237\16\246\331\225\370\356#>\\\240\224\365\215x>q-\247\253"..., 1048576) = 1048576
[pid 32019] write(17, "\355[fUl\313\374\331\314*\246\256\237\16\246\331\225\370\356#>\\\240\224\365\215x>q-\247\253"..., 1048576) = 1048576
[pid 32019] read(16, "\264\245IU#\17\306\322j\256\205\320\216\391\235\222)\344U\215\2667\269\\\216\276\325Qb"..., 1048576) = 494700
[pid 32019] read(16, 0x555c07178c6c, 553876) = -1 EINVAL (Invalid argument)
[pid 32019] read(16, 0x555c07178c6c, 553876) = -1 EINVAL (Invalid argument)
[pid 32019] read(16, 0x555c07178c6c, 553876) = -1 EINVAL (Invalid argument)
[pid 32019] read(16, 0x555c07178c6c, 553876) = -1 EINVAL (Invalid argument)
[pid 32019] read(16, 0x555c07178c6c, 553876) = -1 EINVAL (Invalid argument)
[pid 32019] write(2, "ABORT: rank X on HOST: ", 23ABORT: rank X on HOST: ) = 23
[pid 32019] write(2, "Failed to read file /raid/datase"..., 101Failed to read file /raid/datasets/train-val-recordio-passthrough/val.rec errno=22 (Invalid argument)) = 101
[pid 32019] write(2, " @ /usr/local/src/mpifileutils/s"..., 55 @ /usr/local/src/mpifileutils/src/common/mfu_io.c:371
It's trying to read the "rest": 1048576 - 494700 = 553876
Looks like it's hitting the retry and then abort logic in mfu_read:
https://github.com/hpc/mpifileutils/blob/57a3578b68d8df0fa53a7662c33c6c21ce72400f/src/common/mfu_io.c#L350-L381
Thanks for trying, @flx42 . Oh, right. I forgot about that. The read is set up to assume that you really want to read in all of the bytes requested and considers short reads to be an error. So a fix will take a bit more effort.
Looking through the code, I think all places calling mfu_read() were correctly handling short reads, so it's safe to return from that function early. I've pushed a commit that also includes that change.
I've got a PR going here: https://github.com/hpc/mpifileutils/pull/343
So far, none of the file systems that I have easy access to reproduce the read error, so I haven't been able to test this fix. I'll keep looking to see if I can reproduce it. Mind giving that another shot?
@flx42 I've rewritten the direct I/O code to use pread/pwrite instead of read/write in https://github.com/hpc/mpifileutils/pull/380. It's not quite ready yet, but when it is ready, it would help if you wouldn't mind running your test case again to make sure I didn't break things for you.
Ok, it should be ready to go now. The goal here was to handle short read/writes in O_DIRECT by retrying using the same buf, offset, and length to satisfy alignment requirements. That was cleaner to do using pread/pwrite rather than using read/write with a bunch of lseek operations. If you're able to test, that'd be helpful, but if not, that's fine too. Thanks either way.
I was only able to do a brief test, but it seemed fine! Thanks a lot!