EINVAL in mfu_lseek() for very large files (>16TB)
Hi there!
We're trying to dsync a set of very large files (32TB) and the copy fails with the following error:
[709] [/dev/shm/mpifileutils-0.9.1/src/common/mfu_flist_copy.c:1262] ERROR: Couldn't seek in destination path `/oak/stanford/orgs/kipac/users/swmclau2/Darksky/ds14_a_1.0000' (errno=22 Invalid argument)
Both source and destination filesystems are Lustre.
The resulting copied file's size is only 16TB, so we suspect some kind of overflow in mfu_lseek().
Ah, well, it's not an overflow in mfu_lseek(), it's just that we're hitting a striping limit on the destination filesystem.
Because the striping information is not compatible between the two filesystems, it's not copied, and the file falls back to the default striping on the destination filesystem, which is just 1 object per file. And because the maximum size of a file object on ldiskfs is 16TB, you can't have files larger than 16TB with only one stripe, hence the error.
So the fix is to pre-create the destination files and stripe them on multiple OSTs. Then the error goes away.
But that being said, the EINVAL error is a bit misleading. Would there be a way to get EFBIG (File too large) somewhere before hitting that point?
@kcgthb , I think we're just returning the error code from lseek(), which must be what lustre is handing back. Do you know or can you test whether adding a truncate() before starting to copy the file would return EFBIG? If so, perhaps we could try to truncate the inode to the correct size before starting the copy.
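As a sketch of what that pre-copy check could look like (the function name and error-code convention here are mine, not mpifileutils'): create the destination inode and immediately extend it to the source size, so a filesystem that can't hold a file that large should fail with EFBIG before any data moves.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical pre-copy probe: extend the (new) destination file to the
 * source size up front.  On a filesystem that cannot hold a file this
 * large, ftruncate() should fail with EFBIG here, before any data is
 * copied.  Returns 0 on success, -errno on failure. */
int precheck_dest_size(const char *path, off_t src_size)
{
    int fd = open(path, O_WRONLY | O_CREAT, 0644);
    if (fd < 0) {
        return -errno;
    }
    int rc = 0;
    if (ftruncate(fd, src_size) != 0) {
        rc = -errno;  /* e.g. -EFBIG when src_size exceeds the fs limit */
    }
    close(fd);
    return rc;
}
```

Whether Lustre actually reports EFBIG (rather than EINVAL) at this point is exactly what would need testing on the destination filesystem.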
Another thing on our wishlist that would help in this case is to have options on dcp/dsync so that one can specify what the new striping params should be on destination files. We could have something like dstripe options.
Using ftruncate() should provide a more descriptive error (e.g. EFBIG):
lseek(0, 0, SEEK_CUR) = 0
open("/oak/stanford/orgs/kipac/users/swmclau2/Darksky/.sparse32T_s1", O_RDWR|O_CREAT, 0666) = 3
dup2(3, 1) = 1
close(3) = 0
ftruncate(1, 35184372088832) = -1 EFBIG (File too large)
What we don't understand is why a sparse file of 16TiB is created (which matches the max object size of our ldiskfs backend) even though there is an error returned.
The normal sequence used in dsync for copying a file is to have one rank first create the inode with mknod(), and then different chunks of the file are assigned to ranks to be copied. Each rank starts copying data with a loop of lseek() / read() / write() calls. Finally after all ranks have copied their portion, ownership, permissions, and timestamps are set.
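In simplified, single-process form, that per-chunk loop might look like the sketch below (a rough illustration, not the actual mpifileutils code; `copy_chunk` is my name):

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical sketch of one rank copying its assigned
 * [offset, offset + len) chunk.  An lseek() failure here, e.g. when the
 * offset is beyond what the destination filesystem allows, is the kind
 * of error reported above.  Returns 0 on success, -errno on failure. */
int copy_chunk(int src_fd, int dst_fd, off_t offset, size_t len)
{
    char buf[1 << 16];
    if (lseek(src_fd, offset, SEEK_SET) < 0 ||
        lseek(dst_fd, offset, SEEK_SET) < 0) {
        return -errno;
    }
    while (len > 0) {
        size_t want = len < sizeof(buf) ? len : sizeof(buf);
        ssize_t n = read(src_fd, buf, want);
        if (n <= 0) {
            return n == 0 ? 0 : -errno;  /* EOF on source, or read error */
        }
        ssize_t done = 0;
        while (done < n) {
            ssize_t w = write(dst_fd, buf + done, (size_t)(n - done));
            if (w < 0) {
                return -errno;
            }
            done += w;
        }
        len -= (size_t)n;
    }
    return 0;
}
```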
So here the mknod() succeeded in creating the file, which would generate a 0-byte regular file. This is also the point where the striping parameters are set in the case of Lustre. Then during the copy phase, a rank detected an error when its lseek() failed. Finally, dsync does not currently delete destination files that it failed to copy, so the partially written file is left in place.
Does that line up with what you see?
We could look at adding a step or an option to check and delete partial files. That's not exactly straightforward, though. We'd have to settle on which errors constitute a failure such that the file should be deleted. For example, we know that copying some extended attributes is expected to fail in certain cases, so errors on some attributes should be ignored.
Taking a look at the code, we do have logic to identify and delete files that failed during the actual data copy:
https://github.com/hpc/mpifileutils/commit/f10cae3e409cc885f5d3e525ef58faf3fac6ccf5
However, I disabled it two days later and just before I merged it:
https://github.com/hpc/mpifileutils/commit/b6a597e1d635d06634e19df1e01a3b45b5da945b
Off the top of my head, I can't recall why I disabled it. Anyway, that's something we need to look into again, especially if we're updating the timestamps on the partial file, since a subsequent dsync would then skip retrying it like we'd want it to.
Can you verify if it set the size and timestamps on the partial file to match those of the source file?
Another option could be to create the destination file sparse before starting to copy actual data, maybe?
After creating the file with mknod() and optionally applying the destination striping, trying to ftruncate() the destination file to its final size would likely trigger EFBIG earlier in the process, right?
Then, filesize comparison would need to rely on actually used blocks instead of apparent size, but maybe that's the case already.