mpifileutils Inefficiency in dsync when restarting a failed dsync

As far as I can tell, a dsync that is aborted for some reason and then restarted will delete all files already copied and make new copies, because their owner/permissions/mod-time don't match the source.

If the metadata on each file were set directly after its data has been copied, instead of in a separate loop after copying all data, this would not be a problem.

Nov 05 '20 12:11 akesandgren

I do believe the --batch-files option would alleviate this somewhat. I'm not sure of all the repercussions of setting the metadata directly after copying, but one I can think of is the case where multiple ranks are writing to the same file, and that file should NOT have write permission. The file must have write permission during the copy, and write permission is removed only after copying. If multiple ranks are writing to that same file, it becomes messy/expensive to know when all ranks are done with it. Similarly, for mod-time, ranks would have to coordinate on when they are all done with a specific file so that the times can be updated. Hence, the metadata is set after an entire batch of files. To support setting metadata directly after copying, I think there would need to be some way to ensure that only a single rank works on each file, which has performance repercussions as well.

Jan 12 '21 19:01 daltonbohning

For POSIX-compliant filesystems, it is definitely possible to open(O_CREAT|O_RDWR, 0444) a new file, so that the file descriptor has write permissions but the file itself has only read permissions.
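A minimal sketch of that behaviour, assuming a local POSIX filesystem and a single process (no MPI). The function names are hypothetical and are not part of mpifileutils. O_EXCL is added so the call is guaranteed to create the file, since the mode argument only applies to a newly created file.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: create `path` with mode 0444 and write `len` bytes
 * through the descriptor returned by the creating open().
 * Returns 0 on success, -1 on error. */
int create_readonly_and_write(const char *path, const char *buf, size_t len)
{
    assert(path != NULL && buf != NULL);

    /* The descriptor is opened read/write even though the permission bits
     * of the new file are read-only. */
    int fd = open(path, O_CREAT | O_EXCL | O_RDWR, 0444);
    if (fd < 0)
        return -1;

    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted, retry */
            close(fd);
            return -1;
        }
        buf += n;
        len -= (size_t)n;
    }
    return close(fd);
}

/* Hypothetical helper: returns 1 if a fresh open-for-write is refused with
 * EACCES, 0 otherwise. A privileged user is normally not refused. */
int reopen_for_write_is_denied(const char *path)
{
    int fd = open(path, O_WRONLY);
    if (fd >= 0) {
        close(fd);
        return 0;
    }
    return errno == EACCES ? 1 : 0;
}
```

The second helper shows the limitation: once the creating descriptor is closed, an unprivileged process can no longer open the file for writing, so this only helps the process that created the file.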

This is more tricky with MPI opening the same file from multiple nodes, but at least it should be possible to change the permission on the file after it has been opened by the various ranks doing the copy. The file definitely does not need to have write access permission during the copy as long as it had write permission at the time it was opened (though I guess it is possible there is some strangeness/incompatibility with NFS or other non-POSIX filesystems).

Jan 12 '21 19:01 adilger

> For POSIX-compliant filesystems, it is definitely possible to open(O_CREAT|O_RDWR, 0444) a new file, so that the file descriptor has write permissions but the file itself has only read permissions.

When creating, yes, but is this also true for existing files?

> though I guess it is possible there is some strangeness/incompatibility with NFS or other non-POSIX filesystems

I am not sure about this :)

I'm currently going through the dsync codebase so I can add DAOS support. If dsync could do something like this, that would definitely be a great performance improvement. I'll keep this in mind and report back if I see anything that suggests this either can or cannot be done.

Jan 12 '21 19:01 daltonbohning

Yes, I'm all for looking for further optimization here.

As @daltonbohning mentioned, the current implementation sets the permission bits on the files after all writing is complete. We mknod() all file inodes before anyone starts to write. Those inodes are created with read/write bits enabled, regardless of what the final permission bits should be. Then when opening the files for writing, we do not use O_CREAT -- just O_WRONLY. After we know all ranks are done writing, we set the permission bits to their actual values (which may disable the write bit). Finally, the atime/mtime values are set as the very last step.

For dealing with a large set of files, we added the --batch-files option as a type of checkpoint. The intent is to process a large set of files by completing them in smaller batches, and then if interrupted, dsync can restart by picking up after its most recently completed batch. That's not as efficient as finalizing each file the instant it is done, but it's a step in that direction.

It would be great if we could find a reliable and efficient way to coordinate when to set metadata like permission bits and timestamps on shared files. If that's not possible, another option might be to avoid sharing files, at least in cases where the total file count is much higher than the number of ranks, so that we can still get decent workload balance.
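As a rough illustration of the "avoid sharing files" idea, here is a sketch that assigns each file to exactly one rank using a greedy largest-first heuristic. This is an assumed approach for discussion, not how dsync distributes work today, and all names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

struct file_item {
    size_t index;   /* position in the caller's file list */
    uint64_t size;  /* file size in bytes */
};

/* Sort larger files first; break ties by original index for determinism. */
static int by_size_desc(const void *a, const void *b)
{
    const struct file_item *x = a;
    const struct file_item *y = b;
    if (x->size < y->size) return 1;
    if (x->size > y->size) return -1;
    return (x->index > y->index) - (x->index < y->index);
}

/* Assign every file to a single rank. On return, owner[i] is the rank that
 * copies file i and load[r] is the total bytes given to rank r.
 * Returns 0 on success, -1 on allocation failure. */
int assign_whole_files(const uint64_t *sizes, size_t nfiles, int nranks,
                       int *owner, uint64_t *load)
{
    assert(nranks > 0 && owner != NULL && load != NULL);

    struct file_item *items = malloc((nfiles ? nfiles : 1) * sizeof *items);
    if (items == NULL)
        return -1;
    for (size_t i = 0; i < nfiles; i++) {
        items[i].index = i;
        items[i].size = sizes[i];
    }
    qsort(items, nfiles, sizeof *items, by_size_desc);

    for (int r = 0; r < nranks; r++)
        load[r] = 0;

    for (size_t i = 0; i < nfiles; i++) {
        int best = 0;   /* least-loaded rank so far, lowest rank wins ties */
        for (int r = 1; r < nranks; r++) {
            if (load[r] < load[best])
                best = r;
        }
        owner[items[i].index] = best;
        load[best] += items[i].size;
    }

    free(items);
    return 0;
}
```

With one owner per file, that rank can set permissions and timestamps as soon as its copy finishes. The cost is that a single very large file can no longer be split across ranks, which is why this only balances well when there are many more files than ranks.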

Jan 12 '21 20:01 adammoody

It is not possible to open files for write after the initial open(O_CREAT) if they don't have write permission (maybe as root, I'm not sure). So one option would be to open the file O_RDWR on the ranks that will be doing the copy, then change the permission, then actually write the file. That avoids the much larger window of having files with the wrong mode, and all "incorrect" files would be zero-length and easily handled.
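A minimal sketch of the open-then-restrict ordering, assuming a local POSIX filesystem where access is checked at open() time (NFS and other filesystems may behave differently, as noted earlier in the thread). The function name is hypothetical.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: open an existing, still-writable file for read/write,
 * then immediately apply the final (possibly read-only) mode.
 * Returns the open descriptor, or -1 on error. */
int open_then_restrict(const char *path, mode_t final_mode)
{
    assert(path != NULL);

    int fd = open(path, O_RDWR);
    if (fd < 0)
        return -1;

    /* The mode change affects later open() calls only; this descriptor
     * keeps the access it was granted when it was opened. */
    if (fchmod(fd, final_mode) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

In a multi-rank copy, every rank that writes to the file would need its descriptor open before any rank changes the mode, so some synchronization point is still required, but it happens before the data copy rather than after it.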

Another option is to use open(O_TMPFILE, ...) to create an invisible (open-unlinked) file descriptor to handle the file write (the mode doesn't matter since the file will not be visible in the namespace). The file is automatically deleted if the copy processes crash, and linkat(...) can link it into the namespace once the copy is done. Lustre does not currently support O_TMPFILE, though it allows a special "volatile filename" (that predates O_TMPFILE) for creating files that do not appear in the namespace. I haven't checked yet if these can be linked afterward with linkat(), but if not this could be fixed along with the addition of O_TMPFILE (https://jira.whamcloud.com/browse/LU-9512). That would avoid much of the need for "batches" entirely, since files would not appear in the namespace until they are completed.
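A minimal single-process sketch of the O_TMPFILE plus linkat() pattern. It assumes Linux, a filesystem that supports O_TMPFILE, and a mounted /proc; linking through /proc/self/fd is the unprivileged variant described in the open(2) man page. The function names are hypothetical, and how multiple ranks would share one unnamed file is not addressed here.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: create an unnamed file inside directory `dir`.
 * The file has no name yet, so it disappears if the process exits before
 * publish_file() is called. Returns a descriptor, or -1 on error
 * (for example when the filesystem lacks O_TMPFILE support). */
int open_unnamed(const char *dir, mode_t final_mode)
{
    assert(dir != NULL);
    return open(dir, O_TMPFILE | O_WRONLY, final_mode);
}

/* Hypothetical helper: give the finished file its name. `dest` must be on
 * the same filesystem as the directory passed to open_unnamed() and must
 * not already exist. Returns 0 on success, -1 on error. */
int publish_file(int fd, const char *dest)
{
    assert(fd >= 0 && dest != NULL);

    char proc_path[64];
    snprintf(proc_path, sizeof proc_path, "/proc/self/fd/%d", fd);

    /* AT_SYMLINK_FOLLOW makes linkat() resolve the /proc symlink to the
     * unnamed inode instead of linking the symlink itself. */
    return linkat(AT_FDCWD, proc_path, AT_FDCWD, dest, AT_SYMLINK_FOLLOW);
}
```

Since the final mode can be passed at creation and the name appears only after the data is complete, a restarted copy would never see a partially written file under its real name.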

Note that linkat() (since kernel 2.6.16) and O_TMPFILE (since kernel 3.11) are Linux-specific, but both would be available on at least RHEL7+ clients.

Jan 12 '21 20:01 adilger