Lustre directory locking
@dinatale2 mentioned there may be some issues with directory locking / lock thrashing. Hopefully he can provide more details.
So I'm envisioning a case where some of the mpifileutils tools that break files into chunks may compete for at least the parent directory lock. This wouldn't be a problem with only a handful of tasks (say 100s) attempting to open a file, but imagine a set of extremely large files where 1000s of ranks attempt to open a single file. That could be a lot of contention.
Not sure whether it's a problem, but I figured I'd throw this out there in case it turns out to be a concern. If it is, we could potentially randomize the order in which each rank writes its chunks.
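Something like this is what I have in mind (purely a hypothetical sketch, nothing that exists in mpifileutils today; the function and variable names are made up): each rank shuffles its list of assigned chunk indices with a rank-seeded PRNG before writing, so ranks don't all march through the file in the same order.

```c
/* Hypothetical sketch: shuffle the chunk indices a rank will write,
 * seeded by the rank id, so ranks do not all write chunks in the
 * same order.  Illustrative only, not mpifileutils code. */
#include <stdint.h>
#include <stdlib.h>

static void shuffle_chunks(uint64_t* chunks, uint64_t count, int rank)
{
    if (count < 2) {
        return;
    }
    srand((unsigned int) rank + 1);   /* per-rank seed */
    for (uint64_t i = count - 1; i > 0; i--) {
        uint64_t j = (uint64_t) rand() % (i + 1);
        uint64_t tmp = chunks[i];
        chunks[i] = chunks[j];
        chunks[j] = tmp;
    }
}
```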
@dinatale2 @gonsie @adammoody Hmm.. I believe that was exactly the issue we were seeing with Lustre in dcp1 for very large files. This should be resolved (I believe) in dcp, which fixed the issue by moving multiple file chunks from the same file to the same process.
Or, I should say, each file has a "process owner" that keeps track of where its file chunks are. @adammoody can correct me if I'm wrong.
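Roughly what I have in mind (just my own sketch of the idea with made-up names, not dcp's actual code): every rank can deterministically compute the same owner rank for a given file, so all chunks of that file get routed to and tracked by a single process.

```c
/* Hypothetical sketch of "one owner rank per file": hash the file
 * name so every rank agrees on which process owns (and tracks the
 * chunks of) a given file.  Not dcp's actual implementation. */
#include <stdint.h>

static int chunk_owner_rank(const char* filename, int nranks)
{
    /* FNV-1a hash of the file name */
    uint64_t h = 1469598103934665603ULL;
    for (const char* p = filename; *p != '\0'; p++) {
        h ^= (uint8_t) *p;
        h *= 1099511628211ULL;
    }
    return (int) (h % (uint64_t) nranks);
}
```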
@dsikich I'll dig thru the code when I have a few minutes to see what I can glean.
@dinatale2 great, sounds good.
@dinatale2 any idea if this was resolved?
The topic of this issue is a bit confusing. It mentions "directory locking", but then discusses thousands of ranks opening the same file? As long as they are not opening the file with O_CREAT they shouldn't cause contention. That said, there are probably diminishing returns from having so many ranks opening a single file, so there should probably be some empirical limit put on how many ranks are used to copy a single file.
That said, there is a potential issue with directory locking that isn't mentioned here. With rm -r style workloads, there can be thrashing of the directory lock when clients do small readdir() requests and then unlink() (which causes the directory to be updated by the MDS and the directory lock to be revoked). If drm doesn't already do so, it should do the full readdir() operation first to get all of the filenames (if possible, or at least as many as possible for the number of clients), and then do unlink() as a separate step.
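In other words, something like the following (a single-client POSIX sketch just to illustrate the two-phase ordering, not what drm's MPI code should literally look like):

```c
/* Sketch: read all directory entries first, then unlink them as a
 * separate pass, so readdir() and unlink() requests against the same
 * directory are not interleaved.  Error handling trimmed for brevity. */
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

static void remove_dir_entries(const char* dir)
{
    /* phase 1: readdir() everything into memory */
    DIR* dp = opendir(dir);
    if (dp == NULL) return;
    char** names = NULL;
    size_t count = 0, cap = 0;
    struct dirent* d;
    while ((d = readdir(dp)) != NULL) {
        if (strcmp(d->d_name, ".") == 0 || strcmp(d->d_name, "..") == 0)
            continue;
        if (count == cap) {
            cap = cap ? cap * 2 : 1024;
            names = realloc(names, cap * sizeof(*names));
        }
        names[count++] = strdup(d->d_name);
    }
    closedir(dp);

    /* phase 2: unlink everything as a separate pass */
    for (size_t i = 0; i < count; i++) {
        char path[4096];
        snprintf(path, sizeof(path), "%s/%s", dir, names[i]);
        unlink(path);
        free(names[i]);
    }
    free(names);
}
```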
Thanks for the input @adilger ! drm now supports two modes.
The default behavior is that drm will readdir all entries to get the full list of files and then start to unlink. It sounds like this is the preferred mode given what you say here. Actually, in this case it calls mfu_flist_walk() to build the full list of files over the entire directory tree and then calls mfu_flist_unlink() to delete everything.
By the way, the mfu_flist_unlink() function has something like 6 different implementations that we tried. We settled on a particular version and commented the rest out, so it's handy to have the other versions around in case some file systems do better with different unlink schemes. If you have insight here, that'd be useful.
In v0.9, we just added a second mode that deletes files during the walk, which is enabled with --aggressive. We added this to help users who were deleting so many files that the full list could not fit in memory. This second mode calls unlink() immediately after readdir(), but it sounds like that will not be efficient on Lustre.
It probably makes sense to have a bit of a hybrid. Going from "load everything into memory" to "unlink each file immediately" is pretty drastic. Reading at least a whole directory at a time into memory should always be possible these days (even a 1B-file directory would be around 256GB of RAM), to avoid the worst of the lock contention. Probably best would be "prefetch a large number of entries into client(s) RAM, then distribute and do depth-first unlink, repeat as necessary", ideally with incremental buffer refill.
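Roughly like this (again a single-client sketch with illustrative names, ignoring the MPI distribution and incremental-refill pieces): fill a fixed-size batch of names from readdir(), unlink that batch as a separate pass, and repeat until the directory is empty, so memory stays bounded but readdir() and unlink() are still not interleaved entry by entry.

```c
/* Sketch of the hybrid: prefetch up to BATCH entries, unlink them as
 * a separate pass, and repeat.  Bounded memory, batched rather than
 * per-entry interleaving of readdir() and unlink(). */
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BATCH 100000   /* entries prefetched per round (illustrative) */

static void remove_dir_batched(const char* dir)
{
    char (*names)[256] = malloc(BATCH * sizeof(*names));
    if (names == NULL) return;

    for (;;) {
        /* phase 1: refill the batch from readdir() */
        DIR* dp = opendir(dir);
        if (dp == NULL) break;
        size_t count = 0;
        struct dirent* d;
        while (count < BATCH && (d = readdir(dp)) != NULL) {
            if (strcmp(d->d_name, ".") == 0 || strcmp(d->d_name, "..") == 0)
                continue;
            snprintf(names[count++], sizeof(names[0]), "%s", d->d_name);
        }
        closedir(dp);
        if (count == 0) break;    /* directory drained */

        /* phase 2: unlink the whole batch as a separate pass */
        for (size_t i = 0; i < count; i++) {
            char path[4096];
            snprintf(path, sizeof(path), "%s/%s", dir, names[i]);
            unlink(path);
        }
    }
    free(names);
}
```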