
dcp: more efficient task distribution for large files

Open · adilger opened this issue on Jul 03, 2019 · 4 comments

If many ranks are copying a large file in Lustre, the chunk distribution should prefer to assign chunks from the same stripe to a given rank to avoid DLM lock thrashing. The DLM lock thrashing is described in a comment in issue #114, but that ticket is related to directory locking, so the shared-file-write locking should be in a separate ticket.

Ideally, for Lustre, dcp would default to chunk_size == stripe_size and then distribute the work across stripe_count ranks in round-robin fashion, so that each rank is only ever copying a single object and there is no DLM lock contention.
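
Roughly what I have in mind, as a sketch (hypothetical helper, not existing mpiFileUtils code):

```c
/* Hypothetical sketch, not existing mpiFileUtils code.  With
 * chunk_size == stripe_size, chunk i of a file lies entirely within
 * Lustre object (i % stripe_count), so assigning chunk i to the rank
 * (i % stripe_count) means each rank only ever touches one object. */
#include <stdint.h>

static int chunk_to_rank(uint64_t chunk_index, uint64_t stripe_count,
                         int nranks)
{
    /* round-robin across stripe_count ranks, wrapped onto the job size */
    uint64_t object = chunk_index % stripe_count;
    return (int)(object % (uint64_t)nranks);
}
```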

I'd like to discuss possible approaches for implementing this, since I don't know anything about how the workload distribution is done today.

adilger avatar Jul 03 '19 18:07 adilger

Yes, that capability would be great to have. Each file may potentially have a different set of striping parameters, so I think we'd want to be able to associate the striping parameters with each file.

To describe how things currently work with dcp and dsync, files are logically sliced into chunks of a fixed chunk_size. Given a set of files in an flist, which may be one single large file or many files of arbitrary size, the ranks first compute the total number of chunks across all files. We define a 0-byte file to be one chunk; otherwise, the number of chunks in a file is ceil(file size / chunk_size). The chunks are then evenly assigned to processes in block form. As an example, imagine that the flist consists of the following on two ranks:

rank 0: file_0 (0 bytes), file_1 (1MB + 1)
rank 1: file_2 (0.5MB)

If chunk_size=1MB, then this would give the number of chunks as:

rank 0: file_0 (1 chunk), file_1 (2 chunks)
rank 1: file_2 (1 chunk)

A total of 4 chunks, so that each rank would be assigned 2. Each rank would then be responsible for the following file portions:

rank 0: file_0 (offset 0, length 0), file_1 (offset 0, length 1MB)
rank 1: file_1 (offset 1MB, length 1), file_2 (offset 0, length 0.5MB)

The math that does this is in this (long) function: https://github.com/hpc/mpifileutils/blob/2de9b3076cc7738eeae68a3d0000f408620bc308/src/common/mfu_flist_chunk.c#L38

It returns a linked list of file portions, where each element gives the file name, starting offset, and length of a portion that the rank is responsible for. If a process ends up with consecutive chunks from a file, we merge those into a single element in this linked list by extending the length appropriately.
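
To make that concrete, here is a rough sketch of the counting and block-assignment math described above (hypothetical names, not the actual mfu_flist_chunk.c code):

```c
/* Rough sketch of the chunk math described above; hypothetical names,
 * not the actual mfu_flist_chunk.c code.  A 0-byte file counts as one
 * chunk; otherwise a file of size s contributes ceil(s / chunk_size)
 * chunks.  The global list of chunks is then split into nearly equal
 * contiguous blocks, one block per rank. */
#include <stdint.h>

static uint64_t chunks_in_file(uint64_t size, uint64_t chunk_size)
{
    if (size == 0) {
        return 1;                                  /* 0-byte file is one chunk */
    }
    return (size + chunk_size - 1) / chunk_size;   /* ceil division */
}

/* Given the total number of chunks and the job size, return the
 * half-open range [first, first+count) of global chunk indices that
 * 'rank' is responsible for. */
static void block_range(uint64_t total, int nranks, int rank,
                        uint64_t* first, uint64_t* count)
{
    uint64_t base = total / (uint64_t)nranks;
    uint64_t rem  = total % (uint64_t)nranks;
    /* the first 'rem' ranks each get one extra chunk */
    *count = base + ((uint64_t)rank < rem ? 1 : 0);
    *first = (uint64_t)rank * base +
             ((uint64_t)rank < rem ? (uint64_t)rank : rem);
}
```

Applied to the example above, the 4 chunks split as [0,2) for rank 0 and [2,4) for rank 1, which gives exactly the file portions listed earlier.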

adammoody avatar Jul 03 '19 19:07 adammoody

While on the topic: to know whether each file was copied successfully, we need to know whether all portions of that file succeeded. To help with that, we have this function that executes a logical OR operation across all portions of a file: https://github.com/hpc/mpifileutils/blob/2de9b3076cc7738eeae68a3d0000f408620bc308/src/common/mfu_flist_chunk.c#L440

If any process hits an error, it can set the flag for its portion to 1. Then, this function will notify the process which owns that particular file in the flist that there was a problem.
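
As a simplified illustration of the idea (not the actual routine, which handles many files and reports back to each file's owner rank), a logical OR for a single shared file could look like:

```c
/* Simplified illustration only: each rank sets a flag if any of its
 * portions of the file failed, and a logical OR across ranks tells
 * everyone whether the file as a whole had a problem.  The real
 * mfu_flist_chunk.c routine does this per file and notifies the rank
 * that owns the file in the flist. */
#include <mpi.h>

int file_had_error(int my_portion_failed, MPI_Comm comm)
{
    int any_failed = 0;
    MPI_Allreduce(&my_portion_failed, &any_failed, 1, MPI_INT,
                  MPI_LOR, comm);
    return any_failed;
}
```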

adammoody avatar Jul 03 '19 19:07 adammoody

It surprises me that dcp would traverse and process all of the files before allocating work to ranks.

It seems like it would be better to have on-demand workload distribution: some processes crawl the tree to find directories and files, the files are queued for other ranks to copy, and each rank may chunk up a file but prefers to process the resulting chunks locally, as long as there is enough work to keep the other ranks busy.

The other ranks would only steal chunks from the rank initially processing a file if there were no more files or directories to process and they had copied all of the chunks in their own files (e.g., if a huge file is processed at the end of the traversal).
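
A rough sketch of the priority I have in mind, with made-up names (nothing here is existing mpiFileUtils or libcircle code):

```c
/* Made-up sketch of the proposed scheduling priority; not mpiFileUtils
 * or libcircle code.  A rank works on chunks of files it is already
 * copying, then on traversal work, and only steals chunks of another
 * rank's file once everything else is exhausted. */
enum task_kind {
    TASK_LOCAL_CHUNK,   /* chunk of a file this rank is already copying */
    TASK_WALK,          /* directory/file traversal work                */
    TASK_STEAL_CHUNK,   /* chunk stolen from another rank's file        */
    TASK_IDLE
};

struct rank_state {
    int local_chunks_pending;   /* chunks of files this rank is copying  */
    int walk_items_pending;     /* directories/files still to traverse   */
    int remote_chunks_pending;  /* chunks left on other ranks' files     */
};

static enum task_kind next_task(const struct rank_state *s)
{
    if (s->local_chunks_pending)  return TASK_LOCAL_CHUNK;
    if (s->walk_items_pending)    return TASK_WALK;
    if (s->remote_chunks_pending) return TASK_STEAL_CHUNK;  /* steal last */
    return TASK_IDLE;
}
```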

adilger avatar Jul 04 '19 00:07 adilger

Right, the original dcp (now named dcp1 in mpiFileUtils) uses the type of fully dynamic work distribution algorithm that you're describing here. It combines walking the directory structure, creating inodes, and copying file chunks all with dynamic work-stealing. And as you suggest, it does indeed perform better in a number of situations, especially since it can better overlap different stages like creating file inodes while simultaneously copying data chunks for other files.

In the past, we ran into some problems with Lustre lock contention when operating on a single large file with that algorithm. At the time, we couldn't work around it by changing the chunking as we expected.

As an alternative, we created dcp to break the work into phases and assign chunks for copying in this static fashion. That solved the Lustre lock contention we were running into at the time. Dividing the work into phases has some other benefits as well. We can easily time the cost to walk, create inodes, and copy data as separate phases, which helps identify bottlenecks (slow walk vs. slow create vs. slow copy?). It also makes it easy to skip steps: for example, if another tool has already done the walk and saved the information, dcp can skip the walk and jump straight to inode creation. Also, it's easier to track and record errors for each file when its chunks are distributed in this controlled way than when chunks are serviced by a random set of ranks. With random copying, each rank would need to report its success or failure to the file's owner, and to do that at scale you'd probably need a tree-based reduction. It's possible, but just not as easy.

Anyway, there is value in both approaches. Our long-term goal is to support both schemes in a single tool. The tool could either choose the best algorithm at run time, or the user could select which one to use through command-line options. For now, we just keep both options around as separate tools (dcp and dcp1).

Finally, one can imagine hybrid approaches. For example, we could still keep the walk, inode creation, and data copy tasks as distinct phases, but each of these tasks could internally use a dynamic work-stealing method via libcircle.

adammoody avatar Jul 04 '19 04:07 adammoody