InnerEye-DeepLearning
Add file synchronization support for multiple nodes
How can we synchronize files that are written during multi-node training?
- At the end of training, each node reads the file in question and turns it into a byte tensor
- Synchronize the tensor length, compute the maximum
- Each node then pads its tensor to the maximum length, and synchronizes that across all GPUs
- Node 0 can read the synced tensor, and knows the length of each tensor before padding. It can hence un-pad, turn the bytes back into a string, and recover the file contents.

All of this should be wrapped into a helper class that hides the mess:
- At the start of training, each node would create this wrapper, specifying a file name and what node rank it is running on.
- The wrapper would provide a temporary file path for the training loop to write contents into.
- When training is done, a .sync() method on the wrapper can be called, which runs the synchronization; node 0 would then write out the joined-up file contents.
- Additional flags could be added to strip off the first line of each file that is not from node 0 (the header line for CSV files)
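The byte-tensor round-trip described above (sync lengths, pad, gather, un-pad, decode) can be sketched as follows. This is a non-authoritative sketch; the function names are hypothetical, and the distributed branch is only exercised when `torch.distributed` has been initialized:

```python
from typing import List

import torch
import torch.distributed as dist


def string_to_tensor(text: str) -> torch.Tensor:
    # Encode the file contents as a 1-D uint8 tensor.
    return torch.tensor(list(text.encode("utf-8")), dtype=torch.uint8)


def pad_to_length(t: torch.Tensor, length: int) -> torch.Tensor:
    # Zero-pad so that all ranks contribute tensors of equal length.
    padding = torch.zeros(length - t.numel(), dtype=t.dtype)
    return torch.cat([t, padding])


def gather_file_contents(text: str) -> List[str]:
    """Returns the file contents from all ranks (on every rank, for simplicity)."""
    t = string_to_tensor(text)
    local_len = torch.tensor([t.numel()])
    if dist.is_available() and dist.is_initialized():
        world_size = dist.get_world_size()
        # Step 1: synchronize the lengths, so every rank knows the maximum.
        lengths = [torch.zeros_like(local_len) for _ in range(world_size)]
        dist.all_gather(lengths, local_len)
        max_len = max(int(l.item()) for l in lengths)
        # Step 2: pad to the maximum length and gather the padded tensors.
        padded = pad_to_length(t, max_len)
        gathered = [torch.zeros_like(padded) for _ in range(world_size)]
        dist.all_gather(gathered, padded)
        # Step 3: un-pad using the known per-rank lengths and decode.
        return [bytes(g[: int(l.item())].tolist()).decode("utf-8")
                for g, l in zip(gathered, lengths)]
    # Not distributed: there is only one "rank", contents unchanged.
    return [text]
```

Because the lengths are gathered before padding, rank 0 can cut each gathered tensor back to its true size, so zero padding bytes never leak into the decoded strings.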
Helper function for sync:

```python
import torch

from pl_bolts.models.self_supervised.simclr.simclr_module import SyncFunction


def synchronize_across_gpus(tensor: torch.Tensor) -> torch.Tensor:
    """
    Synchronizes a tensor across all GPUs, if distributed computation is enabled. The tensors
    from all GPUs are stacked up along the batch dimension (dim=0) using torch.cat. If no
    distributed setup is available, the argument is returned unchanged.

    :param tensor: The tensor that should be synchronized, of size [B, ...]
    :return: If torch.distributed is enabled, a tensor of size [B * num_GPUs, ...]. If not
        distributed, the argument of size [B, ...] unchanged.
    """
    if torch.distributed.is_available() and torch.distributed.is_initialized():
        return SyncFunction.apply(tensor)
    return tensor
```
(from https://github.com/microsoft/InnerEye-DeepLearning/blob/antonsc/diceloss/InnerEye/ML/models/losses/soft_dice.py)
Could I work on this?
Hi @aryasoni98, thanks for your interest in picking up this task! @ant0nsc has done a great job summarizing the requirements, but I'm happy to clarify further if needed.
For some context, so far we've dealt with the issue of multiple nodes writing to the same file by creating unique files per node (see here for example). The files are created within the lightning modules, so we retrieve the global rank from the trainer to create unique files. We don't yet have any code that syncs these files across nodes to create a single file.