Got empty batch when using multiple GPUs resuming from a checkpoint
When I resume training from a saved checkpoint with 4 GPUs, I get an empty batch samples=[{}] at the start of fetching in the main training loop (i.e., for i, samples in enumerate(progress):) on 1 of the 4 GPUs. If I use 1 GPU with the same code, there is no empty batch.
An even weirder phenomenon: I tested with the following (also resuming from the checkpoint):
for samples in progress:
    print(samples)
With 1 GPU the behavior was normal: it kept enumerating the dataloader and printing batches. But with 4 GPUs there were only 4 outputs, one of which was the empty batch [{}], as if the dataset contained only 3 batches (which is not true, since one epoch actually contains hundreds of batches).
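For reference, here is a minimal sketch of the check I put inside the loop to see which rank receives the empty batch (assuming torch.distributed has already been initialized by the launcher; the print format is just for my own debugging):

import torch.distributed as dist

rank = dist.get_rank() if dist.is_initialized() else 0

for i, samples in enumerate(progress):
    # samples is a list of sample dicts; the empty batch shows up as [{}]
    if all(len(s) == 0 for s in samples):
        print(f"rank {rank}: empty batch at iteration {i}")
    else:
        print(f"rank {rank}: iteration {i}, {len(samples)} sample dict(s)")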
Any clue about this issue? Thank you in advance.