fairseq

Got empty batch when using multiple gpus resuming from a checkpoint

Open · 18445864529 opened this issue 4 years ago • 3 comments

When I resume training from a saved checkpoint with 4 GPUs, in the main training loop (i.e., for i, samples in enumerate(progress):) I get an empty batch samples=[{}] at the very beginning of fetching on 1 out of the 4 GPUs. But if I run the same code with 1 GPU, there is no empty batch.
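To quantify this, here is a small diagnostic sketch I can drop into the loop (plain PyTorch distributed calls, nothing fairseq-specific; progress is the same iterator as above). It counts how many batches each rank actually receives after resuming and how many of them come through empty:

import torch.distributed as dist

n_batches, n_empty = 0, 0
for i, samples in enumerate(progress):
    n_batches += 1
    # samples arrives as a list of sample dicts; [{}] is the empty case above
    if all(len(s) == 0 for s in samples):
        n_empty += 1
    # ... the usual train step on samples would go here ...

if dist.is_available() and dist.is_initialized():
    per_rank = [None] * dist.get_world_size()
    dist.all_gather_object(per_rank, (dist.get_rank(), n_batches, n_empty))
    if dist.get_rank() == 0:
        print("per-rank (rank, n_batches, n_empty):", per_rank)
else:
    print("single process:", n_batches, "batches,", n_empty, "empty")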

An even weirder phenomenon: I also tested the following (again resuming from the checkpoint):

for samples in progress: 
    print(samples)

With 1 GPU the behavior was normal: it kept iterating over the dataloader and printing batches. But with 4 GPUs there were only 4 outputs in total, one of which was the empty batch [{}], as if the dataset contained only 3 batches (which is not true; one epoch actually contains hundreds of batches).
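As a temporary check, I can guard the loop against the empty batch like this (a workaround sketch, not a fix for the underlying cause; skipping a step on only one rank can desynchronize collective ops across GPUs, so I only use it to isolate the problem):

for i, samples in enumerate(progress):
    if not samples or all(len(s) == 0 for s in samples):
        # empty/dummy batch on this rank after resuming from the checkpoint
        continue
    # ... run the normal training step on samples here ...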

Any clue about this issue? Thank you in advance.

18445864529 · Nov 04 '21 09:11