fairseq

Got empty batch when using multiple gpus resuming from a checkpoint

Open · 18445864529 opened this issue 4 years ago • 3 comments

When I resume training from a saved checkpoint with 4 GPUs, in the main training loop (i.e., for i, samples in enumerate(progress):) I get an empty batch samples=[{}] at the very beginning of fetching on 1 out of the 4 GPUs. But if I run the same code with 1 GPU, there is no empty batch.
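To quantify this, here is a small diagnostic sketch I can drop into the loop (plain PyTorch distributed calls, nothing fairseq-specific; progress is the same iterator as above). It counts how many batches each rank actually receives after resuming and how many of them come through empty:

import torch.distributed as dist

n_batches, n_empty = 0, 0
for i, samples in enumerate(progress):
    n_batches += 1
    # samples arrives as a list of sample dicts; [{}] is the empty case above
    if all(len(s) == 0 for s in samples):
        n_empty += 1
    # ... the usual train step on samples would go here ...

if dist.is_available() and dist.is_initialized():
    per_rank = [None] * dist.get_world_size()
    dist.all_gather_object(per_rank, (dist.get_rank(), n_batches, n_empty))
    if dist.get_rank() == 0:
        print("per-rank (rank, n_batches, n_empty):", per_rank)
else:
    print("single process:", n_batches, "batches,", n_empty, "empty")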

An even weirder phenomenon: I also tested the following (again resuming from the checkpoint):

for samples in progress: 
    print(samples)

With 1 GPU the behavior was normal: it kept iterating over the dataloader and printing batches. But with 4 GPUs there were only 4 outputs in total, one of which was the empty batch [{}], as if the dataset contained only 3 batches (which is not true; one epoch actually contains hundreds of batches).
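As a temporary check, I can guard the loop against the empty batch like this (a workaround sketch, not a fix for the underlying cause; skipping a step on only one rank can desynchronize collective ops across GPUs, so I only use it to isolate the problem):

for i, samples in enumerate(progress):
    if not samples or all(len(s) == 0 for s in samples):
        # empty/dummy batch on this rank after resuming from the checkpoint
        continue
    # ... run the normal training step on samples here ...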

Any clue about this issue? Thank you in advance.

18445864529 · Nov 04 '21 09:11