
Should we use BackgroundGenerator when we already have DataLoader?

Open yzhang1918 opened this issue 5 years ago • 6 comments

I really enjoy this guide! However, I am not sure what the advantage of prefetch_generator is. It seems that the DataLoader in PyTorch already supports prefetching.

Thank you!

yzhang1918 avatar Apr 29 '19 10:04 yzhang1918

To the best of my knowledge, the DataLoader in PyTorch creates a set of worker threads which all prefetch new data at once when all workers are empty.

So if, for example, you create 8 worker threads:

  1. All 8 threads prefetch data
  2. Until you empty all of them (for example, by running 8 training iterations), none of the workers fetches new data

Using the prefetch generator, we make sure that each of those workers always has at least one additional data item loaded.

You can see this behavior if you create a very shallow network.
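
For intuition, here is a minimal sketch of what a background prefetcher along the lines of prefetch_generator's BackgroundGenerator does (illustrative only, not the library's actual implementation): a worker thread keeps pulling items from the wrapped iterator into a small bounded queue, so the next batch is usually ready the moment the training loop asks for it.

```python
import queue
import threading

class SimpleBackgroundGenerator:
    """Illustrative sketch: prefetch items from `generator` in a background thread."""

    def __init__(self, generator, max_prefetch=1):
        self.queue = queue.Queue(max_prefetch)  # bounded buffer of prefetched items
        self.thread = threading.Thread(target=self._worker, args=(generator,), daemon=True)
        self.thread.start()

    def _worker(self, generator):
        for item in generator:
            self.queue.put(item)   # blocks while the buffer is full
        self.queue.put(None)       # sentinel: generator exhausted

    def __iter__(self):
        return self

    def __next__(self):
        item = self.queue.get()
        if item is None:
            raise StopIteration
        return item
```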

I have here two colab notebooks (based on the CIFAR10 example from the official tutorial):

Here with data loader and 2 workers: https://colab.research.google.com/drive/10wJIfCw5moPc-Yx9rSqWFEXkNceAOPpc

Here with the additional prefetch_generator: https://colab.research.google.com/drive/1WQ8c-RIZ7FMhfsm8dtRpsqiIR_KuZ49Z

| Output without prefetch_generator | Output with prefetch_generator |
| --- | --- |
| Compute efficiency: 0.09, iter 1 | Compute efficiency: 0.61, iter 1 |
| Compute efficiency: 0.98, iter 2 | Compute efficiency: 0.99, iter 2 |
| Compute efficiency: 0.61, iter 3 | Compute efficiency: 0.98, iter 3 |
| Compute efficiency: 0.98, iter 4 | Compute efficiency: 0.99, iter 4 |
| Compute efficiency: 0.67, iter 5 | Compute efficiency: 0.99, iter 5 |
| Compute efficiency: 0.71, iter 6 | Compute efficiency: 0.99, iter 6 |
| Avg time per epoch: 328ms | Avg time per epoch: 214ms |

This is why keeping track of compute time vs. data loading time (a.k.a. compute efficiency) is important. In this simple example, we even save a lot of training time.
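
For reference, compute efficiency can be tracked by timing data loading separately from the actual training step. The sketch below is only a rough illustration (the names model, loader, etc. are placeholders, and GPU timings are approximate because CUDA kernels launch asynchronously), not the exact code from the notebooks above:

```python
import time

def train_one_epoch(model, loader, optimizer, criterion, device):
    model.train()
    data_time, compute_time = 0.0, 0.0
    batch_start = time.time()
    for inputs, targets in loader:
        data_time += time.time() - batch_start       # time spent waiting for the batch

        compute_start = time.time()
        inputs, targets = inputs.to(device), targets.to(device)
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()
        optimizer.step()
        compute_time += time.time() - compute_start  # time spent in the training step

        batch_start = time.time()

    print(f"Compute efficiency: {compute_time / (compute_time + data_time):.2f}")
```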

If anyone knows how to fix this behavior in the PyTorch data loader, let me know :)

IgorSusmelj avatar Apr 30 '19 07:04 IgorSusmelj

Thank you for your wonderful example! Now I use the following class to replace the default DataLoader everywhere in my code. XD

from torch.utils.data import DataLoader
from prefetch_generator import BackgroundGenerator

class DataLoaderX(DataLoader):
    """DataLoader whose iterator prefetches batches in a background thread."""

    def __iter__(self):
        return BackgroundGenerator(super().__iter__())
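
Used as a drop-in replacement for DataLoader (the dataset and the parameter values below are just placeholders):

```python
# Same constructor arguments as a regular DataLoader; only __iter__ changes.
train_loader = DataLoaderX(train_dataset, batch_size=128, shuffle=True,
                           num_workers=4, pin_memory=True)

for inputs, targets in train_loader:
    ...  # training step as usual
```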

yzhang1918 avatar May 23 '19 06:05 yzhang1918

I had a problem using BackgroundGenerator with PyTorch Distributed Data Parallel (DDP). When I turned both DDP and BackgroundGenerator on and iterated over the dataloader, processes that were not rank 0 loaded something onto the rank-0 GPU. I solved this issue by turning off BackgroundGenerator when using DDP.
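
If you want to keep the prefetcher for single-process runs and only fall back to the plain iterator under DDP (i.e. automate the workaround described above), something along these lines should work; this is just a sketch and does not fix the underlying device-placement issue:

```python
import torch.distributed as dist
from torch.utils.data import DataLoader
from prefetch_generator import BackgroundGenerator

class DataLoaderX(DataLoader):
    def __iter__(self):
        # Skip the background thread under DDP, where it was observed to
        # load data onto the rank-0 GPU from every process.
        if dist.is_available() and dist.is_initialized():
            return super().__iter__()
        return BackgroundGenerator(super().__iter__())
```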

ryul99 avatar Jun 23 '20 10:06 ryul99

> the DataLoader in PyTorch creates a set of worker threads

Technically no, it creates worker processes.

> Until you empty all of them (for example, by running 8 training iterations), none of the workers fetches new data

PyTorch does not do this.

> I have here two colab notebooks (based on the CIFAR10 example from the official tutorial): Here with data loader and 2 workers: https://colab.research.google.com/drive/10wJIfCw5moPc-Yx9rSqWFEXkNceAOPpc Here with the additional prefetch_generator:

This is a flawed benchmark that doesn't actually show the importance of prefetching: it runs fastest without any prefetching at all. With num_workers=0 and no BackgroundGenerator, it prints 150ms, faster than what either colab notebook reports.
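
For reference, the no-prefetching baseline mentioned here is just the synchronous loader from the tutorial with num_workers=0 (the dataset and batch size below are assumptions about that setup):

```python
from torch.utils.data import DataLoader

# Synchronous baseline: batches are loaded in the main process,
# with no worker processes and no BackgroundGenerator.
trainloader = DataLoader(trainset, batch_size=4, shuffle=True, num_workers=0)
```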

ppwwyyxx avatar Dec 12 '20 11:12 ppwwyyxx

A quick update on this one. PyTorch 1.7 introduced a configurable prefetching parameter for the DataLoader: https://pytorch.org/docs/stable/data.html

I haven't done any benchmarking yet, but I can imagine that the integrated prefetching makes prefetch_generator obsolete for PyTorch.
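
For completeness, the built-in knob is the prefetch_factor argument of DataLoader (the number of batches loaded in advance by each worker, only valid with num_workers > 0); the dataset and values below are placeholders:

```python
from torch.utils.data import DataLoader

# Each of the 4 workers keeps up to 4 batches loaded in advance.
train_loader = DataLoader(train_dataset, batch_size=128, shuffle=True,
                          num_workers=4, prefetch_factor=4, pin_memory=True)
```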

IgorSusmelj avatar Dec 18 '20 20:12 IgorSusmelj

> I had a problem using BackgroundGenerator with PyTorch Distributed Data Parallel (DDP). When I turned both DDP and BackgroundGenerator on and iterated over the dataloader, processes that were not rank 0 loaded something onto the rank-0 GPU. I solved this issue by turning off BackgroundGenerator when using DDP.

I got exactly the same problem, but turning off BackgroundGenerator in DDP makes the data sampling phase much slower. Are there any better solutions for this?

DZ9 avatar Mar 16 '21 08:03 DZ9