
About the dataloader returned by prepare()

Open shliu0 opened this issue 1 year ago • 6 comments

In the tutorial, it is mentioned that some data at the end of the dataset may be duplicated so the batch can be divided equally among all workers. So if my train dataset size cannot be divided evenly by the number of GPUs, will the dataloader returned by prepare() include duplicated data during training? Won't this affect model performance (loss, etc.), since the training set effectively contains extra data? If it causes a large difference, is there any way to exclude these duplicated samples when calculating the loss?

shliu0 avatar Jan 09 '24 06:01 shliu0

So if my train dataset size cannot be divided evenly by the number of GPUs, will the dataloader returned by prepare() include duplicated data during training?

Yes, it will include duplicated data. With 5 processes and 24 datapoints, you can see in the following example that datapoint 0 is duplicated on processes 0 and 4:

from accelerate import Accelerator
from torch.utils.data import DataLoader

accelerator = Accelerator()
dataloader = DataLoader(list(range(24)), shuffle=False, batch_size=1)
dataloader = accelerator.prepare(dataloader)
for batch in dataloader:
    print(batch)

### will return 

tensor([3], device='cuda:3')
tensor([8], device='cuda:3')
tensor([13], device='cuda:3')
tensor([18], device='cuda:3')
tensor([23], device='cuda:3')
tensor([2], device='cuda:2')
tensor([7], device='cuda:2')
tensor([12], device='cuda:2')
tensor([17], device='cuda:2')
tensor([22], device='cuda:2')
tensor([4], device='cuda:4')
tensor([9], device='cuda:4')
tensor([14], device='cuda:4')
tensor([19], device='cuda:4')
tensor([0], device='cuda:4')
tensor([0], device='cuda:0')
tensor([5], device='cuda:0')
tensor([10], device='cuda:0')
tensor([15], device='cuda:0')
tensor([20], device='cuda:0')
tensor([1], device='cuda:1')
tensor([6], device='cuda:1')
tensor([11], device='cuda:1')
tensor([16], device='cuda:1')
tensor([21], device='cuda:1')

Won't this affect model performance (loss, etc.), since the training set effectively contains extra data?

It might, but the impact will be very low, since only a small portion of the data is duplicated. The maximum number of duplicated samples is on the order of the number of processes, which is small compared to the dataset size. If you really don't want any duplicated data, the easiest approach is to make sure the dataset size is a multiple of the number of processes.
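For example, trimming the dataset down to a multiple of the effective global batch size avoids padding entirely. A minimal sketch (`trim_to_multiple` is a hypothetical helper, not part of Accelerate):

```python
def trim_to_multiple(data, num_processes, batch_size=1):
    # Drop the trailing remainder so the samples split evenly
    # across processes and no padding/duplication is needed.
    global_batch = num_processes * batch_size
    return data[: len(data) - len(data) % global_batch]

trimmed = trim_to_multiple(list(range(24)), num_processes=5)
print(len(trimmed))  # 20: the last 4 samples are dropped instead of padded
```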

SunMarc avatar Jan 09 '24 17:01 SunMarc

Thanks a lot! By the way, the maximum number of duplicated samples is #processes * #batch_size - 1, rather than #processes, right? As in the following example with 3 processes, 10 datapoints, and batch_size=3, I got 8 duplicated datapoints. If #processes * #batch_size - 1 is not small compared to the dataset size (as in this case), would setting shuffle=True and drop_last=True be an alternative solution?

from accelerate import Accelerator
from torch.utils.data import DataLoader

accelerator = Accelerator()
dataloader = DataLoader(list(range(10)), shuffle=False, batch_size=3)
dataloader = accelerator.prepare(dataloader)
for batch in dataloader:
    print(batch)

# will return 
tensor([0, 1, 2], device='cuda:0')
tensor([9, 0, 1], device='cuda:0')
tensor([3, 4, 5], device='cuda:1')
tensor([2, 3, 4], device='cuda:1')
tensor([6, 7, 8], device='cuda:2')
tensor([5, 6, 7], device='cuda:2')
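The padding the example shows can be computed directly: Accelerate rounds the dataset up to a whole number of global batches of `num_processes * batch_size` samples, so the maximum padding is indeed `num_processes * batch_size - 1`. A sketch of the arithmetic, assuming this rounding behavior:

```python
import math

def num_duplicates(n_samples, num_processes, batch_size):
    # The sharded dataloader is padded so every process sees the same
    # number of full batches; the padding is duplicated data.
    global_batch = num_processes * batch_size
    return math.ceil(n_samples / global_batch) * global_batch - n_samples

print(num_duplicates(24, 5, 1))  # 1 duplicate (the first example above)
print(num_duplicates(10, 3, 3))  # 8 duplicates (this example)
```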

shliu0 avatar Jan 10 '24 02:01 shliu0

Yes, generally that’s what we recommend doing; then during validation we drop the extra samples in gather_for_metrics for an accurate calculation.
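The dedup during validation can be pictured as truncating the gathered results back to the true dataset length, since the padded duplicates sit at the tail of the final gathered batch. This is only a sketch of the idea, not Accelerate's actual gather_for_metrics implementation:

```python
def drop_padding(gathered, dataset_len):
    # After gathering across processes, the tail of the last gathered
    # batch may contain duplicated (padded) samples; keep only the
    # first `dataset_len` real samples.
    return gathered[:dataset_len]

# 10 real samples padded to 12 across processes; the last 2 are duplicates.
print(drop_padding(list(range(10)) + [0, 1], dataset_len=10))  # [0, ..., 9]
```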

muellerzr avatar Jan 10 '24 02:01 muellerzr

I think passing drop_last=True to DataLoader may cause a problem after prepare(). In the following example, the gathered batch is expected to be [0, 1, 2, 3, 4, 5, 6, 7, 8], but [0, 1, 2, 3, 4, 5, 6, 7] is returned:

from accelerate import Accelerator
from torch.utils.data import DataLoader

accelerator = Accelerator()
dataloader = DataLoader(list(range(17)), shuffle=False, batch_size=3, drop_last=True)
dataloader = accelerator.prepare(dataloader)
for epoch in range(1):
    for batch in dataloader:
        print(f"epoch-{epoch},{batch}")
        all_batch = accelerator.gather_for_metrics(batch)
        if accelerator.is_main_process:
            print(f"epoch-{epoch},{all_batch}")
    accelerator.wait_for_everyone()

# will return
epoch-0,tensor([0, 1, 2], device='cuda:0')
epoch-0,tensor([3, 4, 5], device='cuda:1')
epoch-0,tensor([6, 7, 8], device='cuda:2')
epoch-0,tensor([0, 1, 2, 3, 4, 5, 6, 7], device='cuda:0')
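A plausible source of the off-by-one (an assumption about Accelerate internals, not verified here): gather_for_metrics appears to truncate the final gathered batch to `len(dataset) % (num_processes * batch_size)` samples, which does not account for the sample already discarded by drop_last=True:

```python
num_processes, batch_size, n_samples = 3, 3, 17
global_batch = num_processes * batch_size  # 9 samples per global step
remainder = n_samples % global_batch       # 17 % 9 = 8
# If the last gathered batch is truncated to `remainder` samples,
# [0..8] becomes [0..7], matching the output above, even though
# drop_last=True already removed the leftover samples at the batch level.
print(remainder)  # 8
```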

shliu0 avatar Jan 10 '24 03:01 shliu0

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Feb 08 '24 15:02 github-actions[bot]

> I think passing drop_last=True through DataLoader may cause some problem after prepare(); the gathered batch is expected to be [0, 1, 2, 3, 4, 5, 6, 7, 8], rather than [0, 1, 2, 3, 4, 5, 6, 7] (see the example above).

Can someone please take a look at this?

shliu0 avatar Mar 06 '24 07:03 shliu0

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Mar 30 '24 15:03 github-actions[bot]