accelerate
About the dataloader returned by prepare()
In the tutorial, it is mentioned that some data at the end of the dataset may be duplicated so the batch can be divided equally among all workers. So if my training dataset size cannot be divided evenly by the number of GPUs, will the dataloader returned by prepare() include duplicated data during training? Won't this affect model performance (loss, etc.), since it effectively adds more data to the training set? If it causes a large difference, is there any way to exclude these duplicated samples from the loss calculation?
So if my training dataset size cannot be divided evenly by the number of GPUs, will the dataloader returned by prepare() include duplicated data during training?
Yes, it will include duplicated data. With 5 processes and 24 datapoints, you can see in the following example that datapoint 0 has been duplicated, appearing on both process 0 and process 4:
from accelerate import Accelerator
from torch.utils.data import DataLoader
accelerator = Accelerator()
dataloader = DataLoader(list(range(24)), shuffle=False, batch_size=1)
dataloader = accelerator.prepare(dataloader)
for batch in dataloader:
    print(batch)
# will return
tensor([3], device='cuda:3')
tensor([8], device='cuda:3')
tensor([13], device='cuda:3')
tensor([18], device='cuda:3')
tensor([23], device='cuda:3')
tensor([2], device='cuda:2')
tensor([7], device='cuda:2')
tensor([12], device='cuda:2')
tensor([17], device='cuda:2')
tensor([22], device='cuda:2')
tensor([4], device='cuda:4')
tensor([9], device='cuda:4')
tensor([14], device='cuda:4')
tensor([19], device='cuda:4')
tensor([0], device='cuda:4')
tensor([0], device='cuda:0')
tensor([5], device='cuda:0')
tensor([10], device='cuda:0')
tensor([15], device='cuda:0')
tensor([20], device='cuda:0')
tensor([1], device='cuda:1')
tensor([6], device='cuda:1')
tensor([11], device='cuda:1')
tensor([16], device='cuda:1')
tensor([21], device='cuda:1')
Won't this affect model performance (loss, etc.), since it effectively adds more data to the training set?
It might, but the impact will be very low since only a small portion of the data is duplicated. The maximum number of duplicated datapoints is the number of processes, which is very small compared to the size of the dataset. If you really don't want any duplicated data, the easiest way is to make sure that the number of datapoints is divisible by the number of processes.
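For example, here is a minimal sketch of that workaround, using a per-process batch_size of 1 as in the example above (the trimming logic is just illustrative, not an Accelerate API):
from accelerate import Accelerator
from torch.utils.data import DataLoader
accelerator = Accelerator()
data = list(range(26))
# drop the trailing samples so the dataset length is divisible by the number of processes
usable_length = len(data) - len(data) % accelerator.num_processes
dataloader = DataLoader(data[:usable_length], shuffle=False, batch_size=1)
dataloader = accelerator.prepare(dataloader)  # no samples need to be duplicated now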
Thanks a lot! By the way, the maximum number of duplicated datapoints is #processes * #batch_size - 1, rather than #processes, right? As in the following example, with 3 processes, 10 datapoints, and batch_size=3, I got 8 duplicated datapoints. If #processes * #batch_size - 1 is not very small compared to the number of datapoints (as in this case), would setting shuffle=True and drop_last=True be an alternative solution?
from accelerate import Accelerator
from torch.utils.data import DataLoader
accelerator = Accelerator()
dataloader = DataLoader(list(range(10)), shuffle=False, batch_size=3)
dataloader = accelerator.prepare(dataloader)
for batch in dataloader:
    print(batch)
# will return
tensor([0, 1, 2], device='cuda:0')
tensor([9, 0, 1], device='cuda:0')
tensor([3, 4, 5], device='cuda:1')
tensor([2, 3, 4], device='cuda:1')
tensor([6, 7, 8], device='cuda:2')
tensor([5, 6, 7], device='cuda:2')
Yes, generally that's what we recommend doing, and then during validation we drop the extra samples via gather_for_metrics for an accurate calculation.
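For reference, a minimal sketch of that validation pattern (the toy model, dataset, and prediction logic below are placeholders, not from this thread):
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator
accelerator = Accelerator()
# toy model and dataset, purely for illustration
model = nn.Linear(4, 2)
dataset = TensorDataset(torch.randn(10, 4), torch.randint(0, 2, (10,)))
eval_dataloader = DataLoader(dataset, batch_size=3)
model, eval_dataloader = accelerator.prepare(model, eval_dataloader)
model.eval()
all_preds, all_labels = [], []
for inputs, labels in eval_dataloader:
    with torch.no_grad():
        preds = model(inputs).argmax(dim=-1)
    # gather_for_metrics gathers across processes and drops the samples that were
    # duplicated to pad the last batch, so each datapoint is counted exactly once
    all_preds.append(accelerator.gather_for_metrics(preds))
    all_labels.append(accelerator.gather_for_metrics(labels))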
I think passing drop_last=True to DataLoader may cause a problem after prepare(). As shown below, the gathered batch is expected to be [0, 1, 2, 3, 4, 5, 6, 7, 8], rather than [0, 1, 2, 3, 4, 5, 6, 7]:
from accelerate import Accelerator
from torch.utils.data import DataLoader
accelerator = Accelerator()
dataloader = DataLoader(list(range(17)), shuffle=False, batch_size=3, drop_last=True)
dataloader = accelerator.prepare(dataloader)
for epoch in range(1):
    for batch in dataloader:
        print(f"epoch-{epoch},{batch}")
        all_batch = accelerator.gather_for_metrics(batch)
        if accelerator.is_main_process:
            print(f"epoch-{epoch},{all_batch}")
        accelerator.wait_for_everyone()
# will return
epoch-0,tensor([0, 1, 2], device='cuda:0')
epoch-0,tensor([3, 4, 5], device='cuda:1')
epoch-0,tensor([6, 7, 8], device='cuda:2')
epoch-0,tensor([0, 1, 2, 3, 4, 5, 6, 7], device='cuda:0')
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Can someone please take a look at this?