
[bug] dgl.dataloading.DataLoader forwards args to torch.utils.data.DataLoader but many are not valid

Open · nv-dlasalle opened this issue 3 years ago · 1 comment

🐛 Bug

Currently, DGL's DataLoader is a subclass of PyTorch's DataLoader and forwards kwargs to it. However, DGL's dataloader does several things under the hood that break many of these options, and no indication is given to the user of which options are valid and which are not. Furthermore, CI covers only a fraction of the possible options users may pass in.

Current problematic options:

  • sampler and batch_sampler: Specifying either of these results in errors about IterableDataset. Our documentation explicitly references sampler in the description of the use_ddp argument.
  • collate_fn: Results in a duplicate keyword error.
  • pin_memory: Results in a duplicate keyword error, because the superclass is called with pin_memory=self.pin_prefetcher.
  • pin_memory_device (new in 1.12): Only affects the input to the _PrefetchingIterator, which may not affect the data as returned to the user.
  • generator: Has no effect.
  • timeout: While this works relatively correctly with respect to the workers, when a prefetching thread is active it is possible to block on the dataloader for longer than the specified timeout, which would be unexpected from the user's perspective.
  • worker_init_fn: Results in a duplicate keyword error; this appears to be a bug, as we attempt to wrap the passed-in function correctly but fail to delete it from the kwargs.
  • batch_size=None: It would be odd for a user to pass this, given we don't provide a way to manually batch data, but our documentation should state whether it is supported, as it is valid for PyTorch's DataLoader.
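The duplicate keyword errors above all come from the same pattern: a subclass that forwards `**kwargs` to `super().__init__()` while also setting one of those keywords itself. A minimal sketch (toy classes, not DGL's actual code) reproduces the failure mode:

```python
# Toy reproduction of the "got multiple values for keyword argument" failure
# that occurs when a wrapper both forwards **kwargs and sets a keyword itself.

class BaseLoader:
    def __init__(self, dataset, collate_fn=None, pin_memory=False):
        self.dataset = dataset
        self.collate_fn = collate_fn
        self.pin_memory = pin_memory

class WrappingLoader(BaseLoader):
    def __init__(self, dataset, **kwargs):
        # The wrapper always sets collate_fn itself, so a user-supplied
        # collate_fn in kwargs collides with it in the super().__init__ call.
        super().__init__(dataset, collate_fn=lambda batch: batch, **kwargs)

try:
    WrappingLoader([1, 2, 3], collate_fn=list)
except TypeError as e:
    print(e)  # ... got multiple values for keyword argument 'collate_fn'
```

The same collision explains the `pin_memory` and `worker_init_fn` items above.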

There is also the question of how to maintain compatibility when PyTorch's DataLoader is updated. For example, the pin_memory_device option was just added in 1.12.
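One hedged sketch of handling this: gate version-dependent keywords on the installed PyTorch version before forwarding them. `parse_version` and `forwardable_kwargs` below are illustrative helpers, not DGL APIs (real code might use `packaging.version` instead of hand-rolled parsing):

```python
# Hypothetical sketch: drop kwargs the installed torch.utils.data.DataLoader
# cannot accept, based on a version check. Helper names are illustrative.

def parse_version(v):
    # "1.12.0+cu113" -> (1, 12, 0); ignores local/build suffixes.
    core = v.split("+")[0]
    return tuple(int(p) for p in core.split(".")[:3])

def forwardable_kwargs(torch_version, user_kwargs):
    """Return a copy of user_kwargs with unsupported options removed."""
    kwargs = dict(user_kwargs)
    if parse_version(torch_version) < (1, 12, 0):
        kwargs.pop("pin_memory_device", None)  # added in PyTorch 1.12
    return kwargs

print(forwardable_kwargs("1.11.0", {"pin_memory_device": "cuda"}))  # → {}
```

Silently dropping an option is itself debatable; warning or raising would likely be better, which is an argument for the explicit-signature fix below.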

Proposed fix

I think the best way to fix this is to remove the kwargs parameter from our DataLoader and instead capture every argument explicitly. This way we can limit, verify, and test the different argument combinations passed in. It will also make our documentation easier to read.
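A minimal sketch of this direction (the class name and parameter list below are illustrative, not DGL's actual signature): with an explicit signature, unsupported options fail fast with a clear TypeError instead of being silently forwarded.

```python
# Illustrative sketch of the proposed fix: enumerate every supported option
# explicitly instead of accepting **kwargs. Parameter names are assumptions.

class ExplicitDataLoader:
    def __init__(self, dataset, batch_size=1, shuffle=False, drop_last=False,
                 num_workers=0, worker_init_fn=None, use_ddp=False):
        if batch_size is None:
            # Manual batching is not supported, so reject it explicitly.
            raise ValueError("batch_size=None (manual batching) is not supported")
        self.dataset = dataset
        self.batch_size = batch_size
        self.shuffle = shuffle
        self.drop_last = drop_last
        self.num_workers = num_workers
        self.worker_init_fn = worker_init_fn
        self.use_ddp = use_ddp

# Unsupported options now fail immediately at construction time:
try:
    ExplicitDataLoader([1, 2], collate_fn=list)
except TypeError as e:
    print(e)  # ... unexpected keyword argument 'collate_fn'
```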

Alternative fix

We could try to keep documentation on what parameters are valid, but we may need separate documentation per PyTorch version.

nv-dlasalle avatar Sep 02 '22 00:09 nv-dlasalle

This issue has been automatically marked as stale due to lack of activity. It will be closed if no further activity occurs. Thank you

github-actions[bot] avatar Oct 02 '22 01:10 github-actions[bot]

Any updates? I also have the problem of passing worker_init_fn.

I managed to remove the **kwargs, and the duplicate keyword error is gone. However, worker_init_fn still does not seem to work.
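For reference, the wrap-and-remove pattern the original report describes (wrap the user's worker_init_fn, then pop the original key before forwarding) can be sketched with toy classes like this; it is not DGL's real implementation:

```python
# Toy sketch of correctly wrapping a user-supplied worker_init_fn: pop it from
# kwargs before forwarding, so the superclass never sees the keyword twice.

class BaseLoader:
    def __init__(self, worker_init_fn=None, **kwargs):
        self.worker_init_fn = worker_init_fn

class FixedLoader(BaseLoader):
    def __init__(self, **kwargs):
        user_fn = kwargs.pop("worker_init_fn", None)  # the missing `pop`

        def wrapped(worker_id):
            # ... loader-specific per-worker setup would go here ...
            if user_fn is not None:
                user_fn(worker_id)

        super().__init__(worker_init_fn=wrapped, **kwargs)

calls = []
loader = FixedLoader(worker_init_fn=calls.append)
loader.worker_init_fn(0)  # simulate a worker starting
print(calls)  # → [0]
```

Forgetting the `pop` reproduces the duplicate keyword error; forgetting to call `user_fn` inside the wrapper reproduces the "does not seem to work" symptom.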

llCurious avatar Mar 02 '23 15:03 llCurious

> Any updates? I also have the problem of passing worker_init_fn.
>
> I managed to remove the **kwargs, and the duplicate keyword error is gone. However, worker_init_fn still does not seem to work.

This is being addressed in #5420.

BarclayII avatar Mar 03 '23 10:03 BarclayII

Thanks for your reply. The PR above indeed solves the duplicate keyword error.

llCurious avatar Mar 03 '23 10:03 llCurious