
Tutorial on custom dataloaders (NOT datasets)

Open mhdadk opened this issue 4 years ago • 9 comments

I really like this tutorial on custom datasets. However, the torch.utils.data.DataLoader class is only briefly mentioned in it:

However, we are losing a lot of features by using a simple for loop to iterate over the data. In particular, we are missing out on:

  • Batching the data
  • Shuffling the data
  • Loading the data in parallel using multiprocessing workers.

torch.utils.data.DataLoader is an iterator which provides all these features. Parameters used below should be clear. One parameter of interest is collate_fn . You can specify how exactly the samples need to be batched using collate_fn . However, default collate should work fine for most use cases.
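
For reference, here is a minimal sketch of what that quoted passage describes (the ToyDataset below is a made-up stand-in, not the dataset used in the tutorial):

```python
import torch
from torch.utils.data import Dataset, DataLoader

# Hypothetical map-style dataset returning (image, label) pairs.
class ToyDataset(Dataset):
    def __init__(self, n=100):
        self.images = torch.randn(n, 3, 32, 32)
        self.labels = torch.randint(0, 10, (n,))

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        return self.images[idx], self.labels[idx]

if __name__ == '__main__':
    # Batching, shuffling, and parallel loading all come from DataLoader arguments.
    loader = DataLoader(ToyDataset(), batch_size=4, shuffle=True, num_workers=2)
    for images, labels in loader:
        pass  # images: (4, 3, 32, 32), labels: (4,)
```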

I am aware of this issue and this issue, but neither has led to a tutorial.

I am happy to make a tutorial on custom dataloaders using the torch.utils.data.DataLoader class, focusing on how to work with its parameters, especially num_workers and collate_fn. Also, I am not sure whether it is possible to inherit from the torch.utils.data.DataLoader class in the same way as from torch.utils.data.Dataset, so I would appreciate some guidance on this.

This would be my first ever tutorial, so some guidance on formatting would be greatly appreciated.

cc @suraj813 @sekyondaMeta @svekars @carljparker @NicolasHug @kit1980 @subramen

mhdadk avatar Jun 03 '20 13:06 mhdadk

I like the general idea of a tutorial that explains some advanced use cases with custom collate functions or samplers. I'm not sure if it would fit into a single tutorial or would be better as separate ones.

CC @jlin27 what do you think?

ptrblck avatar Jun 09 '20 03:06 ptrblck

/assigntome

zabboud avatar Jun 02 '23 01:06 zabboud

@mhdadk - As I'm working on this issue, I just wanted to check in with you regarding your initial request. Would this be just to clarify how to customize collate_fn and how to use parallel processing with num_workers correctly? (Shuffling the data would simply be a matter of passing shuffle=True, unless you had something else in mind.)

Currently, I have added additional explanation of the usage of collate_fn and num_workers within the same tutorial. I have seen many instances on the PyTorch forums asking for more clarity on custom collate_fn functions, so I would expand on that part. I just wanted to check with you whether the following is okay, or whether more detail would be useful.

# ``collate_fn`` can be customized, and it works differently depending on
# whether automatic batching is enabled or disabled. If automatic batching is
# disabled, the default ``collate_fn`` simply converts NumPy arrays into
# PyTorch tensors.
#
# If automatic batching is enabled, ``collate_fn`` collates the input samples
# into batches, which are yielded from the data loader iterator.
#
# The default ``collate_fn`` collates a list of tuples of images and their
# labels into a batched image tensor and a batched label tensor. The image and
# label tensors are batched using the ``torch.stack`` function, which requires
# that the input tensors have equal shapes.
# Note: the input data structure is preserved.
#
# You can also customize ``collate_fn``, which is especially useful when you
# are working with tensors of varying dimensions. Below is an example of how
# this could be done:
# 
#```
#  def collate_fn(batch):
#      images = [sample['image'] for sample in batch]
#      landmarks = [sample['landmarks'] for sample in batch]
#
#      # Pad each image along its last (width) dimension up to the widest
#      # image in the batch, so that ``torch.stack`` can be applied.
#      max_image_size = max(image.size(2) for image in images)
#      padding_sizes = [max_image_size - image.size(2) for image in images]
#      padded_images = torch.stack([
#          torch.nn.functional.pad(image, (0, padding_size))
#          for image, padding_size in zip(images, padding_sizes)
#      ])
#
#      # Pad the landmarks along their last dimension by the same amounts.
#      padded_landmarks = torch.stack([
#          torch.nn.functional.pad(landmark, (0, padding_size))
#          for landmark, padding_size in zip(landmarks, padding_sizes)
#      ])
#
#      return {'image': padded_images, 'landmarks': padded_landmarks}
#```
# This custom ``collate_fn()`` can then be passed as the ``collate_fn`` argument of
# the DataLoader, allowing you to customize batching for your specific use case.
#
# To use parallel data loading, set the ``num_workers`` argument to a value > 0
# to specify the number of loader worker processes.
#
# Warning: to avoid a memory leak when using multiple workers, prefer NumPy
# arrays or PyTorch tensors over native Python objects (such as lists and dicts)
# for the data held and returned by the Dataset's ``__getitem__`` method.
# See issue #13246 for more details on this problem.
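
As a rough sketch of how this would be wired together (the dataset below is an illustrative stand-in for something like the FaceLandmarksDataset in the existing custom-dataset tutorial, and collate_fn refers to the custom function defined above):

```python
import torch
from torch.utils.data import Dataset, DataLoader

# Hypothetical dataset with images of varying widths and matching landmarks.
class VariableWidthDataset(Dataset):
    def __len__(self):
        return 16

    def __getitem__(self, idx):
        width = 60 + (idx % 5)  # width varies from sample to sample
        return {'image': torch.randn(3, 64, width),
                'landmarks': torch.randn(2, width)}

if __name__ == '__main__':
    dataloader = DataLoader(
        VariableWidthDataset(),
        batch_size=4,
        shuffle=True,
        num_workers=2,           # two loader worker processes
        collate_fn=collate_fn,   # the custom collate function defined above
    )

    for batch in dataloader:
        images = batch['image']         # (batch_size, 3, 64, max_width)
        landmarks = batch['landmarks']  # (batch_size, 2, max_width)
```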

zabboud avatar Jun 09 '23 05:06 zabboud

Hey @zabboud, first, thanks a lot for taking the time to work on this! It has been over three years since I opened this issue, but I'm glad that someone is finally working on it :-). Second, my initial request was more of a deep dive into the torch.utils.data.DataLoader class, where a toy dataset would be initialized using the torch.utils.data.Dataset class, and the tutorial would then give different examples of how to modify the parameters of the torch.utils.data.DataLoader class to obtain different behaviors.

For example, a collate_fn could be created, with a description of its batch argument, and num_workers set to 0 to demonstrate non-parallel behavior with a custom collate_fn. Then num_workers could be set to, say, 2, and the tutorial would explain what happens in that case and compare it to the num_workers=0 case in terms of performance. Afterwards, the worker_init_fn argument of the torch.utils.data.DataLoader class could be introduced, with a description of its inputs, followed by an example of how to use it together with the num_workers and collate_fn arguments.
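
A minimal sketch of the kind of comparison I have in mind (the ToyDictDataset, collate_fn, and timing code here are purely illustrative assumptions, not part of the existing tutorial):

```python
import random
import time

import torch
from torch.utils.data import Dataset, DataLoader

# Hypothetical toy dataset standing in for the tutorial's dataset.
class ToyDictDataset(Dataset):
    def __len__(self):
        return 512

    def __getitem__(self, idx):
        return {'image': torch.randn(3, 64, 64),
                'landmarks': torch.randn(68, 2)}

def collate_fn(batch):
    # Plain stacking; a fuller example could pad variable-sized samples instead.
    return {'image': torch.stack([s['image'] for s in batch]),
            'landmarks': torch.stack([s['landmarks'] for s in batch])}

def worker_init_fn(worker_id):
    # Runs once inside each worker process. PyTorch already gives each worker
    # a distinct base seed; a common use of worker_init_fn is to propagate it
    # to other libraries (here, Python's ``random`` module).
    random.seed(torch.initial_seed() % 2**32)

def time_one_epoch(num_workers):
    loader = DataLoader(ToyDictDataset(), batch_size=32,
                        collate_fn=collate_fn, num_workers=num_workers,
                        worker_init_fn=worker_init_fn)
    start = time.time()
    for _ in loader:
        pass
    return time.time() - start

if __name__ == '__main__':
    print('num_workers=0:', time_one_epoch(0))  # single-process loading
    print('num_workers=2:', time_one_epoch(2))  # two worker processes
```

On a toy dataset like this, worker startup overhead can make num_workers=2 slower than num_workers=0; the tutorial could use that to explain when multiple workers actually pay off (e.g. when __getitem__ does real I/O or heavy preprocessing).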

This style of explanation could be adopted for all other parameters of the torch.utils.data.DataLoader class, including pin_memory_device (together with pin_memory=True), generator, and timeout. However, what you wrote is a great first step towards this.
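
For those parameters, something along these lines might work as a starting point (a sketch only; it assumes a CUDA machine and a recent PyTorch release that has the pin_memory_device argument, and it reuses the ToyDictDataset and collate_fn from the previous snippet):

```python
import torch
from torch.utils.data import DataLoader

if __name__ == '__main__':
    # Seeding a dedicated generator makes shuffling (and the base seeds handed
    # to the workers) reproducible across runs.
    g = torch.Generator()
    g.manual_seed(0)

    loader = DataLoader(
        ToyDictDataset(),
        batch_size=32,
        shuffle=True,
        generator=g,               # RNG used for shuffling
        collate_fn=collate_fn,
        num_workers=2,
        pin_memory=True,           # copy batches into page-locked host memory
        pin_memory_device='cuda',  # device the pinned memory is intended for
        timeout=30,                # seconds to wait for a batch from a worker
    )

    for batch in loader:
        # With pinned memory, non_blocking copies can overlap with computation.
        images = batch['image'].to('cuda', non_blocking=True)
```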

mhdadk avatar Jun 10 '23 14:06 mhdadk

@mhdadk Thanks for your prompt reply! I'm thinking about the best way to implement this. It sounds like it might need to be a separate tutorial, with the main tutorial edited to link the user to the torch.utils.data.DataLoader tutorial. What do you think?

zabboud avatar Jun 10 '23 16:06 zabboud

@zabboud I think that would be a good idea! Unfortunately, I am tied up with other things for the next few weeks, so I'm not sure I would be able to help with the code. Nevertheless, feel free to reach out if you would like more feedback. Good luck!

mhdadk avatar Jun 10 '23 16:06 mhdadk

This issue has been unassigned due to inactivity. If you are still planning to work on this, you can still send a PR referencing this issue.

svekars avatar Oct 24 '23 18:10 svekars

/assigntome

xanderex-sid avatar Nov 01 '23 17:11 xanderex-sid

/assigntome

krishnakalyan3 avatar Nov 05 '23 12:11 krishnakalyan3