Tutorial on custom dataloaders (NOT datasets)
I really like this tutorial on custom datasets. However, the `torch.utils.data.DataLoader` class is only briefly mentioned in it:
However, we are losing a lot of features by using a simple for loop to iterate over the data. In particular, we are missing out on:
- Batching the data
- Shuffling the data
- Loading the data in parallel using multiprocessing workers.
`torch.utils.data.DataLoader` is an iterator which provides all these features. Parameters used below should be clear. One parameter of interest is `collate_fn`. You can specify how exactly the samples need to be batched using `collate_fn`. However, the default collate should work fine for most use cases.
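For reference, a minimal sketch of those three features together (the toy dataset here is purely illustrative):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset: 100 samples with 8 features each, plus integer labels.
features = torch.randn(100, 8)
labels = torch.randint(0, 2, (100,))
dataset = TensorDataset(features, labels)

# DataLoader provides batching, shuffling, and (optionally) parallel loading.
loader = DataLoader(dataset, batch_size=16, shuffle=True, num_workers=0)

for batch_features, batch_labels in loader:
    print(batch_features.shape)  # torch.Size([16, 8])
    break
```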
I am aware of this issue and this issue, but neither has led to a tutorial.
I am happy to make a tutorial on custom dataloaders using the `torch.utils.data.DataLoader` class, focusing on how to interface with its parameters, especially the `num_workers` and `collate_fn` parameters. Also, I am not sure if it is possible to inherit from the `torch.utils.data.DataLoader` class, similar to the `torch.utils.data.Dataset` class, so I would appreciate some guidance on this.
This would be my first ever tutorial, so some guidance on formatting would be greatly appreciated.
cc @suraj813 @sekyondaMeta @svekars @carljparker @NicolasHug @kit1980 @subramen
I like the general idea of a tutorial which explains some advanced use cases with custom collate functions or samplers. I'm not sure if it would fit into a single tutorial or should rather be split into separate ones.
CC @jlin27 what do you think?
/assigntome
@mhdadk - As I'm working on this issue, I just wanted to check in with you regarding your initial request. Would this be just to clarify how to customize `collate_fn` and how to use parallel processing with `num_workers` correctly? (Shuffling the data would just be a matter of passing the `shuffle=True` argument, unless you had something else in mind.)
Currently, I have added additional explanation on the usage of `collate_fn` and `num_workers` within the same tutorial. I saw many threads on the PyTorch forums asking for more clarity on the usage of custom `collate_fn` functions, so I would expand on that part. I just wanted to check with you whether the following is okay, or whether more detail would be useful.
# ``collate_fn`` can be customized, and works differently depending on whether
# automatic batching is enabled or disabled. If automatic batching is disabled,
# the default ``collate_fn`` simply converts NumPy arrays to PyTorch tensors.
#
# If automatic batching is enabled, ``collate_fn`` collates the input samples
# into batches, which are yielded from the data loader iterator.
#
# The default ``collate_fn`` collates a list of tuples of images and their labels
# into a batched image tensor and a batched label tensor. The image and label
# tensors are batched using the ``torch.stack`` function, which requires that the
# input tensors be of equal shape.
# Note: the input data structure is preserved.
#
# You can also customize ``collate_fn``; this is especially useful when you are
# working with tensors of varying dimensions. Below is an example of how this
# could be done:
#
# ```python
# def collate_fn(batch):
#     images = [sample['image'] for sample in batch]
#     landmarks = [sample['landmarks'] for sample in batch]
#
#     # Pad every image along its last dimension (width) to the widest image.
#     # Note: ``torch.stack`` still requires the remaining dims to match.
#     max_image_size = max(image.size(2) for image in images)
#     padding_sizes = [max_image_size - image.size(2) for image in images]
#     padded_images = torch.stack([
#         torch.nn.functional.pad(image, (0, pad_size))
#         for image, pad_size in zip(images, padding_sizes)
#     ])
#
#     # Pad the landmarks along their last dimension by the same amounts.
#     padded_landmarks = torch.stack([
#         torch.nn.functional.pad(lm, (0, pad_size))
#         for lm, pad_size in zip(landmarks, padding_sizes)
#     ])
#
#     return {'image': padded_images, 'landmarks': padded_landmarks}
# ```
# This custom ``collate_fn`` function can then be passed to the ``collate_fn`` argument
# of the ``DataLoader``, allowing you to customize batching for your specific use case.
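A runnable sketch of that usage, with a hypothetical toy dataset of variable-width samples (all names here are illustrative, not from the tutorial):

```python
import torch
from torch.nn.functional import pad
from torch.utils.data import DataLoader, Dataset

class VariableWidthDataset(Dataset):
    """Hypothetical dataset whose image samples have different widths."""
    def __init__(self):
        self.widths = [10, 14, 12, 16]

    def __len__(self):
        return len(self.widths)

    def __getitem__(self, idx):
        return {'image': torch.zeros(3, 8, self.widths[idx]),
                'landmarks': torch.zeros(68, 2)}

def collate_fn(batch):
    # Pad every image along its last dimension to the widest in the batch.
    images = [s['image'] for s in batch]
    max_w = max(img.size(2) for img in images)
    padded = torch.stack([pad(img, (0, max_w - img.size(2))) for img in images])
    landmarks = torch.stack([s['landmarks'] for s in batch])
    return {'image': padded, 'landmarks': landmarks}

loader = DataLoader(VariableWidthDataset(), batch_size=4, collate_fn=collate_fn)
batch = next(iter(loader))
print(batch['image'].shape)  # torch.Size([4, 3, 8, 16])
```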
#
# To utilize parallel processing, you can set the ``num_workers`` argument to a value > 0
# to specify the number of data loading worker processes.
#
# Warning: to avoid excessive memory usage, do not return Python lists or dicts from
# the Dataset's ``__getitem__`` method when using the multiprocessing feature; return
# NumPy arrays or PyTorch tensors instead. See issue #13246 for more details.
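A sketch of that recommendation in practice (the dataset class and sizes here are illustrative):

```python
import numpy as np
import torch
from torch.utils.data import Dataset

class TensorReturningDataset(Dataset):
    """Illustrative dataset: raw NumPy storage, tensor outputs."""
    def __init__(self, n=100):
        self.data = np.random.rand(n, 8).astype(np.float32)
        self.labels = np.random.randint(0, 2, size=n)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        # Return tensors (or NumPy arrays) rather than Python lists/dicts,
        # so multiprocessing workers avoid copy-on-write memory growth.
        return torch.from_numpy(self.data[idx]), torch.tensor(self.labels[idx])

x, y = TensorReturningDataset()[0]
print(x.shape, x.dtype)  # torch.Size([8]) torch.float32
```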
Hey @zabboud, first, thanks a lot for taking the time to work on this! It has been over 3 years now since I opened this issue, but I'm glad that someone is finally working on it :-). Second, my initial request was more of a deep dive into the `torch.utils.data.DataLoader` class, where a toy dataset would be initialized using the `torch.utils.data.Dataset` class, and then the tutorial would go on to give different examples of how to modify the parameters of the `torch.utils.data.DataLoader` class to obtain different behaviors.
For example, a `collate_fn` could be created, with a description of its `batch` argument; then `num_workers` is set to `0` to simulate non-parallel behavior with a custom `collate_fn`. Then `num_workers` could be set to `2`, for example, and the tutorial would explain what happens in this case and compare it to the previous case with `num_workers=0` in terms of performance. Afterwards, the `worker_init_fn` argument of the `torch.utils.data.DataLoader` class could be introduced, with a description of its inputs, and then an example is given for how to use it with the `num_workers` and `collate_fn` arguments.
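A sketch of what such a `worker_init_fn` example could look like (the per-worker seeding scheme here is illustrative, not a prescription):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, get_worker_info

def worker_init_fn(worker_id):
    # Runs once in each worker process before it starts loading data.
    info = get_worker_info()
    # Example: give each worker its own reproducible seed.
    torch.manual_seed(info.seed % (2 ** 32))

dataset = TensorDataset(torch.arange(10).float())
loader = DataLoader(dataset, batch_size=2, num_workers=2,
                    worker_init_fn=worker_init_fn)

# Batches are produced by the two worker processes.
seen = torch.cat([batch for (batch,) in loader])
print(sorted(seen.tolist()))  # [0.0, 1.0, ..., 9.0]
```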
This style of explanation could be adopted for all other parameters of the `torch.utils.data.DataLoader` class, including `pin_memory_device` (with `pin_memory=True`), `generator`, and `timeout`. However, what you wrote is a great first step towards this.
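As a sketch, the `generator` argument alone can make shuffling reproducible (the toy dataset is illustrative):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.arange(8).float())

# A dedicated generator makes the shuffled order reproducible.
g = torch.Generator()
g.manual_seed(0)
loader_a = DataLoader(dataset, batch_size=8, shuffle=True, generator=g)
order_a = next(iter(loader_a))[0]

g.manual_seed(0)  # reset the generator: the next pass repeats the same order
loader_b = DataLoader(dataset, batch_size=8, shuffle=True, generator=g)
order_b = next(iter(loader_b))[0]

print(torch.equal(order_a, order_b))  # True
# pin_memory=True (with pin_memory_device) and timeout are likewise plain
# keyword arguments and could each get a short worked example of their own.
```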
@mhdadk Thanks for your prompt reply! I'm thinking about the best way to implement this. It sounds like it might need to be a separate tutorial, with this main tutorial edited to link the user to the `torch.utils.data.DataLoader` tutorial. What do you think?
@zabboud I think that would be a good idea! I am unfortunately tied up with other things at the moment for the next few weeks, so I'm not sure if I would be able to help with the code. Nevertheless, feel free to reach out if you would like some more feedback. Good luck!
This issue has been unassigned due to inactivity. If you are still planning to work on this, you can still send a PR referencing this issue.
/assigntome
/assigntome