DataLoaders.jl

Why no custom collate?

Open alok opened this issue 3 years ago • 2 comments

I find them pretty handy in PyTorch.

alok, Apr 16 '21 08:04

Don't see a reason why there can't be. We'll just need to update BatchViewCollated to accept a user collate function.

darsnack, Apr 16 '21 13:04

As Kyle pointed out, this will not be quite as straightforward if we want to support in-place data loading for custom collate functions. Below is a sketch of a possible solution, depending on the use case for it.

Currently a batch is recursively defined as either:

  • an AbstractArray with one dimension being the batch size
  • a Tuple of batches
  • a NamedTuple of batches
  • a Dict of batches
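To make the recursive definition concrete, here is a minimal sketch of a predicate that follows it. `isbatch` is a hypothetical helper written for illustration, not a function from DataLoaders.jl, and it does not check that the batch dimensions of the nested arrays agree.

```julia
# Hypothetical helper (not part of DataLoaders.jl): mirrors the recursive
# definition of a batch given in the list above.
isbatch(x::AbstractArray) = true                     # base case: an array
isbatch(x::Tuple) = all(isbatch, x)                  # Tuple of batches
isbatch(x::NamedTuple) = all(isbatch, values(x))     # NamedTuple of batches
isbatch(x::Dict) = all(isbatch, values(x))           # Dict of batches
isbatch(x) = false                                   # anything else is not a batch

xs = rand(Float32, 4, 8)   # 8 observations with 4 features each
ys = rand(1:10, 8)         # 8 labels
```

Under this sketch, `(xs, ys)`, `(x = xs, y = ys)` and `Dict(:x => xs)` all count as batches, while a tuple that contains a `String` does not.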

Importantly, getobs! is a property of the data container, not of BatchViewCollated. Let's say we have a data container DC with observations of type O, so that we have: getobs(::DC, idx)::O and getobs!(::O, ::DC, idx)::O.
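As a rough illustration of that pair of signatures, the sketch below uses a toy container where DC is `ColumnContainer` and O is `Vector{Float32}`. Both names are made up for this example, and `getobs` / `getobs!` are defined here as standalone functions; in real code they would presumably be methods added to the package's own functions.

```julia
# Toy data container (hypothetical): each column of `data` is one observation.
struct ColumnContainer
    data::Matrix{Float32}
end

# Standalone stand-in for getobs(::DC, idx)::O; allocates a new vector.
getobs(dc::ColumnContainer, idx::Int) = dc.data[:, idx]

# Standalone stand-in for getobs!(::O, ::DC, idx)::O; reuses the given buffer.
function getobs!(buf::Vector{Float32}, dc::ColumnContainer, idx::Int)
    copyto!(buf, view(dc.data, :, idx))   # overwrite the buffer in place
    return buf
end

dc = ColumnContainer(Float32[1 2 3; 4 5 6])
buf = zeros(Float32, 2)
getobs!(buf, dc, 3)   # fills `buf` with the third observation, no new allocation
```

The point of the in-place variant is that the caller owns `buf`, so the same memory can be reused for every observation that is loaded.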

The question is what you want to achieve with a custom collatefn. If you want to return custom data types as batches, then the following would work:

  • have a custom collatefn that returns batches of type B
  • define a method of DataLoaders.obsslices(::B, ::DataLoaders.BatchDim) that returns an iterator over views of type O. For example, if O is an array type, then it should return array views.
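A minimal sketch of the two steps above, assuming observations of type `Vector{Float32}`. `FeatureBatch` and `mycollate` are hypothetical names, and `obsslices` is defined here as a standalone one-argument function; the real method would be `DataLoaders.obsslices(::B, ::DataLoaders.BatchDim)` as described above, whose exact details are not shown here.

```julia
# Hypothetical custom batch type B: one column per observation.
struct FeatureBatch
    features::Matrix{Float32}
end

# Custom collate function: a vector of observations becomes a FeatureBatch.
mycollate(obs::Vector{Vector{Float32}}) = FeatureBatch(reduce(hcat, obs))

# Stand-in for the obsslices method: an iterator over views into the batch,
# one view per observation. Each view behaves like an observation of type O.
obsslices(b::FeatureBatch) = eachcol(b.features)

batch = mycollate([Float32[1, 2], Float32[3, 4]])

# In-place loading can then write each observation through its view,
# which updates the memory of the existing batch instead of allocating.
for (i, slice) in enumerate(obsslices(batch))
    slice .= Float32(10i)
end
```

Because the slices are views rather than copies, a buffered loader could fill an existing batch of type B with `getobs!` without knowing anything else about B.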

Of course, if we don't want to support buffering and custom collate functions (as is the case in PyTorch if I'm not mistaken), we could simply make buffered and collatefn arguments on DataLoader mutually exclusive.
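The mutually exclusive option could look roughly like the check below. `checkoptions` is a hypothetical function for illustration only; the keyword names `buffered` and `collatefn` are taken from the sentence above, and nothing here reflects the actual DataLoader constructor.

```julia
# Hypothetical argument check: reject buffering combined with a custom collatefn.
function checkoptions(; buffered::Bool = false, collatefn = nothing)
    if buffered && collatefn !== nothing
        throw(ArgumentError("buffered = true cannot be combined with a custom collatefn"))
    end
    return (buffered = buffered, collatefn = collatefn)
end

checkoptions(buffered = true)          # fine: default collation, buffered
checkoptions(collatefn = identity)     # fine: custom collation, unbuffered
```

This trades flexibility for simplicity: custom batch types need no `obsslices` method, at the cost of allocating a new batch on every iteration.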

lorenzoh, Apr 22 '21 10:04