VLP
Performance improvements for data loading process
Hi,
First of all, thanks a lot for releasing this codebase to the public. While playing with the code, I realised that you assume all the data must be loaded into memory before the training process starts. This approach might not scale well to very large datasets. A solution could be defining the `Img2TextDataset` as an `IterableDataset` that supports streams of data.
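As a rough illustration of what I mean, here is a minimal sketch of a streaming dataset (the `StreamingImg2TextDataset` name, the JSONL file format, and the field names are just assumptions for the example, not the actual VLP data layout):

```python
import json
from torch.utils.data import IterableDataset, DataLoader

class StreamingImg2TextDataset(IterableDataset):
    """Streams examples from disk instead of materialising them in memory.

    Assumes one JSON example per line; the real on-disk format may
    differ, so treat the field names below as placeholders.
    """

    def __init__(self, path):
        super().__init__()
        self.path = path

    def __iter__(self):
        # Read lazily: only one line is held in memory at a time.
        with open(self.path) as f:
            for line in f:
                example = json.loads(line)
                # Hypothetical fields; the real Img2TextDataset would
                # plug in its image-feature loading and tokenisation here.
                yield example["img_path"], example["caption"]

# The DataLoader then pulls examples lazily, one batch at a time:
# loader = DataLoader(StreamingImg2TextDataset("train.jsonl"), batch_size=32)
```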
I've noticed that the current implementation of the dataset already has an `__iter__` method. However, it seems to me there might be an issue in the way you sample the elements of a given batch. Specifically, in the seq2seq_loader, for every batch you call `randint(0, len(self.ex_list)-1)` to sample each example index. This is incorrect because `randint` samples with replacement, so it does not guarantee that the sampled elements are unique.
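To make the point concrete, here is a small self-contained sketch contrasting the two sampling strategies (`ex_list` here is just a stand-in for `self.ex_list`, not the actual loader code):

```python
import random

ex_list = list(range(1000))   # stand-in for self.ex_list
batch_size = 32

# Current pattern: randint samples WITH replacement, so the same
# index can appear more than once in a batch.
batch = [ex_list[random.randint(0, len(ex_list) - 1)]
         for _ in range(batch_size)]

# Possible fix: random.sample draws WITHOUT replacement, so every
# index in the batch is unique.
indices = random.sample(range(len(ex_list)), batch_size)
batch = [ex_list[i] for i in indices]
```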
I might have a fix for this soon, so I can send you a PR if you like :)
Thank you in advance for your answer!
Alessandro
Hi @aleSuglia, yes, you're right. With the current implementation (same for UniLM), the sampled examples are not guaranteed to be unique. I have not seen this affect training much, but please feel free to send your PR! Thanks.
@aleSuglia I'd be interested in these improvements as well! Please do create a PR!