
Performance improvements for data loading process

Open aleSuglia opened this issue 5 years ago • 2 comments

Hi,

First of all, thanks a lot for releasing this codebase to the public. While playing with the code, I realised that it assumes all the data must be loaded into memory before training starts. This approach is unlikely to scale well to very large datasets. A solution could be to define Img2TextDataset as an IterableDataset that supports streaming data, as sketched below.
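For illustration, here is a minimal sketch of what a streaming variant could look like. The class name, file format, and `preprocess_fn` hook are hypothetical, not part of the VLP codebase:

```python
import torch
from torch.utils.data import IterableDataset


class StreamingImg2TextDataset(IterableDataset):
    """Hypothetical streaming variant: yields one example at a time
    instead of materialising the whole corpus in memory."""

    def __init__(self, annotations_path, preprocess_fn):
        self.annotations_path = annotations_path  # assumed line-delimited annotation file
        self.preprocess_fn = preprocess_fn        # e.g. tokenisation + image feature lookup

    def __iter__(self):
        # Examples are produced lazily, so peak memory stays constant
        # regardless of dataset size. (With num_workers > 0, each worker
        # would additionally need to shard the stream to avoid duplicates.)
        with open(self.annotations_path) as f:
            for line in f:
                yield self.preprocess_fn(line)
```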

I've noticed that the current implementation of the dataset already has an __iter__ method. However, there seems to be an issue in the way the elements of a batch are sampled. Specifically, in the seq2seq_loader, every example index in a batch is drawn with randint(0, len(self.ex_list)-1). This is incorrect because randint samples with replacement, so the sampled elements are not guaranteed to be unique (see the sketch after this paragraph).
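A small self-contained demonstration of the difference (the variable names are illustrative):

```python
from random import randint, sample

ex_list = list(range(10))  # stand-in for the dataset's example list
batch_size = 4

# Paraphrase of the current approach: each index is drawn independently,
# so the same example can appear more than once in a batch.
with_replacement = [randint(0, len(ex_list) - 1) for _ in range(batch_size)]

# Sampling without replacement guarantees the indices in a batch are unique.
without_replacement = sample(range(len(ex_list)), batch_size)
```

In a standard PyTorch setup, the idiomatic fix would be to delegate shuffling to torch.utils.data.DataLoader with shuffle=True (equivalently, a RandomSampler), which permutes the indices once per epoch so every example is seen exactly once.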

I might have a fix for this soon, so I can send you a PR if you like :)

Thank you in advance for your answer!

Alessandro

aleSuglia avatar Oct 13 '19 17:10 aleSuglia

Hi @aleSuglia, yes, you're right. With the current implementation (same as in UniLM), the sampled indices are not guaranteed to be unique. I haven't seen this affect training much, but please feel free to send your PR! Thanks.

LuoweiZhou avatar Oct 14 '19 04:10 LuoweiZhou

@aleSuglia I'd be interested in these improvements as well! Please do create a PR!

darkmatter08 avatar Oct 31 '19 03:10 darkmatter08