
Working with larger datasets

Open mainpyp opened this issue 2 years ago • 2 comments

Hi, thank you for this awesome project. I want to apply DiffuSeq to a larger dataset (~17M sentences), but tokenization keeps blowing up my RAM, even though I have 200 GB available! Is there functionality that I am missing that uses cached tokens, or is this work in progress?

Thanks again & best!

mainpyp avatar Mar 22 '23 07:03 mainpyp

Hi, maybe you can try adding keep_in_memory = True to the raw_datasets.map call: https://github.com/Shark-NLP/DiffuSeq/blob/bea43e1fd0a954486bc36ad62f2a71dcb2bd300a/diffuseq/text_datasets.py#L78
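For reference, a rough sketch of what that change could look like; only keep_in_memory=True is the suggested flag, and the surrounding arguments (tokenize_function, batched, remove_columns, desc) are placeholders standing in for whatever the real raw_datasets.map call at that line passes:

    # Sketch only, not the exact DiffuSeq code: the other keyword arguments
    # are assumptions about the existing map call in text_datasets.py.
    tokenized_datasets = raw_datasets.map(
        tokenize_function,               # the project's own tokenization function
        batched=True,
        keep_in_memory=True,             # keep mapped data in memory instead of writing Arrow cache files
        remove_columns=['src', 'trg'],
        desc="Running tokenizer on dataset",
    )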

If that doesn't work, you can try splitting your dataset into separate folds and loading each fold in a different training stage, as in the sketch below.
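A minimal sketch of how that splitting could be done, assuming the data is a JSONL file with one src/trg pair per line; the helper name, file names, and shard count are hypothetical:

    # Hypothetical helper: distribute the lines of a large JSONL file across
    # several smaller shard files, so each training stage only has to load
    # and tokenize one shard at a time.
    def split_jsonl(path, num_shards, out_prefix):
        writers = [open(f"{out_prefix}.shard{i}.jsonl", "w") for i in range(num_shards)]
        with open(path, "r") as f_reader:
            for i, row in enumerate(f_reader):
                writers[i % num_shards].write(row)
        for w in writers:
            w.close()

    # e.g. split a 17M-line training file into 8 folds of roughly 2M lines each
    split_jsonl("train.jsonl", num_shards=8, out_prefix="train")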

summmeer avatar Mar 22 '23 09:03 summmeer

It's still not working properly, but I think that has something to do with padding and my sequence lengths. I have to investigate that further, but thank you for your help! :) I did find a small change that sped up data loading a lot:

https://github.com/Shark-NLP/DiffuSeq/blob/bea43e1fd0a954486bc36ad62f2a71dcb2bd300a/diffuseq/text_datasets.py#L163-L166

Here each line is parsed twice. By parsing it once with json.loads and then accessing both src and trg from the result, I saved a lot of time:

with open(path, 'r') as f_reader:
    for row in f_reader:
        # parse each JSON line once and reuse it for both fields
        line = json.loads(row)
        sentence_lst['src'].append(line['src'].strip())
        sentence_lst['trg'].append(line['trg'].strip())

mainpyp avatar Mar 23 '23 16:03 mainpyp