DiffuSeq
Working with larger datasets
Hi, thank you for this awesome project. I want to apply DiffuSeq to a larger dataset (~17M sentences), but the tokenization keeps blowing up my RAM, even though I have 200GB available! Is there functionality I am missing that uses cached tokens, or is this a work in progress?
Thanks again & best!
Hi,
Maybe you can try adding keep_in_memory=True to the raw_datasets.map call:
https://github.com/Shark-NLP/DiffuSeq/blob/bea43e1fd0a954486bc36ad62f2a71dcb2bd300a/diffuseq/text_datasets.py#L78
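For reference, a minimal sketch of what that could look like with the Hugging Face datasets library (the tokenize_function and the tiny inline dataset below are placeholders, not the exact DiffuSeq code):

from datasets import Dataset

def tokenize_function(examples):
    # placeholder tokenization; DiffuSeq builds input_ids from its own vocab_dict here
    return {'input_ids': [[ord(c) for c in s] for s in examples['src']]}

raw_datasets = Dataset.from_dict({'src': ['hello world'], 'trg': ['a target sentence']})

tokenized_datasets = raw_datasets.map(
    tokenize_function,
    batched=True,
    keep_in_memory=True,  # keep the mapped result in RAM instead of writing an Arrow cache file
    remove_columns=['src', 'trg'],
)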
If that doesn't work, you can try splitting your dataset into separate folds and loading them in different training stages.
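One way to do that split, assuming the data is a JSONL file with src/trg fields (the fold count and file names here are just illustrative):

import json

num_folds = 4  # hypothetical number of folds
with open('train.jsonl', 'r') as f_reader:
    writers = [open(f'train.fold{i}.jsonl', 'w') for i in range(num_folds)]
    try:
        for idx, row in enumerate(f_reader):
            # round-robin assignment so each fold ends up roughly the same size
            writers[idx % num_folds].write(row)
    finally:
        for w in writers:
            w.close()

Each fold can then be loaded as its own smaller dataset in successive training runs.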
It's still not working properly, but I think that has something to do with padding and my sequence lengths. I have to investigate that further, but thank you for your help! :) I found a small thing that accelerated data loading a lot:
https://github.com/Shark-NLP/DiffuSeq/blob/bea43e1fd0a954486bc36ad62f2a71dcb2bd300a/diffuseq/text_datasets.py#L163-L166
Here each line is parsed with json.loads twice. By parsing it once and then accessing src and trg from the result, I saved a lot of time.
with open(path, 'r') as f_reader:
    for row in f_reader:
        # parse each JSON line once instead of calling json.loads twice per row
        line = json.loads(row)
        sentence_lst['src'].append(line['src'].strip())
        sentence_lst['trg'].append(line['trg'].strip())