
Load and keep small dataset in memory when training

AmitMY opened this issue 4 years ago

I'm experimenting with training on varying numbers of training examples to see the effect on accuracy. It's a very easy task, so it makes sense to use as few as 69 examples.

When I use such a small number of examples, the training script just reloads the dataset over and over.

Is there a way to tell the script to keep the data in memory? My GPU utilization is very low, and I believe it's because most of the time is spent loading data.

[screenshot showing low GPU utilization]

AmitMY · Nov 13 '20

Indeed, using so little data is quite an edge case here. Since you're using the legacy version (< 2.0), you may try modifying the DatasetLazyIter class to retain your data in memory instead of loading it again and again. If you're planning on switching to 2.0, you may want to have a look at the ParallelCorpus class instead. (That one may be easier to modify if you're not comfortable with the codebase.)

If you're willing to make this configurable through a flag without breaking current functionality, we'd gladly accept a PR, as this may be useful to others.

francoishernandez · Nov 13 '20

Hi, I have the same question. How did you finally solve this problem? Thanks!

Zhw098 · Nov 30 '21

@Zhw098 As stated above, the easiest solution is to modify the ParallelCorpus class to add some caching: e.g., in the load method, read the file only once, store the examples in something like self.cache, and reuse them on subsequent calls.
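
Here is what that could look like as a subclass. This is a minimal sketch, not a tested patch: it assumes the OpenNMT-py 2.x layout where ParallelCorpus lives in onmt.inputters.corpus and its load() method is a generator over example dicts with (offset, stride) sharding; the CachedParallelCorpus name and _cache attribute are illustrative.

```python
# Minimal sketch (untested): read the corpus files once, then replay
# the examples from memory on every subsequent load() call.
# Assumes OpenNMT-py 2.x, where ParallelCorpus.load(offset, stride)
# is a generator of example dicts.
from onmt.inputters.corpus import ParallelCorpus


class CachedParallelCorpus(ParallelCorpus):
    """ParallelCorpus that reads its files once and replays from memory."""

    def load(self, offset=0, stride=1):
        if not hasattr(self, "_cache"):
            self._cache = {}
        key = (offset, stride)
        if key not in self._cache:
            # First call for this shard: materialize the generator.
            self._cache[key] = list(super().load(offset, stride))
        # Later calls: yield from memory, with no file I/O.
        yield from self._cache[key]
```

Keying the cache on (offset, stride) keeps the sketch correct if the corpus is sharded across multiple loader workers, since each worker sees a different slice of the data.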

Otherwise, if you're not comfortable with Python, you can always make a bigger version of your dataset by concatenating it N times so that it reloads less frequently...
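
For reference, that duplication takes a couple of lines (a shell loop with cat would work just as well); the file names and N below are placeholders for your own setup:

```python
# Illustrative one-off helper: write each side of the corpus N times
# so one pass over the files covers N copies of the data.
# Assumes each input file ends with a trailing newline.
N = 100
for name in ("train.src", "train.tgt"):
    with open(name, encoding="utf-8") as f:
        data = f.read()
    with open(f"{name}.x{N}", "w", encoding="utf-8") as f:
        f.write(data * N)
```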

francoishernandez · Nov 30 '21