OpenNMT-py
Load and keep small dataset in memory when training
I'm experimenting with training on varying numbers of training examples to see the effect on accuracy. It's a very easy task, so it makes sense to use even as few as 69 examples.
When I use such a small number of examples, the training script just reloads the dataset over and over.
Is there a way to tell the script to keep the data in memory? My GPU utilization is very low, and I believe it's because most of the time is spent loading data.
Indeed, using so little data is quite an edge case here.
Since you're using the legacy version (< 2), you may try modifying the DatasetLazyIter class to retain your data in memory instead of loading it again and again.
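For reference, a minimal sketch of that caching idea, assuming the legacy iterator reloads each shard from a .pt file with torch.load on every pass; the _SHARD_CACHE dict and _cached_load helper are hypothetical names, not part of OpenNMT-py:

```python
import torch

# Hypothetical module-level cache: keep each deserialized shard in memory so
# repeated passes over a tiny dataset don't hit the disk every time.
_SHARD_CACHE = {}

def _cached_load(path):
    """Load a .pt shard once and reuse it on later calls."""
    if path not in _SHARD_CACHE:
        _SHARD_CACHE[path] = torch.load(path)
    return _SHARD_CACHE[path]

# Inside DatasetLazyIter, the shard lookup would then become something like
#     cur_dataset = _cached_load(path)
# instead of calling torch.load(path) directly.
```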
If you're planning on switching to 2.0, you may want to have a look at the ParallelCorpus class. (This one may be easier to modify if you're not comfortable with the codebase.)
If you're willing to make this configurable through a flag without breaking current functionality, we'd be happy to accept a PR, as it may be useful for others.
Hi, I have the same question. How did you finally solve this problem? Thanks!
@Zhw098 as stated above, the easiest solution is to modify the ParallelCorpus class to add some caching: e.g. in the load method, read the files only once, store the result in some self.cache attribute, and reuse it on subsequent calls. Something along the lines of the sketch below.
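A rough sketch of that, assuming ParallelCorpus.load() is a generator that yields example dicts (the exact import path and signature may differ between releases); CachedParallelCorpus and self._cache are illustrative names, not existing OpenNMT-py code:

```python
from onmt.inputters.corpus import ParallelCorpus  # import path may differ by version

class CachedParallelCorpus(ParallelCorpus):
    """Sketch: read the corpus files once, then replay the examples from memory.

    Assumes load() is a generator over example dicts and that, for a tiny
    single-shard corpus, the offset/stride arguments can safely be ignored
    when reusing the cache.
    """

    def load(self, *args, **kwargs):
        if not hasattr(self, "_cache"):
            # First call: materialize the parent generator into a list.
            self._cache = list(super().load(*args, **kwargs))
        # Later calls replay the in-memory examples instead of re-reading files.
        yield from self._cache
```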
Otherwise, if you're not comfortable with Python, you can always build a bigger version of your dataset by concatenating it N times so that it is reloaded less frequently...
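If you go that route, a quick one-off script along these lines (file names are placeholders) duplicates both sides of the corpus the same way, so source and target stay aligned:

```python
# Duplicate a tiny parallel corpus N times so the training script reloads it
# far less often. "train.src" / "train.tgt" are hypothetical file names.
N = 100
for name in ("train.src", "train.tgt"):
    with open(name, encoding="utf-8") as f:
        data = f.read()
    if data and not data.endswith("\n"):
        data += "\n"  # make sure repetitions don't merge two lines into one
    with open(f"{name}.x{N}", "w", encoding="utf-8") as out:
        out.write(data * N)
```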