Multiple calls to transforms warm_up()
@Zenglinxiao When you implemented #1912 you added setstate / getstate logics for multiprocessing.
If I am not mistaken, @anderleich / @panosk faced the same issue. Here is what is happening:
When using build_vocab.py, there is a call to make_transforms() in the main process, and then we spawn n_threads processes. Because we pass the transforms created in the main process, the pickling/unpickling mechanism triggers another call to warm_up() in __setstate__, hence we could avoid the first call to warm_up() in make_transforms().
Even with n_threads=1 we still spawn another process, so the behavior is the same.
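A minimal sketch of the mechanism (with a hypothetical `Transform` class, not the actual OpenNMT-py one): warming up in the main process and then pickling to a worker ends up doing the work twice, because `__setstate__` warms up again on unpickle.

```python
import pickle

class Transform:
    """Hypothetical transform: __setstate__ reloads its resource on unpickle."""
    def __init__(self):
        self.warm_up_calls = 0
        self.model = None  # stands in for a heavy resource, e.g. a BPE model

    def warm_up(self):
        self.warm_up_calls += 1
        self.model = "loaded"

    def __getstate__(self):
        # drop the heavy/unpicklable resource before sending to the worker
        state = self.__dict__.copy()
        state["model"] = None
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self.warm_up()  # reload the resource in the worker process

t = Transform()
t.warm_up()                            # first call, in the main process
clone = pickle.loads(pickle.dumps(t))  # simulates spawning a worker
print(t.warm_up_calls, clone.warm_up_calls)  # → 1 2
```

The worker's copy has been warmed up twice in total: once in the main process (wasted, since that state is discarded by `__getstate__`) and once on unpickling.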
When we train, the story is a little different.
If we use num_workers=0, the DataLoader is not used and everything happens in the main process, hence calling warm_up() is required somewhere (currently in the make_transforms() call of build_dynamic_dataset_iter).
If num_workers>0, we fall back into the same situation as in build_vocab.
What do you think would be the best approach to avoid the double warm_up? It is quite annoying for transforms that load big resources.
cc @francoishernandez
Hi @vince62s, sorry for the late reply due to limited bandwidth.
The transforms will warm up twice in the build_vocab step, but not in the training step. That was already the case before version 3.0; we did not bother optimizing build_vocab since it was not an urgent issue at the time.
The changes introduced in release 3.0, switching to a DataLoader, make things different: https://github.com/OpenNMT/OpenNMT-py/blob/fcc205ba43489315df265fedb39e91f08ed1f414/onmt/inputters/dynamic_iterator.py#L346-L351
The data_iter is already initialized with transform objects. Since num_workers is passed to the DataLoader, it spawns other processes for multiprocessing, and the transform objects are then pickled and unpickled through the __getstate__ and __setstate__ methods.
To mitigate this issue, we should only initialize/warm up transform objects in the processes where they will actually be used, to avoid pickling and unpickling them.
One way we could try is to delay make_transforms to _init_datasets. https://github.com/OpenNMT/OpenNMT-py/blob/fcc205ba43489315df265fedb39e91f08ed1f414/onmt/inputters/dynamic_iterator.py#L185
```python
def build_dynamic_dataset_iter(...):
    ...
    # transforms = make_transforms(opt, transforms_cls, vocabs)
    # delay this to DynamicDatasetIter._init_datasets
    corpora = get_corpora(opt, task)
    if corpora is None:
        assert task != CorpusTask.TRAIN, "only valid corpus is ignorable."
        return None
    data_iter = DynamicDatasetIter.from_opt(
        corpora, transforms_cls,  # pass class rather than object
        vocabs, opt, task, copy=copy,
        stride=stride, offset=offset)
    ...
```
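A rough sketch of what that delayed construction could look like (hypothetical stand-ins for `make_transforms` and the transform classes, not the actual OpenNMT-py code): only the classes are held at construction time, and the instances are built and warmed up lazily in the process that consumes the data.

```python
def make_transforms(opt, transforms_cls, vocabs):
    """Hypothetical stand-in for onmt's make_transforms: build, then warm up."""
    transforms = {name: cls(opt) for name, cls in transforms_cls.items()}
    for t in transforms.values():
        t.warm_up(vocabs)
    return transforms

class DummyTransform:
    """Hypothetical transform that records whether it was warmed up."""
    def __init__(self, opt):
        self.warmed = False

    def warm_up(self, vocabs):
        self.warmed = True

class DynamicDatasetIter:
    """Sketch: keep only transform *classes* at construction; build lazily."""
    def __init__(self, corpora, transforms_cls, vocabs, opt):
        self.corpora = corpora
        self.transforms_cls = transforms_cls  # classes are cheap to pickle
        self.vocabs = vocabs
        self.opt = opt
        self.transforms = None  # built lazily in the consuming process

    def _init_datasets(self):
        # runs in the process that actually uses the data, so warm_up
        # happens exactly once, where the transforms are needed
        if self.transforms is None:
            self.transforms = make_transforms(
                self.opt, self.transforms_cls, self.vocabs)

it = DynamicDatasetIter(None, {"dummy": DummyTransform}, None, None)
it._init_datasets()
assert it.transforms["dummy"].warmed
```

Because only `transforms_cls` (plain classes) crosses the process boundary, there is nothing heavy to pickle and no second warm_up.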
I really thought your bandwidth was unlimited :) Thanks for your feedback, appreciated.
There is an issue with your solution: we need access to "opt" to build the transforms from the classes.
The way we build data_iter with its from_opt method makes it difficult (or not very natural) to attach the full opt attribute to the DynamicDatasetIter.
Instead, what if we remove warm_up from make_transforms() and call it manually only in the case of num_workers=0?
In all other cases, __setstate__ will do the job.
Would that work?
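A minimal sketch of that alternative (hypothetical names and signatures, not the actual OpenNMT-py code): make_transforms no longer warms up, the caller warms up only in the main-process path, and the worker path relies on `__setstate__` firing once on unpickle.

```python
import pickle

class DummyTransform:
    """Hypothetical transform counting warm_up calls; real ones load models."""
    def __init__(self, opt):
        self.warm_up_calls = 0

    def warm_up(self, vocabs=None):
        self.warm_up_calls += 1

    def __setstate__(self, state):
        self.__dict__.update(state)
        self.warm_up()  # a worker warms up when it unpickles the transform

def make_transforms(opt, transforms_cls, vocabs):
    # proposal: build only, no warm_up here anymore
    return {name: cls(opt) for name, cls in transforms_cls.items()}

# num_workers == 0 path: everything in the main process, warm up manually
transforms = make_transforms(None, {"d": DummyTransform}, None)
for t in transforms.values():
    t.warm_up()
assert transforms["d"].warm_up_calls == 1  # exactly once

# num_workers > 0 path: skip the manual call; pickling to the worker
# triggers __setstate__, which performs the single warm_up
fresh = make_transforms(None, {"d": DummyTransform}, None)
worker_copy = pickle.loads(pickle.dumps(fresh["d"]))
assert worker_copy.warm_up_calls == 1  # exactly once, in the "worker"
```

Either way each transform is warmed up exactly once, in the process that uses it.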