DMTimg pre-training ablation
Also, in the ablation study in Section 4.2 of the paper, I can't find a clear explanation of why the network needs to be trained in two stages (DMTimg and DMTvid), especially since you state:
Therefore, we train DMTvid from scratch instead of fine-tuning it on DMTimg
Was there any specific issue with unifying the losses to get an end-to-end, single-stage approach?
Please refer to the previous sections (e.g., introduction, related work) to better understand the key motivation (deficiency awareness) of our paper. We have also made extensive efforts to train an end-to-end unified network with both image and video objective functions and datasets. Despite those efforts, achieving the current high performance with such an approach remains challenging.
Yes, I understand the deficiency awareness issue. So it seems to me that the main issue, even if it is not stated explicitly in the paper, is training with a common objective function shared between the deficiency/generative part and the more "classical" video inpainting, right?