DMT DMTimg pre-training ablation

In the paper, including the ablation study in Section 4.2, I can't find a clear explanation of why the network needs to be trained in two stages (DMTimg and DMTvid), especially since you state:

Therefore, we train DMTvid from scratch instead of fine-tuning it on DMTimg

Was there a specific issue that prevented unifying the losses into a single-stage, end-to-end (E2E) approach?

Jul 19 '23 19:07 bhack

Please refer to the earlier sections (e.g., introduction, related work) to better understand the key motivation of our paper (deficiency awareness). We have made extensive efforts to build an end-to-end unified network trained with both image and video objective functions and datasets. However, despite those efforts, matching the current high performance with such an approach remains challenging.

Jul 20 '23 02:07 yeates

Yes, I understand the deficiency-awareness issue. So it seems to me that the main difficulty, even though it is not stated explicitly in the paper, is training with a common objective function that covers both the deficiency/generative part and the more "classical" video inpainting, right?

Jul 20 '23 11:07 bhack