Tune-A-Video
Question about training loss.
Thank you for your excellent work, which has been very inspiring to me.
I have a question about the loss function used to fine-tune the network. The paper says fine-tuning uses 'the same training objective in standard LDMs'. However, Figure 4 of the paper states that the network uses a pixel-wise reconstruction loss, which appears to be computed between the input video and the reconstructed video rather than on the predicted noise. Could you please clarify whether I am misunderstanding something?
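To make the two readings concrete, here is a minimal sketch in plain Python. It is not taken from the Tune-A-Video code; the helper names (`add_noise`, `predict_z0`, `mse`), the flat list standing in for a video latent, and the value of `alpha_bar_t` are all assumptions for illustration. It compares the standard noise-prediction loss with a reconstruction loss computed on the clean latent implied by the predicted noise. Under the usual forward process z_t = sqrt(alpha_bar_t) * z_0 + sqrt(1 - alpha_bar_t) * eps, the two differ only by a timestep-dependent factor, which might be one way to reconcile the text with the figure, though only the authors can confirm what they meant.

```python
import math
import random


def add_noise(z0, eps, alpha_bar_t):
    """Forward diffusion: z_t = sqrt(alpha_bar_t) * z0 + sqrt(1 - alpha_bar_t) * eps."""
    a, b = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    return [a * z + b * e for z, e in zip(z0, eps)]


def predict_z0(z_t, eps_pred, alpha_bar_t):
    """Clean latent implied by a noise prediction (inverts add_noise)."""
    a, b = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    return [(z - b * e) / a for z, e in zip(z_t, eps_pred)]


def mse(xs, ys):
    """Mean squared error between two equally sized sequences."""
    return sum((x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


random.seed(0)
n = 64                                   # hypothetical flattened latent size
z0 = [random.gauss(0, 1) for _ in range(n)]
eps = [random.gauss(0, 1) for _ in range(n)]
alpha_bar_t = 0.7                        # assumed cumulative noise-schedule value
z_t = add_noise(z0, eps, alpha_bar_t)

# Stand-in for the network output: the true noise plus a small error.
eps_pred = [e + 0.1 * random.gauss(0, 1) for e in eps]

loss_eps = mse(eps_pred, eps)                                # noise-prediction loss
loss_rec = mse(predict_z0(z_t, eps_pred, alpha_bar_t), z0)   # latent reconstruction loss
print(loss_eps, loss_rec, loss_rec / loss_eps)
```

The printed ratio equals (1 - alpha_bar_t) / alpha_bar_t, so minimizing one loss minimizes the other at a fixed timestep. Note that this holds in latent space only; a loss on decoded pixels would additionally pass through the VAE decoder and is not equivalent in general.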
I think that is probably how their fine-tuned network is trained.
I have the same question! Please help!