Samit

7 comments by Samit

"we consider the first frame of a video to be an image..." I see, so the first frame is always encoded from k-1 repeated copies of the first frame. But for upsampling, the...

> Sorry about that. We merge that to fix this bug. Thanks. By the way, since the computation logic has changed, the model may require re-training.

+1. Looking forward to the open-source release of the text2video model.

I see. So the attention map complexity will be (H*W*T)^2. Is that feasible for long-video training? Are there any generation results from the training code? (Loss curve in diffusion model...
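To put the (H*W*T)^2 figure in concrete terms, here is a rough sketch that counts the entries of one full spatio-temporal attention map. The 32 x 32 token grid, the frame counts, and the 2 bytes per entry (fp16) are illustrative assumptions, not values from the project, and the function names are hypothetical.

```python
def attention_map_entries(height, width, frames):
    """Entries in one full self-attention map over H*W*T tokens."""
    tokens = height * width * frames
    return tokens * tokens


def attention_map_bytes(height, width, frames, bytes_per_entry=2):
    """Memory for one map if it were materialized (assumes fp16 by default)."""
    return attention_map_entries(height, width, frames) * bytes_per_entry


if __name__ == "__main__":
    # Assumed latent grid of 32 x 32 tokens per frame; T varies.
    for frames in (1, 16, 64):
        entries = attention_map_entries(32, 32, frames)
        gib = attention_map_bytes(32, 32, frames) / 2**30
        print(f"T={frames:3d}: {entries:,} entries, {gib:.3f} GiB per head")
```

Because the token count is squared, multiplying T by 4 multiplies the map size by 16. This counts a naively materialized map only; it says nothing about what the training code actually allocates.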

Please add accuracy and performance comparisons against ViT to the README.

Please report the results for the CRNN server version and upload the checkpoint and MindIR files.

Thanks. Checkpoint saving: save a ckpt at the end of each epoch. Either a last_k or a top_k saving strategy can be selected for this.
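One way the last_k / top_k choice could work is sketched below. `CheckpointKeeper`, its fields, and the "higher metric is better" rule for top_k are all hypothetical assumptions for illustration; this is not the project's API, and it only returns the paths to delete rather than touching the filesystem.

```python
from dataclasses import dataclass, field


@dataclass
class CheckpointKeeper:
    """Hypothetical helper that decides which per-epoch checkpoints to keep."""
    k: int
    policy: str = "last_k"  # "last_k" or "top_k"
    kept: list = field(default_factory=list)  # (epoch, metric, path) tuples

    def __post_init__(self):
        if self.policy not in ("last_k", "top_k"):
            raise ValueError(f"unknown policy: {self.policy}")

    def on_epoch_end(self, epoch, metric, path):
        """Register this epoch's checkpoint; return paths that can be deleted."""
        self.kept.append((epoch, metric, path))
        if self.policy == "last_k":
            self.kept.sort(key=lambda r: r[0])  # oldest epoch first
        else:
            self.kept.sort(key=lambda r: r[1])  # lowest metric first (assumed worst)
        n_drop = max(0, len(self.kept) - self.k)
        dropped, self.kept = self.kept[:n_drop], self.kept[n_drop:]
        return [p for _, _, p in dropped]


if __name__ == "__main__":
    keeper = CheckpointKeeper(k=2, policy="top_k")
    for epoch, acc in enumerate([0.9, 0.5, 0.7], start=1):
        stale = keeper.on_epoch_end(epoch, acc, f"epoch_{epoch}.ckpt")
        print(epoch, "delete:", stale)
```

With last_k the oldest checkpoint is dropped once more than k exist; with top_k the one with the lowest metric is dropped instead, so an early but strong checkpoint can survive.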