VideoTetris
Clarification of the arxiv paper content
Thanks for the great work and the amazing results! I was reading the arXiv paper and had some difficulty understanding a few key concepts. Could you kindly help me clarify the following?
- I understand from Section 3.1 that, given a user's written prompt, you decompose it spatially and locally, compute the attention scores for each part separately, and then merge them. However, I am not sure where the autoregressive part comes in. In this framework, how is the progressive following ability achieved?
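
To make my current understanding concrete, here is a minimal sketch of the "compute attention separately, then merge" step as I read it. This is only my own illustration, not code from the paper or the repository: the function names (`cross_attention`, `compose_regions`), the binary region masks, and the merge-by-masked-sum rule are all assumptions on my part.

```python
import math

def softmax(row):
    # Numerically stable softmax over a list of scores.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    # Scaled dot-product attention: one output vector per query (latent position).
    scale = 1.0 / math.sqrt(len(keys[0]))
    dim = len(values[0])
    outputs = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[d] for w, v in zip(weights, values))
                        for d in range(dim)])
    return outputs

def compose_regions(queries, sub_prompts, masks):
    # Hypothetical merge rule (my assumption): run attention once per sub-prompt,
    # then keep each result only at the positions its region mask selects.
    # sub_prompts: list of (keys, values) pairs, one per decomposed sub-prompt.
    # masks: list of 0/1 lists, one entry per query position.
    dim = len(sub_prompts[0][1][0])
    merged = [[0.0] * dim for _ in queries]
    for (keys, values), mask in zip(sub_prompts, masks):
        out = cross_attention(queries, keys, values)
        for i, selected in enumerate(mask):
            if selected:
                merged[i] = [a + b for a, b in zip(merged[i], out[i])]
    return merged

# Toy example: two latent positions, two single-token sub-prompts.
queries = [[1.0, 0.0], [0.0, 1.0]]
left = ([[1.0, 0.0]], [[1.0, 0.0]])    # (keys, values) for the first sub-prompt
right = ([[0.0, 1.0]], [[0.0, 1.0]])   # (keys, values) for the second sub-prompt
masks = [[1, 0], [0, 1]]               # position 0 -> left, position 1 -> right
merged = compose_regions(queries, [left, right], masks)
```

Nothing in this sketch depends on previously generated frames, which is exactly why I cannot see where the autoregressive behavior would enter.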
Thank you!