Why does the code for Wan2.1 use t_mod to determine whether to cache rather than the timestep embedding?
Hi, @zishen-ucap. Your observation is right. For some models, e.g., Wan2.1 and CogVideoX, the timestep embedding shows better correlation. If you use the timestep embedding taken before 'time_projection', it is unnecessary to set the 'fifth step'. You can try it.
Hi, @LiewFeng, may I ask a question? Why does the code for Wan2.1 use t_mod to determine whether to cache rather than the timestep embedding?
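For readers following the thread, here is a minimal sketch of how a cache decision driven by the timestep embedding could look. It is not the repository's code: `CacheGate`, `rel_l1`, `coefficients`, and `threshold` are hypothetical names, and the accumulate-and-threshold rule is an assumption about the general approach discussed here.

```python
import numpy as np


def rel_l1(curr, prev):
    """Relative L1 distance between two consecutive signals."""
    return float(np.abs(curr - prev).mean() / np.abs(prev).mean())


class CacheGate:
    """Decides per step whether to run the model or reuse the cached output.

    The signal passed in can be any candidate indicator, e.g. the timestep
    embedding taken before the time projection (assumption for illustration).
    """

    def __init__(self, coefficients, threshold):
        # Polynomial that rescales the input difference into an estimated
        # output difference (highest-degree coefficient first).
        self.rescale = np.poly1d(coefficients)
        self.threshold = threshold
        self.prev = None
        self.accum = 0.0

    def should_compute(self, signal):
        if self.prev is None:
            compute = True  # nothing cached yet, so the first step always runs
        else:
            self.accum += float(self.rescale(rel_l1(signal, self.prev)))
            compute = self.accum >= self.threshold
        if compute:
            self.accum = 0.0  # reset once the model is actually evaluated
        self.prev = signal
        return compute
```

With identity coefficients `[1.0, 0.0]`, the gate simply sums the raw relative differences and recomputes once the sum reaches the threshold.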
Originally posted by @fffanty in #39
What does t_mod mean? We use timestep embedding for Wan2.1.
@LiewFeng Sorry, I misunderstood before. Could you share which datasets, and how many samples, you used to generate the coefficients for Wan2.1?
As described in the paper, we sample 70 prompts from T2V-CompBench to generate videos, assessing seven desired attributes of generated videos. 10 prompts are sampled for each attribute.
Same question — why do we even need a timestep embedding if it’s only a function of t and freq_dim, and has nothing to do with the model’s actual input or output?
The paper states: "As for the timestep embedding, it changes as timesteps progress but is independent of the noisy input and text embedding, making it difficult to fully reflect the information of the input. The noisy input, on the other hand, is gradually updated during the denoising process and contains information from the text embedding, but it is not sensitive to timesteps. To comprehensively represent the model inputs and ensure their correlation with the outputs, we ultimately utilized the timestep embedding modulated noisy input at the Transformer’s input stage as the final input embedding"
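To make the distinction in the quoted passage concrete, here is a rough sketch of the two signals. It assumes a standard sinusoidal embedding and a DiT-style shift/scale modulation; `sinusoidal_embedding`, `modulate`, `w_shift`, and `w_scale` are hypothetical names, and the exact layers in Wan2.1 may differ.

```python
import numpy as np


def sinusoidal_embedding(t, freq_dim):
    """Timestep embedding: depends only on t and freq_dim, never on the latent."""
    half = freq_dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    args = t * freqs
    return np.concatenate([np.cos(args), np.sin(args)])


def modulate(x, t_emb, w_shift, w_scale):
    """Timestep-embedding-modulated input: depends on both x and t.

    x:        (tokens, dim) noisy input features
    t_emb:    (freq_dim,) timestep embedding
    w_shift, w_scale: (freq_dim, dim) stand-ins for learned projections
    """
    shift = t_emb @ w_shift
    scale = t_emb @ w_scale
    return x * (1.0 + scale) + shift
```

The embedding changes with every timestep but is identical for any prompt or latent, whereas the modulated input changes with both, which is the property the quoted passage relies on.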
As described in the paper, we sample 70 prompts from T2V-CompBench to generate videos, assessing seven desired attributes of generated videos. 10 prompts are sampled for each attribute.
Fig. 3 shows only three attributes (noisy input, timestep-embedding modulated noisy input, timestep embedding), and the model output should be paired with each of them to do the polyfit. So, were there 4 other attributes that turned out not to be useful and were therefore left out of the paper? Just want to confirm, thanks.
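For context on the polyfit step mentioned above, a minimal sketch of fitting rescaling coefficients from paired differences might look like this. `fit_rescale` is a hypothetical helper, the polynomial degree is an assumption, and the data in the usage lines is synthetic, not measurements from any model.

```python
import numpy as np


def fit_rescale(input_diffs, output_diffs, degree):
    """Fit a polynomial mapping input differences to output differences.

    input_diffs:  relative L1 differences of the chosen input signal
                  between consecutive timesteps
    output_diffs: relative L1 differences of the model output for the
                  same pairs of timesteps
    Returns an np.poly1d with the highest-degree coefficient first.
    """
    coeffs = np.polyfit(np.asarray(input_diffs), np.asarray(output_diffs), degree)
    return np.poly1d(coeffs)


# Synthetic, noise-free example data (for illustration only).
x = np.linspace(0.01, 0.5, 50)
y = 2.0 * x**2 + 0.5 * x
poly = fit_rescale(x, y, degree=2)
```

In practice the pairs would be collected by running the model on the sampled prompts and recording both differences at every step, once per candidate input signal.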