
Why does the code for Wan2.1 use t_mod to determine whether to cache, rather than the timestep embedding?

Open · fffanty opened this issue 7 months ago • 5 comments

Hi, @zishen-ucap. Your observation is right. For some models, the timestep embedding shows better correlation, e.g., Wan2.1 and CogVideoX. With the timestep embedding before 'time_projection', it is unnecessary to set the 'fifth step'. You can try it.

Hi @LiewFeng, may I ask a question? Why does the code for Wan2.1 use t_mod to determine whether to cache, rather than the timestep embedding?

Originally posted by @fffanty in #39

fffanty avatar May 19 '25 09:05 fffanty

What does t_mod mean? We use the timestep embedding for Wan2.1.

LiewFeng avatar May 20 '25 02:05 LiewFeng
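
For readers hitting the same naming confusion: the sketch below separates the two t-only quantities that tend to get conflated in Wan2.1-style code, the timestep embedding and its projection into modulation parameters. All names (sinusoidal_embedding, time_embedding, time_projection) and sizes are illustrative assumptions, not the exact Wan2.1 or TeaCache source; the timestep-embedding modulated noisy input discussed later in the thread is a third, distinct quantity.

```python
# Hedged sketch of two candidate caching indicators in a Wan2.1-style DiT.
# Module names and dimensions are illustrative and may not match the actual
# Wan2.1 / TeaCache repositories.
import torch
import torch.nn as nn

dim, freq_dim = 1536, 256  # assumed sizes for the sketch

def sinusoidal_embedding(t: torch.Tensor, d: int) -> torch.Tensor:
    """Standard sinusoidal embedding of the scalar timestep t."""
    half = d // 2
    freqs = torch.exp(-torch.arange(half, dtype=torch.float32)
                      * torch.log(torch.tensor(10000.0)) / half)
    args = t.float()[:, None] * freqs[None]
    return torch.cat([torch.cos(args), torch.sin(args)], dim=-1)

time_embedding = nn.Sequential(nn.Linear(freq_dim, dim), nn.SiLU(), nn.Linear(dim, dim))
time_projection = nn.Sequential(nn.SiLU(), nn.Linear(dim, 6 * dim))

t = torch.tensor([981])                                    # current denoising timestep
e = time_embedding(sinusoidal_embedding(t, freq_dim))      # "timestep embedding"
t_mod = time_projection(e).unflatten(1, (6, dim))          # per-block modulation params
# Both quantities depend only on t; the thread's question is which of these
# (or the modulated noisy input itself) should be compared across steps.
```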

@LiewFeng Sorry, I misunderstood before. Could you share which datasets, and how many samples, you used to generate the coefficients for Wan2.1?

fffanty avatar May 21 '25 09:05 fffanty

As described in the paper, we sample 70 prompts from T2V-CompBench to generate videos, covering seven desired attributes of generated videos. 10 prompts are sampled for each attribute.

LiewFeng avatar May 25 '25 09:05 LiewFeng
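
For context, a minimal sketch of how such per-model rescaling coefficients could be fitted from those calibration runs, assuming the relative L1 change of the chosen input embedding and of the model output were logged at each denoising step. The file names and the 4th-order fit are illustrative assumptions, not the exact TeaCache pipeline.

```python
# Hedged sketch: fit polynomial rescaling coefficients from logged statistics.
# Assumes two arrays collected while generating the calibration videos:
#   input_rel_l1[i]  = ||emb_t - emb_{t-1}||_1 / ||emb_{t-1}||_1
#   output_rel_l1[i] = ||out_t - out_{t-1}||_1 / ||out_{t-1}||_1
import numpy as np

input_rel_l1 = np.load("input_rel_l1.npy")      # hypothetical logged data
output_rel_l1 = np.load("output_rel_l1.npy")    # hypothetical logged data

# Fit a polynomial mapping input change -> output change (degree assumed here),
# in the spirit of the rescaling step described in the TeaCache paper.
coefficients = np.polyfit(input_rel_l1, output_rel_l1, deg=4)
print(coefficients)

# At inference time the rescaled indicator would then be evaluated as:
rescaled = np.polyval(coefficients, input_rel_l1)
```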

Same question — why do we even need a timestep embedding if it’s only a function of t and freq_dim, and has nothing to do with the model’s actual input or output?

jhl13 avatar Aug 14 '25 02:08 jhl13

The paper states: "As for the timestep embedding, it changes as timesteps progress but is independent of the noisy input and text embedding, making it difficult to fully reflect the information of the input. The noisy input, on the other hand, is gradually updated during the denoising process and contains information from the text embedding, but it is not sensitive to timesteps. To comprehensively represent the model inputs and ensure their correlation with the outputs, we ultimately utilized the timestep embedding modulated noisy input at the Transformer’s input stage as the final input embedding"

jhl13 avatar Aug 14 '25 02:08 jhl13
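
To make the quoted rationale concrete, here is a hedged sketch of a TeaCache-style skip decision that uses the timestep-embedding modulated noisy input as the indicator. The class, default threshold, and reset logic are illustrative assumptions, not the repository's exact implementation.

```python
# Hedged sketch of a TeaCache-style skip decision inside the DiT forward pass.
# `modulated_inp` is the timestep-embedding modulated noisy input at the
# transformer's input stage; all other names and values are illustrative.
import numpy as np
import torch

class TeaCacheState:
    def __init__(self, coefficients, rel_l1_thresh=0.2):
        self.coefficients = coefficients      # polynomial from calibration
        self.rel_l1_thresh = rel_l1_thresh    # larger -> more steps skipped
        self.prev_modulated_inp = None
        self.accumulated_distance = 0.0
        self.cached_residual = None           # output - input from last full step

    def should_compute(self, modulated_inp: torch.Tensor) -> bool:
        if self.prev_modulated_inp is None or self.cached_residual is None:
            decision = True                   # always compute the first step
        else:
            rel_l1 = ((modulated_inp - self.prev_modulated_inp).abs().mean()
                      / self.prev_modulated_inp.abs().mean()).item()
            # Rescale the input change into an estimate of the output change.
            self.accumulated_distance += float(np.polyval(self.coefficients, rel_l1))
            decision = self.accumulated_distance >= self.rel_l1_thresh
        if decision:
            self.accumulated_distance = 0.0   # reset after a full forward pass
        self.prev_modulated_inp = modulated_inp.detach()
        return decision

# Usage inside the forward pass (pseudocode):
#   if state.should_compute(modulated_inp):
#       out = run_transformer_blocks(x, ...)
#       state.cached_residual = out - x
#   else:
#       out = x + state.cached_residual
```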

As described in the paper, we sample 70 prompts from T2V-CompBench to generate videos, covering seven desired attributes of generated videos. 10 prompts are sampled for each attribute.

Fig. 3 has only three attributes (noisy input, timestep-embedding modulated noisy input, timestep embedding), and the model output should be used with each attribute to do the polyfit. So were there four other attributes that were found not to be useful and therefore not included in the paper? Just want to confirm the info, thanks.

compatiblewaterfire avatar Dec 08 '25 03:12 compatiblewaterfire