Diff-Harmonization

Question about latent range in DDIM inversion: X_T or X_{T-1}?

Donus-S opened this issue 8 months ago · 1 comment

Dear Authors,

Thank you for your great work.

I have a question regarding the following DDIM inversion code:

diff_harmon.py, lines 268-287:

for t in tqdm(timesteps[:-1], desc="DDIM_inverse"):
    latents_input = torch.cat([latents] * 2)
    noise_pred = model.unet(latents_input, t, encoder_hidden_states=context)["sample"]
    noise_pred_uncond, noise_prediction_text = noise_pred.chunk(2)
    noise_pred = noise_pred_uncond + guidance_scale * (noise_prediction_text - noise_pred_uncond)

    next_timestep = t + model.scheduler.config.num_train_timesteps // model.scheduler.num_inference_steps
    alpha_bar_next = model.scheduler.alphas_cumprod[next_timestep] \
        if next_timestep <= model.scheduler.config.num_train_timesteps else torch.tensor(0.0)

    # leverage reversed x0
    reverse_x0 = (1 / torch.sqrt(model.scheduler.alphas_cumprod[t]) * (
        latents - noise_pred * torch.sqrt(1 - model.scheduler.alphas_cumprod[t])))

    latents = reverse_x0 * torch.sqrt(alpha_bar_next) + torch.sqrt(1 - alpha_bar_next) * noise_pred

    all_latents.append(latents)

# all_latents[N] -> N: DDIM steps (X_{T-1} ~ X_0)
return latents, all_latents
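
For readability, the loop body corresponds to the DDIM inversion update (writing ᾱ_t for alphas_cumprod[t], ε_θ for the guided noise prediction, and Δt = num_train_timesteps // num_inference_steps), which makes explicit that each appended latent lives at timestep t + Δt:

$$\hat{x}_0 = \frac{x_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon_\theta(x_t, t)}{\sqrt{\bar{\alpha}_t}}, \qquad x_{t+\Delta t} = \sqrt{\bar{\alpha}_{t+\Delta t}}\,\hat{x}_0 + \sqrt{1-\bar{\alpha}_{t+\Delta t}}\,\epsilon_\theta(x_t, t)$$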

From what I understand, when t = T-1 (961), next_timestep becomes T (981), so alpha_bar_next is ᾱ_T (ᾱ_981) and the newly computed latent should correspond to X_T (X_981).
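
For concreteness, here is a minimal sketch of the timestep arithmetic (assuming 1000 training timesteps, 50 DDIM steps, and the Stable Diffusion scheduler offset of 1, which matches the 961/981 values above):

num_train_timesteps = 1000
num_inference_steps = 50
step = num_train_timesteps // num_inference_steps      # 20

# Inversion traverses the schedule from small t to large t.
timesteps = list(range(1, num_train_timesteps, step))  # [1, 21, ..., 961, 981]

for i, t in enumerate(timesteps[:-1]):
    next_timestep = t + step
    print(f"iteration {i}: t={t} -> alpha_bar index {next_timestep}")
# Last line printed: iteration 48: t=961 -> alpha_bar index 981,
# i.e. the final stored latent is built with alpha_bar at t=981.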

However, according to the comment at the end of the code (all_latents[N] corresponds to X_{T-1} ~ X_0), it seems the stored latents start from X_{T-1}, not X_T.

Could you please clarify this point? Specifically: why is the range of all_latents described as X_{T-1} ~ X_0, instead of X_T ~ X_0?

Thank you in advance for your help!

Donus-S · Apr 06 '25, 22:04

Hello @Donus-S,

Thank you for your recognition of our work.

As the code was written quite some time ago, I tried to recall the exact reason behind that comment but unfortunately couldn't remember it precisely.

My guess is that the comment refers to the fact that transforming x_0 to x_T (or vice versa) takes T steps, where T = len(timesteps). Since we're only using timesteps[:-1] here, we go from x_0 to x_{T−1}, not all the way to x_T.
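
As a quick sanity check of this counting (a minimal sketch, assuming the 50-step schedule above and that nothing is appended to all_latents before the loop):

timesteps = list(range(1, 1000, 20))   # inversion order: [1, 21, ..., 981], 50 entries
hops = len(timesteps[:-1])             # 49 loop iterations, i.e. 49 appended latents
print(len(timesteps), hops)            # 50 49 -> one fewer hop than a full x_0 -> x_T traversal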

Hope this helps!

WindVChen · Apr 07 '25, 13:04