
About loss in training_losses_seq2seq() when time step t=0

[Open] skpig opened this issue 2 years ago • 5 comments

Thanks for your great work. I have a question about the loss calculation in training_losses_seq2seq() when the sampled time step is t=0.

https://github.com/Shark-NLP/DiffuSeq/blob/bdc8f0adbff22e88c8530d1f20c3c7589c061d40/diffuseq/gaussian_diffusion.py#L612-L619

If t=0, the x_t = self.q_sample() line seems incorrect, since at t=0 it effectively tries to sample $x_0$ itself from $q(x_t|x_0)$. The resulting model_output is then invalid because x_t is invalid. It looks like you then try to replace the invalid term in the following code.

https://github.com/Shark-NLP/DiffuSeq/blob/bdc8f0adbff22e88c8530d1f20c3c7589c061d40/diffuseq/gaussian_diffusion.py#L623-L626

But you still use the invalid model_output to calculate the MSE loss.
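For readers who don't want to open the permalinks, here is my paraphrase of the two quoted regions inside training_losses_seq2seq() (approximate, from my reading of that commit; names may not match exactly):

```python
# First quoted region (around L612-L619), paraphrased:
# forward noising plus the main MSE term.
noise = th.randn_like(x_start)
x_t = self.q_sample(x_start, t, noise=noise, mask=input_ids_mask)  # x_t ~ q(x_t | x_0)
model_output = model(x_t, self._scale_timesteps(t), **model_kwargs)
terms["mse"] = mean_flat((x_start - model_output) ** 2)

# Second quoted region (around L623-L626), paraphrased:
# entries with t == 0 have their loss replaced by one against
# x_start_mean, i.e. the embedding Emb(w).
model_out_x_start = self._x0_helper(model_output, x_t, t)['pred_xstart']
t0_mask = (t == 0)
t0_loss = mean_flat((x_start_mean - model_out_x_start) ** 2)
terms["mse"] = th.where(t0_mask, t0_loss, terms["mse"])
```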

Is there anything I have misunderstood? Could you please help clarify the code? Thanks.

skpig avatar Feb 26 '23 06:02 skpig

Hi, q_sample() does work for t=0; it returns x_start. The model learns the MSE loss between $x_0$ and $Emb(w^x)$ here.

summmeer avatar Feb 27 '23 02:02 summmeer

Sorry, but I still don't understand why q_sample() returns x_start when t=0.

https://github.com/Shark-NLP/DiffuSeq/blob/bdc8f0adbff22e88c8530d1f20c3c7589c061d40/diffuseq/gaussian_diffusion.py#L612

The parameter x_start here equals x_start_mean + self.sqrt_one_minus_alphas_cumprod[0] * noise (according to self._get_x_start()), so it stands for $x_0 = Emb(w) + \beta_0 \cdot noise$ in your paper, where $\beta_0 = \sqrt{1-\bar\alpha_0}$.

https://github.com/Shark-NLP/DiffuSeq/blob/bdc8f0adbff22e88c8530d1f20c3c7589c061d40/diffuseq/gaussian_diffusion.py#L233-L259

Now dig into the q_sample() function. The returned x_t equals self.sqrt_alphas_cumprod[0] * x_start + self.sqrt_one_minus_alphas_cumprod[0] * noise, i.e. $x_t = \alpha_0 x_0 + \beta_0 \cdot noise$ with $\alpha_0 = \sqrt{\bar\alpha_0}$. I don't know what this return value x_t stands for.
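Substituting the first expression into the second (writing $\epsilon_1, \epsilon_2$ for the two independent noise draws), what q_sample() actually returns at t=0 is

$x_t\big|_{t=0} = \alpha_0 \left( Emb(w) + \beta_0\,\epsilon_1 \right) + \beta_0\,\epsilon_2 = \alpha_0\,Emb(w) + \alpha_0 \beta_0\,\epsilon_1 + \beta_0\,\epsilon_2,$

which equals $x_0$ only when $\alpha_0 = 1$ and $\beta_0 = 0$.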

skpig avatar Feb 27 '23 03:02 skpig

Hi, sorry for the ambiguity; let me elaborate. For a canonical noise schedule, $\beta_0 \rightarrow 0$, so q_sample() returns $x_0$ at t=0. However, we use the sqrt noise schedule, where $\beta_0 = 0.121$ when $T = 2000$. The model learns the MSE loss between $\alpha_0 x_0 + \beta_0 \cdot noise$ and $Emb(w^x) + \beta_0 \cdot noise$, so the statement above still holds.
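A quick standalone check of these constants (the schedule formula here follows improved-diffusion's betas_for_alpha_bar with $\bar\alpha(t) = 1 - \sqrt{t + 0.0001}$; the exact constants are my assumption, so verify against the repo):

```python
import numpy as np

# sqrt noise schedule, T = 2000, built the same way as
# improved-diffusion's betas_for_alpha_bar.
T = 2000
alpha_bar = lambda t: 1 - np.sqrt(t + 0.0001)
betas = np.array([
    min(1 - alpha_bar((i + 1) / T) / alpha_bar(i / T), 0.999)
    for i in range(T)
])
alphas_cumprod = np.cumprod(1.0 - betas)

# Coefficients q_sample() uses at t = 0:
print(np.sqrt(alphas_cumprod[0]))      # alpha_0 ~ 0.993 (close to, but not, 1)
print(np.sqrt(1 - alphas_cumprod[0]))  # beta_0  ~ 0.121 (clearly not 0)
```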

summmeer avatar Feb 27 '23 07:02 summmeer

Sorry, but I cannot deduce your conclusion from your code.

> The model learns the MSE loss between $\alpha_0 x_0 + \beta_0 \cdot noise$ and $Emb(w^x) + \beta_0 \cdot noise$

Please give me a hint with links to the relevant code. Again, thank you for your patience!

skpig avatar Mar 02 '23 02:03 skpig

Hi! I have a similar question regarding the quoted lines.

It seems to me that model_out_x_start = model_output, since we are not predicting the noise term. In addition, we are not using pred_xprev. In this case, why do we bother to use the helper function _x0_helper?
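For concreteness, here is how I read _x0_helper (a paraphrase, not the exact source; the branch structure and names are my guess):

```python
# Paraphrase of _x0_helper (names approximate):
def _x0_helper(self, model_output, x, t):
    if self.predict_xstart:
        # DiffuSeq trains the network to predict x_0 directly,
        # so pred_xstart is just the raw model output.
        pred_xstart = model_output
    else:
        # epsilon-prediction branch, unused in the default DiffuSeq config.
        pred_xstart = self._predict_xstart_from_eps(x_t=x, t=t, eps=model_output)
    # pred_xprev is the posterior mean of q(x_{t-1} | x_t, x_0); it is
    # computed here even though the training loss never touches it.
    pred_prev, _, _ = self.q_posterior_mean_variance(x_start=pred_xstart, x_t=x, t=t)
    return {'pred_xprev': pred_prev, 'pred_xstart': pred_xstart}
```

So in the predict-x0 configuration, model_out_x_start really does seem to be just model_output.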

Thanks!

LetianY avatar May 03 '24 04:05 LetianY