About loss in training_losses_seq2seq() when time step t=0
Thanks for your great work.
I have a question about the loss calculation in `training_losses_seq2seq()` when the sampled time step is t=0.
https://github.com/Shark-NLP/DiffuSeq/blob/bdc8f0adbff22e88c8530d1f20c3c7589c061d40/diffuseq/gaussian_diffusion.py#L612-L619
If t=0, the `x_t = self.q_sample()` line looks incorrect, since it tries to sample $x_0$ from $q(x_t|x_0)$. The `model_output` is then invalid because `x_t` is invalid.
It then seems like you try to replace the invalid term in the following code.
https://github.com/Shark-NLP/DiffuSeq/blob/bdc8f0adbff22e88c8530d1f20c3c7589c061d40/diffuseq/gaussian_diffusion.py#L623-L626
But you still use the invalid variable `model_output` to calculate the MSE loss.
Is there anything I have misunderstood? Could you please help me clarify the code? Thanks.
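For concreteness, my understanding is that `q_sample()` implements the standard closed-form forward sample $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$. A minimal sketch (a hypothetical linear schedule for illustration, not the repo's exact code):

```python
import numpy as np

# Hypothetical linear beta schedule, for illustration only.
T = 2000
betas = np.linspace(1e-4, 0.02, T)
alphas_cumprod = np.cumprod(1.0 - betas)

def q_sample(x_start, t, noise):
    # Closed-form forward sample: x_t = sqrt(alpha_bar_t) * x_0
    #                                  + sqrt(1 - alpha_bar_t) * noise
    return (np.sqrt(alphas_cumprod[t]) * x_start
            + np.sqrt(1.0 - alphas_cumprod[t]) * noise)

x_start = np.ones(4)
noise = np.random.default_rng(0).standard_normal(4)
x_0_sampled = q_sample(x_start, 0, noise)
# Even at t = 0, this adds first-step noise on top of x_start,
# so the result is not literally x_start.
```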
Hi,
`q_sample()` also works for t=0; it returns `x_start`. The model learns the MSE loss between $x_0$ and $Emb(w^x)$ here.
Sorry, but I still don't understand why `q_sample()` returns `x_start` when t=0.
https://github.com/Shark-NLP/DiffuSeq/blob/bdc8f0adbff22e88c8530d1f20c3c7589c061d40/diffuseq/gaussian_diffusion.py#L612
The param `x_start` here equals `x_start_mean + self.sqrt_one_minus_alphas_cumprod[0] * noise` (according to `self._get_x_start()`). So it stands for $x_0 = Emb(w) + \beta_0 \cdot noise$ in your paper.
https://github.com/Shark-NLP/DiffuSeq/blob/bdc8f0adbff22e88c8530d1f20c3c7589c061d40/diffuseq/gaussian_diffusion.py#L233-L259
Now dig into the `q_sample()` function.
The returned `x_t` equals `self.sqrt_alphas_cumprod[0] * x_start + self.sqrt_one_minus_alphas_cumprod[0] * noise`. So it stands for $x_t = \alpha_0 x_0 + \beta_0 \cdot noise$, and I don't see how this return value can stand for $x_0$.
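In other words, composing the two steps gives $x_t = \alpha_0\,Emb(w) + \alpha_0\beta_0\,\epsilon + \beta_0\,\epsilon'$, not $x_0$. A sketch with hypothetical first-step coefficients and zero noise for clarity (the constants are assumptions, not read from the repo):

```python
import numpy as np

# Hypothetical first-step coefficients for a sqrt-style schedule with T = 2000:
# A0 ~ sqrt_alphas_cumprod[0], B0 ~ sqrt_one_minus_alphas_cumprod[0].
A0, B0 = 0.993, 0.121

def get_x_start(emb, noise):
    # _get_x_start(): x_0 = Emb(w) + beta_0 * noise
    return emb + B0 * noise

def q_sample_t0(x_start, noise):
    # q_sample() at t = 0: x_t = alpha_0 * x_0 + beta_0 * noise
    return A0 * x_start + B0 * noise

emb = np.ones(4)
x0 = get_x_start(emb, np.zeros(4))   # with zero noise: x_0 = Emb(w)
xt = q_sample_t0(x0, np.zeros(4))    # with zero noise: x_t = alpha_0 * Emb(w)
# xt differs from x0 by the factor alpha_0, so q_sample() does not
# literally return x_start here.
```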
Hi, sorry for the ambiguity; let me elucidate. For a canonical noise scheduler, $\beta_0 \rightarrow 0$, so it returns $x_0$. However, we use the sqrt noise scheduler, where $\beta_0 = 0.121$ when $T = 2000$. The model learns the MSE loss between $\alpha_0 x_0 + \beta_0 \cdot noise$ and $Emb(w^x) + \beta_0 \cdot noise$, so the statement above still holds.
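This first-step coefficient can be checked numerically. A minimal sketch, assuming the sqrt noise schedule from Diffusion-LM, $\bar\alpha(t) = 1 - \sqrt{t + 0.0001}$, discretized in the usual `betas_for_alpha_bar()` way:

```python
import math

T, s = 2000, 1e-4  # number of steps, sqrt-schedule offset

def alpha_bar(t):
    # sqrt noise schedule (as in Diffusion-LM): alpha_bar(t) = 1 - sqrt(t + s)
    return 1.0 - math.sqrt(t + s)

# Discretization as in betas_for_alpha_bar():
# betas[0] = 1 - alpha_bar(1/T) / alpha_bar(0), hence
alphas_cumprod_0 = alpha_bar(1 / T) / alpha_bar(0)

# beta_0 in the discussion above = sqrt_one_minus_alphas_cumprod[0]
beta_0 = math.sqrt(1.0 - alphas_cumprod_0)
print(round(beta_0, 3))  # 0.121
```

For comparison, a canonical linear schedule starting at $\beta = 10^{-4}$ would give $\sqrt{1-\bar\alpha_0} = 0.01$, which is why the "returns $x_0$" shortcut is harmless there but not under the sqrt schedule.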
Sorry, but I cannot deduce your conclusion from your code.
> The model learns the MSE loss between $\alpha_0 x_0 + \beta_0 \cdot noise$ and $Emb(w^x) + \beta_0 \cdot noise$
Please give me a hint based on some code links. Again, thank you for your patience!
Hi! I got a similar question regarding the quoted lines.
It seems to me that `model_out_x_start = model_output`, since we are not predicting the noise term. In addition, we are not using `pred_xprev`. In this case, why do we bother to use the helper function `_x0_helper`?
Thanks!
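For what it's worth, here is a minimal sketch of the $x_0$-prediction case being described (function name and coefficients are hypothetical, not the repo's exact code): when the network predicts $x_0$ directly, `pred_xstart` is just the raw model output, and `pred_xprev` is only an extra posterior-mean computation on top of it.

```python
import numpy as np

def x0_helper_sketch(model_output, x_t, coef1_t, coef2_t):
    """Hypothetical x0-parameterized helper. coef1_t / coef2_t stand for
    the posterior-mean coefficients of q(x_{t-1} | x_t, x_0)."""
    pred_xstart = model_output  # no epsilon reparameterization needed
    pred_xprev = coef1_t * pred_xstart + coef2_t * x_t
    return {"pred_xstart": pred_xstart, "pred_xprev": pred_xprev}

out = x0_helper_sketch(model_output=np.ones(4), x_t=np.zeros(4),
                       coef1_t=0.5, coef2_t=0.5)
# pred_xstart is exactly the model output, as the question observes;
# the helper only adds value if the pred_xprev branch is actually used.
```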