
Inquiry on some details of the method.

Open leekum2018 opened this issue 2 years ago • 7 comments

As said in the second paragraph of Section 4.3, "We attribute the superior performance of DiffusionBERT to its onetime sampling of all tokens". I wonder what "onetime sampling of all tokens" means: does it mean generating all the tokens of a sentence at once? If so, it seems to conflict with the demonstration in Table 1. Thank you!

leekum2018 avatar Dec 20 '22 13:12 leekum2018

Hi,

Yes, we generate all tokens in one diffusion step. We use DDIM-style sampling: we predict $x_0$ and then obtain $x_{t-1}$ from the forward process. The demonstration in Table 1 shows the input to BERT at time step $t-1$.

Besides, the corresponding predicted $x_0$ consists of less informative tokens when $t$ is large and gradually gains semantic meaning as $t$ goes to 0. That is also the motivation for our spindle noise schedule.
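In case it helps, here is a minimal sketch of one such reverse step for the absorbing ([MASK]) kernel. The names `bert`, `mask_id` and `alpha_bar` are placeholders rather than our actual code (`bert` stands for any masked LM that returns token logits, `alpha_bar` for an array of cumulative keep-probabilities), and it uses a single scalar schedule instead of the per-token spindle schedule:

```python
import torch

@torch.no_grad()
def reverse_step(x_t, t, bert, mask_id, alpha_bar):
    """One reverse step with the x_0-parameterization: BERT predicts x_0 for
    all positions at once, then x_{t-1} is drawn from the forward-process
    posterior q(x_{t-1} | x_t, x_0) of absorbing-state (masking) diffusion."""
    logits = bert(input_ids=x_t).logits      # (batch, seq, vocab): \tilde{p}(x_0 | x_t)
    x0_hat = logits.argmax(dim=-1)           # predicted x_0 (argmax for simplicity)

    # Unmasked tokens in x_t stay fixed; a masked token is revealed as its
    # predicted x_0 with probability (abar_{t-1} - abar_t) / (1 - abar_t).
    reveal_prob = (alpha_bar[t - 1] - alpha_bar[t]) / (1.0 - alpha_bar[t])
    reveal = (torch.rand(x_t.shape, device=x_t.device) < reveal_prob) & (x_t == mask_id)
    return torch.where(reveal, x0_hat, x_t)
```

Iterating this from $t = T$ down to $1$, starting from an all-[MASK] sequence, corresponds to the generation procedure described above.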

Hope this helps. If you have more questions, please feel free to contact me.

Hzfinfdu avatar Dec 20 '22 14:12 Hzfinfdu

Thank you for your reply! I have a further question. According to your reply, does it mean you model $p_{\theta}(x_{t-1}|x_t)$ as
$$p_{\theta}(x_{t-1}\mid x_t) = \sum_{\widetilde{x}_0} q(x_{t-1}\mid x_t, \widetilde{x}_0)\,\widetilde{p}(\widetilde{x}_{0}\mid x_t)?$$
And is the term $\widetilde{p}(\widetilde{x}_{0}|x_t)$ the output of BERT? Thank you!

leekum2018 avatar Dec 21 '22 04:12 leekum2018

Yes, that's right. DDIM sampling helps trade off speed against generation quality, and predicting $x_0$ directly is closer to the MLM training objective.

Hzfinfdu avatar Dec 22 '22 10:12 Hzfinfdu

Hi, I have another question. In Eq. 9, how is $H(x_{0}^{i})$ computed? In other words, what is the distribution of $x_{0}^{i}$ used to calculate $H(x_{0}^{i})$? I have a hard time understanding why Eq. 9 holds. Thank you!

leekum2018 avatar Dec 27 '22 05:12 leekum2018

Hi,

In fact, $H(x_0^i)$ can be calculated in many ways. We compute the entropy of each token as the negative logarithm of its frequency in the tokenized training corpus.
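For concreteness, a minimal sketch of that computation (the corpus argument is a placeholder for any iterable of token-id sequences, not our actual data pipeline):

```python
import math
from collections import Counter

def token_entropies(tokenized_corpus):
    """H(x_0^i) estimated as the negative log of each token's relative
    frequency in the tokenized training corpus (one possible choice)."""
    counts = Counter(tok for sentence in tokenized_corpus for tok in sentence)
    total = sum(counts.values())
    return {tok: -math.log(count / total) for tok, count in counts.items()}
```

The resulting per-token values are then looked up for each position when constructing the schedule.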

Since a masked token loses all its information, the expected information loss of the $i$-th token at $t$ is $\overline{\alpha}_t^i H(x_0^i)$. We get Eq. 9 by taking the sum over the sequence.

Hope this helps.

Hzfinfdu avatar Dec 27 '22 05:12 Hzfinfdu

For the following formula from Structured Denoising Diffusion Models in Discrete State-Spaces, why is the LHS proportional to the RHS? Could you please give me some hints? I have a hard time deriving this.
$$p_{\theta}(x_{t-1}\mid x_t) \propto \sum_{\widetilde{x}_0} q(x_{t-1}, x_t\mid \widetilde{x}_0)\,\widetilde{p}_{\theta}(\widetilde{x}_0\mid x_t)$$

leekum2018 avatar Jan 10 '23 12:01 leekum2018

Hi @leekum2018,

you can refer to this: https://openreview.net/forum?id=h7-XixPCAL&noteId=xm7onR_Sg0L
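In case a sketch helps, here is one way to see it, written out only for the absorbing ([MASK]) kernel that DiffusionBERT uses. Starting from the $x_0$-parameterization discussed above and applying Bayes' rule to the forward-process posterior:

$$
p_\theta(x_{t-1}\mid x_t)
= \sum_{\widetilde{x}_0} q(x_{t-1}\mid x_t,\widetilde{x}_0)\,\widetilde{p}_\theta(\widetilde{x}_0\mid x_t)
= \sum_{\widetilde{x}_0} \frac{q(x_{t-1},x_t\mid \widetilde{x}_0)}{q(x_t\mid \widetilde{x}_0)}\,\widetilde{p}_\theta(\widetilde{x}_0\mid x_t).
$$

For the absorbing kernel, $q(x_t\mid\widetilde{x}_0)$ takes the same value for every $\widetilde{x}_0$ that is compatible with the unmasked positions of $x_t$ (a factor of $1-\overline{\alpha}_t$ per masked position and $\overline{\alpha}_t$ per unmasked one), while incompatible $\widetilde{x}_0$ contribute zero to the sum anyway. That denominator is therefore a constant that does not depend on $x_{t-1}$ and can be absorbed into the normalizer, which gives the stated proportionality.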

Hope it helps!

Siddharth-Shrivastava7 avatar Aug 18 '23 03:08 Siddharth-Shrivastava7