Sansa Gong
Hi, the time you estimate is close to ours with four 80G A100 GPUs. Using FP16 could save training time (we didn't implement this in the current version of the code).
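For reference, here is a minimal sketch of how mixed-precision training could be wired in with PyTorch's built-in AMP. The model, data, and loss below are toy placeholders rather than the repo's actual training loop, and it assumes a CUDA GPU:

```python
import torch
import torch.nn as nn

# Toy stand-in for the denoising network; the real DiffuSeq model differs.
model = nn.Sequential(nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, 128)).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()  # rescales the loss to avoid FP16 underflow

for step in range(100):
    x = torch.randn(32, 128, device="cuda")  # dummy batch
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():           # run the forward pass in FP16 where safe
        loss = ((model(x) - x) ** 2).mean()   # placeholder for the diffusion loss
    scaler.scale(loss).backward()             # backward on the scaled loss
    scaler.step(optimizer)                    # unscale grads, then step
    scaler.update()
```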
Hi, yes, `tT_loss` does not pass through the transformer layers, but there are still learnable params, i.e. the word embedding parameters (from `x_start`). We can regard it as a kind...
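A toy illustration of the point: even though `tT_loss` bypasses the transformer, it is computed from `x_start`, which is built from the word embedding, so the embedding still receives gradients. The schedule constant below is made up for the demo and is not the repo's actual value:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab, dim = 1000, 16
embedding = nn.Embedding(vocab, dim)       # learnable word embedding
tokens = torch.randint(0, vocab, (4, 10))  # dummy batch of token ids

x_start = embedding(tokens)                # x_0 is built from the embedding
# q(x_T | x_0) has mean sqrt(alpha_bar_T) * x_0; tT_loss pushes that mean toward 0.
sqrt_alpha_bar_T = 0.01                    # illustrative value, not the real schedule
tT_loss = (sqrt_alpha_bar_T * x_start).pow(2).mean()

tT_loss.backward()
print(embedding.weight.grad.abs().sum())   # nonzero: the embedding is being trained
```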
Maybe you can try decoding with a single GPU.
Hi, I think the model is not well trained, so it cannot recover meaningful tokens. Maybe you could try other hyper-params. Another concern is that the size of your dataset...
Actually it's not easy, because the training and inference stages are not strictly symmetric. You can try to recover from 50%-noised data instead of pure Gaussian noise.
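If you want to try the 50% suggestion, here is a sketch under assumed names (the helper, the toy linear beta schedule, and the tensor shapes are all illustrative): forward-diffuse the clean embeddings to an intermediate step and start the reverse loop there instead of from pure noise:

```python
import torch

def partial_noise(x0, alphas_cumprod, t):
    """Forward-diffuse clean features x0 to step t, i.e. sample q(x_t | x_0)."""
    eps = torch.randn_like(x0)
    a_bar = alphas_cumprod[t]
    return a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps

# Toy linear schedule; the real code defines its own betas.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_cumprod = torch.cumprod(1 - betas, dim=0)

x0 = torch.randn(4, 10, 16)   # stand-in for the embedded target sequence
t_start = T // 2              # 50% noising instead of t = T - 1
x_t = partial_noise(x0, alphas_cumprod, t_start)
# ...then run the reverse (denoising) loop from t_start down to 0.
```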
Hi, in the diffusion process, recovering the noise $\epsilon$, $x_0$, or $x_{t-1}$ can all work, as long as the process is symmetric between training and sampling. Previous works show that...
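The parameterizations are interchangeable through the closed form $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$, so a prediction of one determines the other. A quick sanity check of that algebra (function names here are illustrative, not the repo's API):

```python
import torch

def x0_from_eps(x_t, eps, a_bar):
    """Recover x_0 from a predicted noise eps at step t."""
    return (x_t - (1 - a_bar).sqrt() * eps) / a_bar.sqrt()

def eps_from_x0(x_t, x0, a_bar):
    """Recover the implied noise from a predicted x_0 at step t."""
    return (x_t - a_bar.sqrt() * x0) / (1 - a_bar).sqrt()

a_bar = torch.tensor(0.3)      # illustrative cumulative-alpha value
x0 = torch.randn(2, 8)
eps = torch.randn_like(x0)
x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps   # sample of q(x_t | x_0)
assert torch.allclose(x0_from_eps(x_t, eps, a_bar), x0, atol=1e-5)
assert torch.allclose(eps_from_x0(x_t, x0, a_bar), eps, atol=1e-5)
```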
DiffuSeq focuses on conditional generation (generating `y` given `x`), while Diffusion-LM focuses on generation with constraints (generating a sentence `s` given an attribute `a`). Using an additional model is orthogonal to...
Currently, the pad is treated as a regular token, and the generated length can change during the generation process. This avoids the need for an additional length-prediction module,...
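A minimal sketch of this padding convention (`PAD_ID` and the helper names are hypothetical, not the repo's API): the target is filled to a fixed maximum length with pad tokens, and the effective output length is whatever remains after stripping pads from the decoded sequence:

```python
PAD_ID = 0  # hypothetical pad token id

def pad_to_max(ids, max_len):
    """Treat PAD as a regular token: fill the target up to max_len."""
    return ids + [PAD_ID] * (max_len - len(ids))

def strip_pads(ids):
    """The effective length is whatever remains after dropping PADs."""
    return [i for i in ids if i != PAD_ID]

seq = pad_to_max([17, 42, 9], max_len=8)        # [17, 42, 9, 0, 0, 0, 0, 0]
# After sampling, the decoded ids may contain PADs; stripping them yields
# a variable-length output with no separate length-prediction module.
print(strip_pads([17, 42, 0, 9, 0, 0, 0, 0]))   # [17, 42, 9]
```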
Hi, we didn't use the saved embedding. The word embedding params are built into the model, so the resume operation can load them.
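In other words, because the embedding is a submodule of the model, a plain `state_dict` save/resume round-trip restores it along with everything else. A toy illustration (`TinyModel` is a stand-in, not the DiffuSeq model):

```python
import torch
import torch.nn as nn

class TinyModel(nn.Module):
    """Toy stand-in: the word embedding is a submodule of the model."""
    def __init__(self):
        super().__init__()
        self.word_embedding = nn.Embedding(1000, 16)
        self.net = nn.Linear(16, 16)

model = TinyModel()
torch.save(model.state_dict(), "checkpoint.pt")       # embedding saved with the rest

resumed = TinyModel()
resumed.load_state_dict(torch.load("checkpoint.pt"))  # embedding restored too
```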
The rounding operation maps the word embedding vectors back to discrete tokens, and we then map these tokens into vectors again as the input of the next generation step. This operation makes sure that...
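A sketch of one way rounding can be implemented, via nearest-neighbor lookup against the embedding matrix and re-embedding the resulting ids (the actual code may instead use logits from a tied output head; shapes and names here are illustrative):

```python
import torch
import torch.nn as nn

emb = nn.Embedding(1000, 16)      # shared word embedding
x0_hat = torch.randn(4, 10, 16)   # predicted x_0 vectors at some sampling step

with torch.no_grad():
    # Rounding: snap each vector to its nearest embedding row (a token id).
    dists = torch.cdist(x0_hat.reshape(-1, 16), emb.weight)  # (4*10, vocab)
    token_ids = dists.argmin(dim=-1).view(4, 10)

    # Re-embed the rounded tokens as the input for the next step, so the
    # intermediate states stay on the embedding manifold.
    x0_rounded = emb(token_ids)
```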