Comparison with self-conditioning proposed in Analog Bits, and basic two-pass sampling baselines
Dear authors: Thank you for open-sourcing your great work, RCG.
However, I have noticed that:
- A closely related technique called self-conditioning (no clustering is performed, in contrast to [3, 34, 40]) was proposed in Analog Bits, https://arxiv.org/abs/2208.04202. This technique uses the x0 prediction from the previous denoising step as a condition and greatly improves the performance of Analog Bits. Recent works have shown that it is also effective for continuous generation (note that Analog Bits operates in a continuous state space and only discretizes the final sample). The technique is also compatible with parallel decoding methods: although parallel decoding predicts all masked tokens at each step and accepts only the most confident ones, all predicted tokens can still be used as conditioning for the next prediction step. A minimal sketch of a self-conditioned sampling loop is given after this list.
- All baseline methods use only one sampling pass, while RCG uses two, which may make the comparison unfair. It is well known that diffusion models can achieve higher generation quality with two sampling passes (first denoise Gaussian noise into an intermediate result, then add noise of an appropriate scale to it and denoise again), even without any additional training, so a naive baseline can be constructed this way (a sketch also follows this list).

  Moreover, it is straightforward to design a two-pass-sampling-aware algorithm: the first pass generates an intermediate result (optionally with a stop-gradient, optionally using a frozen model, and optionally using a partial forward process/masking plus one-step denoising/reconstruction), and the second pass uses the encoded intermediate result as a condition. Of course, this naive design might be inefficient to train; fortunately, self-conditioning is fully compatible with two-pass sampling.
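
For concreteness, here is a minimal sketch of a self-conditioned sampling loop in the spirit of Analog Bits. The denoiser `model(x_t, t, x0_cond)` and the update rule `scheduler_step(x0_pred, t, x_t)` are hypothetical placeholders, not functions from the RCG codebase:

```python
import torch

def sample_with_self_conditioning(model, scheduler_step, shape, num_steps, device="cpu"):
    # Minimal self-conditioning loop (sketch): the x0 prediction from the
    # previous denoising step is fed back as an extra conditioning input.
    x_t = torch.randn(shape, device=device)
    x0_cond = torch.zeros(shape, device=device)  # no prediction available at the first step
    for t in reversed(range(num_steps)):
        with torch.no_grad():
            x0_pred = model(x_t, t, x0_cond)   # predict x0, conditioned on the previous x0
        x0_cond = x0_pred                       # self-conditioning for the next step
        x_t = scheduler_step(x0_pred, t, x_t)   # one reverse-diffusion update
    return x_t
```

During training, as described in the Analog Bits paper, the condition is set to zero with some probability (e.g. 50%) and otherwise filled with a detached x0 prediction from an extra forward pass, so the sampling-time feedback loop matches what the model saw in training.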
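And here is a sketch of the naive, training-free two-pass baseline described above. `reverse_fn` and `forward_noise_fn` are assumed helpers wrapping the standard reverse and forward diffusion processes; they are not existing functions in the repository:

```python
import torch

def two_pass_sample(reverse_fn, forward_noise_fn, shape, num_steps, t_mid):
    # Naive training-free two-pass baseline (sketch).
    # reverse_fn(x, t_start): run the standard reverse (denoising) process
    #                         from timestep t_start down to 0.
    # forward_noise_fn(x0, t): apply the forward process to noise a clean
    #                          sample back to timestep t.

    # Pass 1: denoise pure Gaussian noise into an intermediate result.
    x_T = torch.randn(shape)
    x0_intermediate = reverse_fn(x_T, num_steps)

    # Pass 2: add noise of a moderate scale to the intermediate result,
    # then denoise again.
    x_mid = forward_noise_fn(x0_intermediate, t_mid)
    x0_refined = reverse_fn(x_mid, t_mid)
    return x0_refined
```

The two-pass-aware variant would differ only in the second pass, which would condition on an encoding of the (possibly detached) intermediate result rather than relying on re-noising alone.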
Would you consider including more comparisons and discussion of these aspects?
Thank you for any help you can offer.