guided-diffusion
Bug in attention?
The scale is being calculated here as `1 / math.sqrt(math.sqrt(ch))`. The comment says it was adapted from the attention implementation here, where the scale is `int(C) ** (-0.5)`, which is `1 / math.sqrt(ch)`, not `1 / math.sqrt(math.sqrt(ch))`.
Is this change to use 2 square roots intentional?
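For concreteness, the two expressions clearly differ for any `ch > 1` (a quick illustration; `ch = 64` is just an example value, not tied to a particular model config):

```python
import math

ch = 64
scale_guided = 1 / math.sqrt(math.sqrt(ch))  # what unet.py computes: ch ** -0.25
scale_reference = int(ch) ** (-0.5)          # the referenced implementation: 1 / math.sqrt(ch)
print(scale_guided, scale_reference)         # 0.3535... vs 0.125
```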
Hello. At first it looks like a bug, but it isn't one, because the scale is applied twice in the einsum product: https://github.com/openai/guided-diffusion/blob/27c20a8fab9cb472df5d6bdd6c8d11c8f430b924/guided_diffusion/unet.py#L348-L351
An extra comment would have been welcome, because it does look like a typo at first sight.
So we are computing the right thing:

Q Kᵀ / sqrt(d_k) = (Q / sqrt(sqrt(d_k))) * (Kᵀ / sqrt(sqrt(d_k)))
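A quick numerical check confirms the equivalence (a minimal sketch; the shapes and tolerance are arbitrary, and the einsum string just mirrors the one linked above):

```python
import math
import torch as th

ch, length = 64, 16                    # arbitrary example sizes, not the model's
q = th.randn(1, ch, length)
k = th.randn(1, ch, length)

# Fourth-root scale applied to both q and k, as in the unet.py einsum
scale = 1 / math.sqrt(math.sqrt(ch))
weight_split = th.einsum("bct,bcs->bts", q * scale, k * scale)

# Standard scaling from "Attention is all you need": divide the logits once by sqrt(d_k)
weight_standard = th.einsum("bct,bcs->bts", q, k) / math.sqrt(ch)

print(th.allclose(weight_split, weight_standard, atol=1e-6))  # True
```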
The justification for the scaling factor is given in the "Attention is all you need" paper: https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf
In the footnote of the above paper: "To illustrate why the dot products get large, assume that the components of q and k are independent random variables with mean 0 and variance 1. Then their dot product, q · k = Σ_{i=1}^{d_k} q_i k_i, has mean 0 and variance d_k."
This probably means that to really do the right thing, q and k should have unit variance, which probably isn't the case here because they are using a GroupNorm(32): https://github.com/openai/guided-diffusion/blob/27c20a8fab9cb472df5d6bdd6c8d11c8f430b924/guided_diffusion/unet.py#L302 https://github.com/openai/guided-diffusion/blob/27c20a8fab9cb472df5d6bdd6c8d11c8f430b924/guided_diffusion/nn.py#L93-L100
So depending on the number of heads and the number of channels, the input variance may not be 1 (to be checked further).
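To see the footnote's claim numerically, one can sample unit-variance q and k and estimate the dot product's variance (a small sketch; `d_k` and the sample count are arbitrary choices):

```python
import torch as th

d_k, n_samples = 64, 100_000
q = th.randn(n_samples, d_k)        # components i.i.d. with mean 0, variance 1
k = th.randn(n_samples, d_k)

dots = (q * k).sum(dim=1)           # q · k for each of the n_samples pairs
print(dots.mean().item())           # close to 0
print(dots.var().item())            # close to d_k = 64
```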
Got it - thanks for the info!