Wondering about attention
First of all, thanks for the great implementation; it's very readable, simple to use, and quick to get results with.
I noticed you're using self-attention differently from Ho et al.: they apply it at the 16x16 resolution layers of the network (for both 32x32 and 256x256 inputs), while you apply it at the 8x8 resolution layers of the U-Net. Did it just happen to end up there, or was there some guiding logic behind the placement?
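To make it concrete, here is a toy illustration of what I mean by "placement" (made-up names, not your code), just listing which feature-map resolutions get self-attention under the two choices for a 32x32 input that is downsampled 32 -> 16 -> 8 -> 4:

```python
# Toy illustration (not the repo's code): which resolutions get self-attention.
PAPER_ATTN_RES = {16}  # Ho et al. (same choice for 32x32 and 256x256 inputs)
REPO_ATTN_RES = {8}    # what the U-Net here appears to do

for res in (32, 16, 8, 4):
    print(f"{res}x{res}: paper={'attn' if res in PAPER_ATTN_RES else '-'}, "
          f"repo={'attn' if res in REPO_ATTN_RES else '-'}")
```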
Hi @yanivnik
Thank you for pointing this out; the placement was not intentional on my part. I did not thoroughly compare where the self-attention layer goes, so I am not sure whether my version results in decreased sample quality. If you find that it does, I would be happy to modify the code to better reflect what is done in the original work.
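For reference, if the placement does turn out to matter, the fix could be as small as driving it by a resolution set. Below is only a rough, self-contained sketch with made-up names (`SelfAttention2d`, `down_path`, `attn_resolutions`), not the actual code in this repo, assuming a PyTorch-style stack of stride-2 downsampling blocks:

```python
import torch
import torch.nn as nn

class SelfAttention2d(nn.Module):
    """Minimal single-head self-attention over spatial positions (sketch only)."""
    def __init__(self, channels):
        super().__init__()
        self.qkv = nn.Conv2d(channels, channels * 3, 1)
        self.proj = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        q, k, v = self.qkv(x).reshape(b, 3, c, h * w).unbind(dim=1)
        attn = torch.softmax(q.transpose(1, 2) @ k / c ** 0.5, dim=-1)  # (b, hw, hw)
        out = (v @ attn.transpose(1, 2)).reshape(b, c, h, w)
        return x + self.proj(out)

def down_path(image_size=32, base_channels=64, attn_resolutions=(16,)):
    """Stride-2 downsampling stack; inserts attention where the resolution matches."""
    layers = [nn.Conv2d(3, base_channels, 3, padding=1)]
    ch, res = base_channels, image_size
    while res > 4:
        layers.append(nn.Conv2d(ch, ch * 2, 3, stride=2, padding=1))
        ch, res = ch * 2, res // 2
        if res in attn_resolutions:  # Ho et al.: (16,); current placement here: (8,)
            layers.append(SelfAttention2d(ch))
    return nn.Sequential(*layers)

# Switching between the two placements becomes a one-argument change:
paper_like = down_path(attn_resolutions=(16,))
current = down_path(attn_resolutions=(8,))
print(paper_like(torch.randn(1, 3, 32, 32)).shape)  # torch.Size([1, 512, 4, 4])
```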