iVideoGPT
A question about max_attn_resolution and the number of cross-attention layers
In the provided example checkpoint, max_attn_resolution is set to 16. During encoding, the image passes through down blocks at 64x64, 32x32, 16x16, and 16x16, so cross_attn is added twice, once after each of the two 16x16 down blocks. During decoding, however, the image passes through up blocks at 16x16, 32x32, 64x64, and 64x64, so cross_attn is added only once. Is this expected behavior? It makes the encoder and decoder structures asymmetric.
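
To make the counting concrete, here is a minimal sketch of how I am reading the config. It is not the repository's code: the helper attn_block_indices is hypothetical, and the rule "a block gets cross-attention when its resolution is at most max_attn_resolution" is my assumption about how the setting is applied. The resolution lists are the ones described above.

```python
# Hypothetical helper, not taken from the iVideoGPT code base.
# Assumption: a block receives cross-attention when its feature-map
# resolution is <= max_attn_resolution.
def attn_block_indices(block_resolutions, max_attn_resolution):
    """Return the indices of blocks that would receive cross-attention."""
    return [
        i for i, res in enumerate(block_resolutions)
        if res <= max_attn_resolution
    ]


MAX_ATTN_RESOLUTION = 16  # value from the example checkpoint

# Per-block resolutions as described in this issue.
encoder_resolutions = [64, 32, 16, 16]  # down blocks
decoder_resolutions = [16, 32, 64, 64]  # up blocks

encoder_attn = attn_block_indices(encoder_resolutions, MAX_ATTN_RESOLUTION)
decoder_attn = attn_block_indices(decoder_resolutions, MAX_ATTN_RESOLUTION)

print("encoder blocks with cross-attention:", encoder_attn)
print("decoder blocks with cross-attention:", decoder_attn)
```

Under that assumption, two encoder blocks qualify but only one decoder block does, which is the asymmetry I am asking about.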