
A question about max_attn_resolution and the number of cross_attn layers

Open · yangyichu opened this issue 7 months ago • 1 comment

I see that in the provided example checkpoint, max_attn_resolution is set to 16. During encoding, the image goes through down blocks at 64x64, 32x32, 16x16, and 16x16, and cross_attn is added twice, once after each 16x16 down block. During decoding, however, the image goes through 16x16, 32x32, 64x64, and 64x64, and cross_attn is added only once. Is this expected behavior? It results in an asymmetric encoder and decoder structure.
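
To make the counts concrete, here is a minimal sketch of how I understand the rule. It is not the repository's code: the helper `cross_attn_flags` is hypothetical, and the rule it encodes (a block gets cross_attn when its resolution is at most max_attn_resolution) is my assumption, inferred from the behavior described above. The per-block resolutions are the ones listed in this issue.

```python
# Hypothetical helper, not taken from the iVideoGPT code base.
# Assumed rule: a block gets cross-attention when its resolution
# is less than or equal to max_attn_resolution.
def cross_attn_flags(block_resolutions, max_attn_resolution):
    """Return one boolean per block: True if cross_attn would be attached."""
    return [res <= max_attn_resolution for res in block_resolutions]


MAX_ATTN_RESOLUTION = 16  # value reported for the example checkpoint

# Block resolutions as described in this issue.
encoder_resolutions = [64, 32, 16, 16]
decoder_resolutions = [16, 32, 64, 64]

encoder_flags = cross_attn_flags(encoder_resolutions, MAX_ATTN_RESOLUTION)
decoder_flags = cross_attn_flags(decoder_resolutions, MAX_ATTN_RESOLUTION)

# Number of cross_attn layers on each side under the assumed rule.
print("encoder cross_attn layers:", sum(encoder_flags))
print("decoder cross_attn layers:", sum(decoder_flags))
```

Under this assumed rule the encoder has two blocks at or below the threshold and the decoder has only one, which matches the asymmetry I am asking about.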

yangyichu · Jul 12 '24 05:07