EscherNet
Some questions about image encoder and reference images
Thanks for your nice work. I'm still confused by the choice of ConvNeXtV2 as the image encoder in this project. The paper mentions that the reason for employing ConvNeXtV2 is that a frozen CLIP can only accept one reference image and only extracts high-level semantic features.
I want to know:
- For instance, IP-Adapter also uses a pretrained CLIP image encoder, yet it can accept multiple images as conditions. How should I understand the claim that ConvNeXtV2 adapts to multiple reference images while CLIP does not? (In the code, the reference images appear to be stacked into a single tensor of shape [N, C, H, W]; why couldn't CLIP use the same approach? I sketch what I mean in the first snippet below.)
- Where is the conclusion derived from that ConvNeXtV2 extracts both high-level and low-level features, whereas CLIP extracts only high-level ones? Or is the rationale for choosing ConvNeXtV2 simply that it is lightweight enough to be fine-tuned during training?
- When multiple reference views are encoded as encoder hidden states and injected through cross-attention to promote reference-to-target consistency, must each reference view share a field of view with the target view? This is easy to satisfy in object-level generation, but in scene-level generation the larger range of camera motion means that not every reference view overlaps with a given target view, so some reference images may carry no useful information for that target. Does EscherNet's approach of promoting reference-to-target consistency through cross-attention still work in such scenarios, or do you have any suggestions? (The second snippet below shows how I picture this cross-attention.)
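To make the first question concrete, here is a minimal sketch of how I read the multi-view conditioning. The encoder is a placeholder (a single conv standing in for ConvNeXtV2) and all names are my own, not taken from the EscherNet code:

```python
import torch
import torch.nn as nn

# Placeholder backbone: any CNN returning a [N, C', H', W'] feature map
# behaves the same way here; this conv merely stands in for ConvNeXtV2.
convnext = nn.Conv2d(3, 768, kernel_size=16, stride=16)

refs = torch.randn(5, 3, 256, 256)         # N=5 reference views stacked on the batch dim
feat = convnext(refs)                      # [5, 768, 16, 16] spatial feature map per view
tokens = feat.flatten(2).transpose(1, 2)   # [5, 256, 768] patch tokens per view
cond = tokens.reshape(1, -1, 768)          # [1, 5*256, 768] one long conditioning sequence
# `cond` would then be passed as encoder_hidden_states to the UNet cross-attention.
```

As far as I can tell, CLIP's vision tower could be batched and its patch tokens concatenated in exactly the same way, which is why the multi-image argument alone doesn't fully convince me.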
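And for the third question, here is how I picture the cross-attention (single head, no projections, purely illustrative shapes, just to state the concern precisely):

```python
import torch
import torch.nn.functional as F

B, Lq, Lkv, D = 1, 4096, 5 * 256, 768   # target latent tokens vs. tokens from 5 views
q = torch.randn(B, Lq, D)               # queries from the target view's latents
kv = torch.randn(B, Lkv, D)             # keys/values from all reference views together

attn = F.softmax(q @ kv.transpose(1, 2) / D ** 0.5, dim=-1)   # [B, Lq, Lkv]
out = attn @ kv                         # each target token mixes all views' tokens

# The softmax can in principle assign near-zero weight to tokens from views that
# share no field of view with the target, but nothing enforces this, which is
# what worries me for scene-level camera trajectories.
```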
Looking forward to your reply.