testing the editing performance directly using stage 1's ckpt
Hi! I have a question: it seems that after stage-1 training, the LLaVA + Q-Former output is aligned with the CLIP text space. Could we directly use the LLaVA and Q-Former after stage 1? Or did you run an experiment testing the editing performance with the stage-1 ckpt?
Yes. Although we did not have enough space in the main paper to discuss the importance of the first-stage textual alignment training, our ablation study shows that the first-stage textual alignment is necessary; without it, the results are not good. I guess this is because the semantic gap between the LLM and CLIP is large, and SD struggles to understand features from the LLM without textual alignment. A similar observation is also reported by GILL.
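For concreteness, here is a minimal sketch of what such a stage-1 textual-alignment objective could look like. This is not the released SmartEdit code; the module names (`mllm_with_qformer`, `clip_text_encoder`) and the plain MSE loss are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def stage1_alignment_loss(mllm_with_qformer, clip_text_encoder,
                          captions, llm_tokenizer, clip_tokenizer):
    """Sketch of stage-1 textual alignment: push the MLLM + Q-Former outputs
    toward the frozen CLIP text encoder's embeddings for the same caption."""
    # MLLM + Q-Former produce a fixed number of query tokens per caption,
    # e.g. [B, 77, 768] to match CLIP's text embedding shape (assumption).
    llm_inputs = llm_tokenizer(captions, return_tensors="pt", padding=True)
    pred_embeds = mllm_with_qformer(**llm_inputs)                 # [B, 77, 768]

    # The frozen CLIP text encoder provides the alignment target.
    with torch.no_grad():
        clip_inputs = clip_tokenizer(captions, return_tensors="pt",
                                     padding="max_length", max_length=77)
        target_embeds = clip_text_encoder(**clip_inputs).last_hidden_state  # [B, 77, 768]

    # Simple regression loss; the actual training may use additional terms.
    return F.mse_loss(pred_embeds, target_embeds)
```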
Thanks for your reply! I see the important role of the BIM. There are two training stages for SmartEdit: stage 1 aligns the MLLM with the CLIP text encoder, and stage 2 tunes all modules. In stage 2, the MLLM, BIM, and UNet are jointly optimized. If we directly use the stage-1 MLLM to assist the SD model and skip stage 2, how would the results look? In other words, what is the role of stage 2's joint tuning?
The role of stage-2 training is to transfer the MLLM's ability to the diffusion model. Stage-1 only aligns the MLLM with CLIP, which means the MLLM can produce CLIP-like features, but its own abilities (reasoning, better instruction following, etc.) are only transferred to the diffusion model through stage-2 joint training. Therefore stage-2 is necessary.
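As a rough sketch of what stage-2 joint optimization looks like conceptually (again not the released code; the module names, the way the BIM consumes image latents, and the plain noise-prediction loss are assumptions for illustration):

```python
import torch
import torch.nn.functional as F

def stage2_joint_step(mllm_with_qformer, bim, unet, scheduler,
                      latents, instruction_tokens, optimizer):
    """Sketch of stage-2 joint training: the diffusion (noise-prediction) loss
    backpropagates through the UNet, the BIM, and the MLLM/Q-Former together."""
    # MLLM + Q-Former turn the instruction (and image context) into conditioning tokens.
    cond = mllm_with_qformer(**instruction_tokens)     # [B, N, D]
    # BIM lets the conditioning tokens interact with image features (assumption:
    # here the latents stand in for those image features).
    cond = bim(cond, latents)

    # Standard latent-diffusion noise-prediction objective.
    noise = torch.randn_like(latents)
    t = torch.randint(0, scheduler.config.num_train_timesteps,
                      (latents.shape[0],), device=latents.device)
    noisy_latents = scheduler.add_noise(latents, noise, t)
    noise_pred = unet(noisy_latents, t, encoder_hidden_states=cond).sample

    loss = F.mse_loss(noise_pred, noise)   # gradients flow into all trainable modules
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```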
Gotcha!