Training ControlNet parameters instead of finetuning
Firstly, thanks for this excellent work.
After reading the paper and experimenting with the code, I thought I'd drop a suggestion. Rather than altering a pretrained LDM (Stable Diffusion) directly and fine-tuning its weights to account for the additional camera pose and domain, it might be beneficial to instead tune a separate set of UNet parameters, as is done in the ControlNet architecture (https://github.com/lllyasviel/ControlNet), to prevent deterioration of the unconditioned model output.
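To make the idea concrete, here is a minimal PyTorch sketch of the ControlNet-style pattern, not code from this repository. The name `ControlBranch`, the single-block setup, and the assumption that the camera pose has already been encoded as a spatial map `cond` are all hypothetical and only for illustration. The pretrained block is frozen, a trainable copy receives the conditioning, and zero-initialized 1x1 convolutions ensure the output initially matches the original model.

```python
import copy

import torch
import torch.nn as nn


class ControlBranch(nn.Module):
    """Hypothetical wrapper: frozen pretrained block plus a trainable copy.

    The copy is connected through zero-initialized 1x1 convolutions, so at
    initialization the wrapper reproduces the pretrained block exactly.
    """

    def __init__(self, base_block: nn.Module, channels: int, cond_channels: int):
        super().__init__()
        # Copy first, so the copy keeps requires_grad=True on its parameters.
        self.trainable = copy.deepcopy(base_block)

        # Freeze the pretrained weights; they are never updated.
        self.base = base_block
        for p in self.base.parameters():
            p.requires_grad_(False)

        # Zero-initialized projections for the condition and the residual.
        self.cond_in = nn.Conv2d(cond_channels, channels, kernel_size=1)
        self.zero_out = nn.Conv2d(channels, channels, kernel_size=1)
        for conv in (self.cond_in, self.zero_out):
            nn.init.zeros_(conv.weight)
            nn.init.zeros_(conv.bias)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # Assumes the block preserves the channel count and spatial size.
        residual = self.zero_out(self.trainable(x + self.cond_in(cond)))
        return self.base(x) + residual


if __name__ == "__main__":
    # Stand-in for one pretrained UNet block (assumption for the demo).
    block = nn.Conv2d(4, 4, kernel_size=3, padding=1)
    branch = ControlBranch(block, channels=4, cond_channels=2)

    x = torch.randn(1, 4, 8, 8)
    cond = torch.randn(1, 2, 8, 8)  # e.g. a pose embedding as a spatial map

    # Before any training, the output equals that of the frozen block.
    print(torch.allclose(branch(x, cond), block(x)))
```

Only the copied block and the two 1x1 convolutions would receive gradient updates, so the original weights, and therefore the unconditioned output, stay intact.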
Apologies for making this suggestion in a GitHub issue, but I didn't see contact info on your site or in the paper.
Hello. Thanks for your suggestions! Indeed, we haven't conducted such experiments. I think your suggestion is worth trying. However, due to limited resources, we don't plan to do this in the near future. We welcome collaboration on this topic.