InstructPix2Pix training script for SD3
Is your feature request related to a problem? Please describe. It would be awesome to see a training script for InstructPix2Pix based on Stable Diffusion 3 such as this.
Describe the solution you'd like. A training script for IP2P based on SD3.
Describe alternatives you've considered. None
Additional context. None
Would you like to give it a try? If you start on it and open a pull request, we can all work on it together and finish it.
Sure, happy to give it a shot. I'll try to send in a PR this week. Thanks!
Thank you! The main problem is that the original InstructPix2Pix dataset is of too low quality for strong models like SD3 and Flux.
I tried it on SDXL ages ago and couldn't get anything sane out of it. To validate the implementation correctness, I used a small dataset from this project and tried to overfit it. I was able to do that.
The best-case scenario would be Meta releasing the EMU Edit dataset, but I guess they won't give us, the diffusion kids, anything soon.
We could make a synthetic dataset using an older pix2pix model, or a ControlNet, or even some proprietary option. Which would be best for making instruct-edit data?
A couple of ideas.
There is a small high-quality test dataset from EMU Edit folks: https://huggingface.co/datasets/facebook/emu_edit_test_set.
What we could do to bootstrap a dataset as large as the one used in InstructPix2Pix:
- Take the original prompts and the edited-image prompts from InstructPix2Pix and generate image pairs using Flux Pro, possibly at multiple resolutions, ensuring a good balance between the buckets (a rough sketch follows below).
- Take the prompts and edit instructions from EMU Edit and use an LLM to generate similar ones while maintaining diversity, and then follow the pipeline above.
The EMU Edit paper provides a really good set of instructions for preparing edit datasets (at least for some of the task categories they cover), but without any code references the technical debt felt a little too high for me.
Haven't thought through the filtering yet, but for starters, I believe this could be nice.
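To make the first bullet concrete, here is a rough sketch of how the pair generation could look if we swap in an open model (FLUX.1-dev via diffusers) for Flux Pro. The caption triples and column names are purely illustrative:

```python
# Rough sketch: generate (before, after) image pairs from caption pairs with a
# shared seed so the two renders stay visually close. Flux Pro would be swapped
# in via its API; everything below (captions, columns) is illustrative.
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")

caption_pairs = [
    # (input caption, output caption, edit instruction)
    ("a photo of a cat on a sofa", "a photo of a dog on a sofa", "replace the cat with a dog"),
]

records = []
for idx, (src_caption, tgt_caption, instruction) in enumerate(caption_pairs):
    # Re-seed before each render so both prompts start from the same noise.
    generator = torch.Generator("cuda").manual_seed(idx)
    src_image = pipe(src_caption, generator=generator, num_inference_steps=28).images[0]
    generator = torch.Generator("cuda").manual_seed(idx)
    tgt_image = pipe(tgt_caption, generator=generator, num_inference_steps=28).images[0]
    records.append(
        {"input_image": src_image, "edited_image": tgt_image, "edit_prompt": instruction}
    )
```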
Have you tried using the HQEdit dataset? Here's the link: https://thefllood.github.io/HQEdit_web/
That could be nice! The data quality seems a lot higher here, so SD3 or Flux should be good to try out. The number of samples is a bit too low, though. But it should be a good starting point.
If you're interested in setting up a quick script, I am happy to help you run it.
But we could likely also combine the dataset with the test set of EMU Edit just to increase coverage, WDYT?
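If we go that route, a minimal sketch with 🤗 Datasets could look like the following; the dataset ids, split names, and column names are assumptions and would need to be checked against the actual Hub cards:

```python
# Minimal sketch: map both sources onto a shared schema and concatenate them.
# Dataset ids, splits, and column names below are assumptions, not verified.
from datasets import load_dataset, concatenate_datasets

hq_edit = load_dataset("UCSC-VLAA/HQ-Edit", split="train")
emu_test = load_dataset("facebook/emu_edit_test_set_generations", split="validation")

# Shared schema: (input_image, edited_image, edit_prompt).
hq_edit = hq_edit.rename_columns(
    {"output_image": "edited_image", "edit": "edit_prompt"}
).select_columns(["input_image", "edited_image", "edit_prompt"])

emu_test = emu_test.rename_columns(
    {"image": "input_image", "instruction": "edit_prompt"}
).select_columns(["input_image", "edited_image", "edit_prompt"])

combined = concatenate_datasets([hq_edit, emu_test]).shuffle(seed=42)
combined.push_to_hub("my-user/hqedit-plus-emu-edit")  # hypothetical repo id
```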
I'm interested but rn I'm very busy with work. Next week, I'll have more availability and I think I can make progress on this!
With all the revolution around Flux, I think it could be interesting to try it out with that model haha!
Super! I will try to get a collated dataset (HQEdit + EMU Edit) and help run your script, experiments, etc. We can jam on ideas here too :)
Cc: @apolinario @linoytsaban for awareness.
They have deleted the edited images from the EMU Edit dataset here https://huggingface.co/datasets/facebook/emu_edit_test_set/
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
> They have deleted the edited images from the EMU Edit dataset here https://huggingface.co/datasets/facebook/emu_edit_test_set/
I think it's just moved here?
https://huggingface.co/datasets/facebook/emu_edit_test_set_generations
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
> Sure, happy to give it a shot. I'll try to send in a PR this week. Thanks!
Hey, I am also interested in doing InstructPix2Pix on SD 3. Do we already have a repo for this project? We can work together!
Nice, feel free to share results :)
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
So where are we now? :D I'm interested in SD3, SDXL, and Flux IP2P. Has anybody had any luck?
SD 3 uses DiTs, which is a bit different from the UNet-based InstructPix2Pix. You could try a method similar to the one proposed in PixArt-δ, which introduced a ControlNet-Transformer for DiTs.
Chen, Junsong, Yue Wu, Simian Luo, Enze Xie, Sayak Paul, Ping Luo, Hang Zhao, and Zhenguo Li. “PIXART-δ: Fast and Controllable Image Generation with Latent Consistency Models.” arXiv, January 10, 2024. https://doi.org/10.48550/arXiv.2401.05252.
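For reference, a minimal sketch of the ControlNet-Transformer idea (copy the first N transformer blocks of a frozen DiT, run them on hidden states plus control tokens, and add their outputs back through zero-initialized projections). Class and attribute names are illustrative, not a specific diffusers API:

```python
import copy
import torch
import torch.nn as nn

def zero_module(module: nn.Module) -> nn.Module:
    # Zero-init so the control branch starts as a no-op, as in ControlNet.
    for p in module.parameters():
        nn.init.zeros_(p)
    return module

class ControlNetTransformerSketch(nn.Module):
    """Illustrative ControlNet-Transformer: the first `num_control_blocks`
    blocks of a frozen DiT are copied, run on (hidden + control) tokens, and
    their outputs are injected back through zero-initialized linear layers."""

    def __init__(self, base_blocks: nn.ModuleList, hidden_dim: int, num_control_blocks: int = 13):
        super().__init__()
        self.control_blocks = nn.ModuleList(
            copy.deepcopy(base_blocks[i]) for i in range(num_control_blocks)
        )
        self.zero_projs = nn.ModuleList(
            zero_module(nn.Linear(hidden_dim, hidden_dim)) for _ in range(num_control_blocks)
        )

    def forward(self, hidden_states, control_tokens, block_kwargs):
        # control_tokens: patchified + embedded conditioning image,
        # same shape as hidden_states.
        c = hidden_states + control_tokens
        residuals = []
        for block, proj in zip(self.control_blocks, self.zero_projs):
            c = block(c, **block_kwargs)
            residuals.append(proj(c))
        # Each residual is added to the output of the corresponding frozen base block.
        return residuals
```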
Hi @Bili-Sakura, could you elaborate on the idea a bit more? I've been trying to migrate InstructPix2Pix to Sana (which is also a DiT architecture), but the results are quite poor: the edited image does not follow the original input image at all. To support image conditioning, I added additional input channels (32->64) to the (first) patch-embedding layer. I'm not sure what difference between UNet and DiT makes the results so undesirable.
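For context, this is roughly what I mean by widening the patch embedding, zero-initializing the extra channels as in the InstructPix2Pix recipe; the attribute names in the usage comment are from my setup and may differ in other models:

```python
import torch
import torch.nn as nn

def widen_patch_embed(patch_proj: nn.Conv2d, extra_in_channels: int) -> nn.Conv2d:
    """Return a copy of the patch-embedding conv that accepts extra input
    channels (e.g. 32 -> 64 for latent concatenation of the source image).
    The new channels are zero-initialized so the model output is unchanged
    until those weights are trained, as in the InstructPix2Pix recipe."""
    new_proj = nn.Conv2d(
        patch_proj.in_channels + extra_in_channels,
        patch_proj.out_channels,
        kernel_size=patch_proj.kernel_size,
        stride=patch_proj.stride,
        padding=patch_proj.padding,
        bias=patch_proj.bias is not None,
    )
    with torch.no_grad():
        new_proj.weight.zero_()
        new_proj.weight[:, : patch_proj.in_channels] = patch_proj.weight
        if patch_proj.bias is not None:
            new_proj.bias.copy_(patch_proj.bias)
    return new_proj

# Hypothetical usage on a DiT-style model with a conv patch embedding:
# transformer.patch_embed.proj = widen_patch_embed(transformer.patch_embed.proj, 32)
# latents_in = torch.cat([noisy_latents, source_image_latents], dim=1)
```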
@ChunChenLin
I am currently working on image editing based on diffusion models, so I'll share some naive insights below:
First, back to your question: the InstructPix2Pix framework does not seem to work for NVIDIA's SANA. I would say that may be due to two reasons: a) SANA employs a trending new linear-attention transformer architecture, discarding some key modules of classic DiTs to accelerate generation. Though it is reported to achieve good results in text-to-image generation, whether it also works for image-to-image generation (i.e., image editing) remains an open question. b) It is also possible that the editing framework itself is fine but limited by your training recipe, especially a lack of compute (GFLOPs).
Anyway, moving back to image-editing frameworks, here are some more alternatives:
- UltraEdit adapts InstructPix2Pix to Stable Diffusion 3 (MM-DiT), which works well.
- Step1X-Edit integrates a frozen MLLM and Flux.1 (DiT) with a connector. For the input image, it follows OminiControl, which concatenates the condition tokens before feeding them into the DiT.
- ICEdit adopts an in-context editing framework that uses Flux.1-Fill; the image-editing task is framed much like an inpainting task with further cropping.
For a deeper understanding of current image-editing methods, I highly recommend the paper from my faculty, 'In-Context Edit: Enabling Instructional Image Editing with In-Context Generation in Large Scale Diffusion Transformer'.
Zhang et al., In-Context Edit: Enabling Instructional Image Editing with In-Context Generation in Large Scale Diffusion Transformer, arXiv 2025
Channel-wise concat was shown not to work as well as sequence-level concat; see HiDream-E1 and the Flux Kontext technical report. Of course, the attention scale changes and the model slows down quite a bit.
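To make the contrast concrete, here is a toy sketch of the two conditioning strategies for a DiT; the shapes and patch size are hypothetical:

```python
# Illustrative contrast between the two conditioning strategies for a DiT.
# Hypothetical shapes: B = batch, C = latent channels, H/W = latent size, p = patch size.
import torch

B, C, H, W = 1, 16, 64, 64
noisy_latents = torch.randn(B, C, H, W)
cond_latents = torch.randn(B, C, H, W)  # VAE-encoded source image

# (a) Channel-wise concat (InstructPix2Pix-style): widen the patch-embed input,
#     token count stays the same, so attention cost is unchanged.
channel_concat = torch.cat([noisy_latents, cond_latents], dim=1)  # (B, 2C, H, W)

# (b) Sequence-level concat (OminiControl / Kontext-style): patchify the condition
#     separately and append its tokens, doubling the sequence length so every
#     block attends across both images (slower, but reported to work better).
def patchify(x: torch.Tensor, p: int = 2) -> torch.Tensor:
    B, C, H, W = x.shape
    x = x.reshape(B, C, H // p, p, W // p, p).permute(0, 2, 4, 1, 3, 5)
    return x.reshape(B, (H // p) * (W // p), C * p * p)

tokens = torch.cat([patchify(noisy_latents), patchify(cond_latents)], dim=1)  # (B, 2N, C*p*p)
```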