Pedro Cuenca
I think the method is very well described here: https://huggingface.co/blog/assisted-generation, and there are some benchmarks with real-world gains on different GPUs and tasks. TL;DR: it's helpful most times even when...
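For quick reference, a minimal sketch of how the method is invoked through `model.generate` in `transformers` (the checkpoints below are just examples; any large target model plus a smaller assistant sharing the same tokenizer should work):

```py
from transformers import AutoModelForCausalLM, AutoTokenizer

# Example pair: a large target model and a much smaller assistant
# that uses the same tokenizer/vocabulary.
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b")
assistant = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")

inputs = tokenizer("The future of AI is", return_tensors="pt")
# Passing `assistant_model` enables assisted generation: the assistant
# drafts candidate tokens that the target model then verifies.
outputs = model.generate(**inputs, assistant_model=assistant, max_new_tokens=40)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```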
Oh, that's probably because scaled dot-product attention is enabled by default if torch 2 is in use. `pipe.unet.set_default_attn_processor()` should work. I can test and submit a PR in a few...
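For context, a hedged sketch of where that call would go (the checkpoint id is just an example):

```py
import torch
from diffusers import StableDiffusionPipeline

# Example checkpoint; the same applies to other Stable Diffusion pipelines.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)
# Revert the UNet to the default attention processor instead of the
# scaled dot-product attention path that torch 2 enables automatically.
pipe.unet.set_default_attn_processor()
```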
Hello @rovo79! Conversion works for me.
- Would you mind sharing the exact conversion command you used, so we can try to reproduce?
- Did you try with the stable...
Update: I could reproduce with PyTorch 2.1.0, which was released yesterday. In the meantime, I recommend you use PyTorch 2.0.1 to convert your model.
Another workaround is to add the following line after the pipeline has been loaded:

```py
pipe.vae.set_default_attn_processor()
```
Reference: https://github.com/huggingface/diffusers/issues/3115. In addition, many ControlNet models already contain the `base_model` property (added manually or trained using the Flax script). See for example https://huggingface.co/lllyasviel/sd-controlnet-canny/blob/main/README.md
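As an aside (not part of the original reply), the property can be read programmatically with `huggingface_hub`:

```py
from huggingface_hub import ModelCard

# Load the model card of a ControlNet repo and inspect its metadata.
card = ModelCard.load("lllyasviel/sd-controlnet-canny")
print(card.data.to_dict().get("base_model"))
```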
Looking into it. Thanks, I didn't realize the encoder inputs had changed.
Submitted these PRs after converting the VAE encoder again:
- https://huggingface.co/apple/coreml-stable-diffusion-v1-4/discussions/4/files
- https://huggingface.co/apple/coreml-stable-diffusion-v1-5/discussions/5/files
- https://huggingface.co/apple/coreml-stable-diffusion-2-base/discussions/6/files
- https://huggingface.co/apple/coreml-stable-diffusion-2-1-base/discussions/2/files

Tested locally with this script:

```bash
declare -a repos=(
    coreml-stable-diffusion-v1-4
    coreml-stable-diffusion-v1-5
    coreml-stable-diffusion-2-base
    coreml-stable-diffusion-2-1-base
)
for repo in...
```
Thanks for the confirmation @keijiro! I just merged those PRs so I think this issue can be closed now :)
I followed the same path independently, and can confirm that using bilinear instead of bicubic interpolation for the position encodings results in no noticeable visual differences in the generated depth map.
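For illustration only, a sketch of the kind of swap being discussed, assuming ViT-style patch position encodings laid out on a square grid (names and shapes are assumptions, not the model's actual code):

```py
import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed: torch.Tensor, new_hw: tuple, mode: str = "bilinear"):
    """Interpolate (1, H*W, C) position encodings to a new grid size."""
    _, n, c = pos_embed.shape
    hw = int(n ** 0.5)  # assumes a square grid of patches
    grid = pos_embed.reshape(1, hw, hw, c).permute(0, 3, 1, 2)  # (1, C, H, W)
    # mode="bicubic" is the usual default here; "bilinear" is friendlier to
    # conversion backends and, per the observation above, visually equivalent.
    grid = F.interpolate(grid, size=new_hw, mode=mode, align_corners=False)
    return grid.permute(0, 2, 3, 1).reshape(1, new_hw[0] * new_hw[1], c)
```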