Potential regression in deterministic outputs
Describe the bug
I've started noticing different outputs ~~in the latest version of diffusers~~ starting from diffusers 0.4.0 when compared against 0.3.0. This is my test code (extracted from a notebook):
```python
import diffusers
from diffusers import StableDiffusionPipeline, DDIMScheduler
import torch
from IPython.display import display

def run_tests(pipe):
    torch.manual_seed(1000)
    display(pipe("A photo of Barack Obama smiling with a big grin").images[0])
    torch.manual_seed(1000)
    display(pipe("Labrador in the style of Vermeer").images[0])

pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")
pipe = pipe.to("cuda")
run_tests(pipe)
```
The first prompt produces identical results. The second one, however, results in different outputs:
0.3.0
main@a3efa433eac5feba842350c38a1db29244963fb5
Using DDIM, both prompts generate different images.
```python
scheduler = DDIMScheduler(beta_start=0.00085, beta_end=0.012, beta_schedule="scaled_linear", num_train_timesteps=1000)
pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4", scheduler=scheduler)
pipe = pipe.to("cuda")
run_tests(pipe)
```
DDIM 0.3.0
DDIM main
DDIM 0.3.0
DDIM main
In addition, there's this post from a forum user reporting very different results in the img2img pipeline: https://discuss.huggingface.co/t/notable-differences-between-other-implementations-of-stable-diffusion-particularly-in-the-img2img-pipeline/24635/5. They also opened another issue recently, #901. Cross-referencing here; it may or may not be related to this issue.
Reproduction
As explained above.
Logs
No response
System Info
diffusers: main @ a3efa433eac5feba842350c38a1db29244963fb5 vs v0.3.0
Update regarding the img2img pipeline: there are small differences using DDIM; I didn't notice any with PNDM or k-LMS. All tests were done in full precision for now.
DDIM v0.3.0
DDIM main
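For reference, a minimal sketch of what such an img2img comparison can look like. This is not the exact notebook code used above: it assumes the current StableDiffusionImg2ImgPipeline API (the 0.3.x/0.6.x releases discussed here named the image argument `init_image`) and a hypothetical local `sketch.png` as the init image.

```python
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline, DDIMScheduler

scheduler = DDIMScheduler(
    beta_start=0.00085, beta_end=0.012,
    beta_schedule="scaled_linear", num_train_timesteps=1000,
)
pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", scheduler=scheduler
).to("cuda")

# Hypothetical init image; any 512x512 RGB image works for the comparison.
init_image = Image.open("sketch.png").convert("RGB").resize((512, 512))

torch.manual_seed(1000)
image = pipe(
    "Labrador in the style of Vermeer",
    image=init_image,   # `init_image=` in older releases
    strength=0.75,
).images[0]
image.save("img2img_ddim.png")
```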
Thanks a lot @pcuenca!
Do you think we could narrow down further when the differences started happening? E.g. I guess it's between 0.3.0 and 0.4.0, but it'd also be nice to find out exactly which PR is the origin :-)
Otherwise happy to look into it myself in ~2 days!
Yes, I'll investigate a bit :)
Update: 0.4.0 seems to suffer from the same behavior as 0.6.0.
I tried it on 0.3.0 and some 0.5.0 fork I have, and in both versions the dog looks like the second one. Can you reproduce it with Docker?
That's interesting. I haven't tried Docker, but all the tests were done on the same system, just updating the version of diffusers.
@pcuenca Could you provide the system info? GPU type, python/torch/torchvision versions, CUDA/CUDNN versions?
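For anyone else reproducing this, those details can be gathered with plain PyTorch calls (a minimal sketch; `diffusers-cli env` should print a similar report):

```python
import torch

print("GPU:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "none")
print("PyTorch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("CUDA (build):", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
```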
I can't reproduce with a T4/V100 GPU on GCP for the default scheduler. Both commits give the second dog image. Google Colab also gives the same image.
For the dog prompt with DDIM: running from commit 3f55d13 to 9bca402, there are 4 commits where an image difference occurs:
- 074-a9fdb3d => 075-6bd005e: 0.09812037150065105
- 099-3b747de => 100-bd8df2d: 0.0812060038248698
- 116-a7058f4 => 117-9ebaea5: 0.22810236612955728
- 132-1070e1a => 133-f1484b8: 0.17296854654947916
The last number is the ratio of equal values over the total number of values in the numpy arrays. I uploaded the images to this repo.
For the default scheduler, these 4 commits also give some difference in the numpy arrays, but the ratio is > 0.998, so visually we can't see any difference.
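For clarity, the ratio above can be computed with a few lines of numpy; a small sketch (the filenames are hypothetical, matching the per-commit images uploaded to the repo):

```python
import numpy as np
from PIL import Image

def equal_ratio(path_a: str, path_b: str) -> float:
    """Fraction of pixel values that are exactly equal between two images."""
    a = np.asarray(Image.open(path_a))
    b = np.asarray(Image.open(path_b))
    return float((a == b).mean())

# e.g. equal_ratio("dog_074-a9fdb3d.png", "dog_075-6bd005e.png")  # hypothetical filenames
```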
Hi @ydshieh, thanks a lot for looking into this! I've done a few more experiments and these are my results:
My workstation
- GPU: 3090, CUDA 11.4
- Platform: Linux-5.4.0-91-generic-x86_64-with-glibc2.31
- Python version: 3.9.12
- PyTorch version (GPU?): 1.12.1+cu113 (True)
- Huggingface_hub version: 0.10.0
- Transformers version: 4.22.2
0.3.0 fp32
0.6.0 fp16 (note the different ear)
0.6.0 fp32 (blue hat)
Colab T4
- GPU: Tesla T4, CUDA 11.2
- Platform: Linux-5.10.133+-x86_64-with-Ubuntu-18.04-bionic
- Python version: 3.7.15
- PyTorch version (GPU?): 1.12.1+cu113 (True)
- Huggingface_hub version: 0.10.1
- Transformers version: 4.23.1
0.3.0 fp32
0.6.0 fp16
0.6.0 fp32
Colab A100
- GPU: A100 (40 GB), CUDA 11.2
- Platform: Linux-5.10.133+-x86_64-with-Ubuntu-18.04-bionic
- Python version: 3.7.15
- PyTorch version (GPU?): 1.12.1+cu113 (True)
- Huggingface_hub version: 0.10.1
- Transformers version: 4.23.1
0.3.0 fp32
0.6.0 fp16
0.6.0 fp32
As you can see, results differ across versions. Even if we ignore the fp16 tests, results in the A100 Colab are different from those in the T4 environment. The A100 fp32 results are consistent between diffusers 0.3.0 and 0.6.0 (no blue hat), and the T4 fp32 results are also consistent but different (blue hat). On my computer, 0.3.0 has no blue hat but 0.6.0 does.
Any idea about what might be going on?
@pcuenca No useful insight from my side (at least so far).
In general, I am not sure it really makes sense to try to ensure reproducibility across different GPUs.
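For context, even same-device run-to-run determinism usually requires the standard PyTorch settings below; none of these make different GPU architectures agree, they only reduce nondeterminism on a single machine (a rough sketch, not something the pipelines currently set):

```python
import torch

torch.manual_seed(1000)
torch.backends.cudnn.benchmark = False      # disable autotuned, input-dependent kernel selection
torch.backends.cudnn.deterministic = True   # prefer deterministic cuDNN algorithms
# Optional and stricter; raises if an op has no deterministic implementation:
# torch.use_deterministic_algorithms(True)
```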
Regarding the difference across versions, I think we can take a look at the 4 commits I mentioned above. Notice that diffusers has (a lot of?) optimizations, which are probably the cause of the differences.
IMO, we should try to ensure reproducibility in a fixed environment but across diffusers versions. And if a commit does introduce a difference, it should be clearly recorded, so we can find that information quickly whenever necessary.
Regarding GPU: 3090, I think at some of the 4 commits the numerical differences on the 3090 are larger than on the other 2 GPUs, which causes the visual difference.
I believe that on the other 2 GPUs, even though the images look the same between 0.3.0 and 0.6.0, there are still some numerical differences.
It would be nice (but not necessary) if you could get the outputs for all 4 commits I mentioned. I don't mean we'd get a clear/definite answer, but it would at least give us some more information.
Thanks a mille @ydshieh for the analysis, that's super insightful!
Small update here:
- 1.) We now know that we cannot guarantee reproducibility (only loosely "close" reproducibility) because of https://github.com/pytorch/pytorch/issues/87992 => therefore we can never really guarantee that the exact same images are generated across devices
- 2.) I checked and I cannot reproduce a difference with this code:
```python
import diffusers
from diffusers import StableDiffusionPipeline, DDIMScheduler
import torch
from IPython.display import display

def run_tests(pipe):
    torch.manual_seed(1000)
    display(pipe("A photo of Barack Obama smiling with a big grin").images[0])
    torch.manual_seed(1000)
    display(pipe("Labrador in the style of Vermeer").images[0])

pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")
pipe = pipe.to("cuda")
run_tests(pipe)
```
between 0.3.0 and 0.7.0dev using a V100
- 3.) The aggressive unit tests https://github.com/huggingface/diffusers/blob/82d56cf192f3a3c52e0708b6c8db4a6d959244dd/tests/models/test_models_unet_2d.py#L414 all pass for 0.3.0. This is good, as it means our UNet is not responsible for the potential regression above.
Overall, this issue now seems much less severe to me than it did originally, and a big part of it is probably simply due to "uncontrollable" randomness.
Next:
- Add aggressive scheduler tests and check differences between 0.3.0 and 0.7.0dev
- Add aggressive minimal step pipeline tests and check differences between 0.3.0 and 0.7.0dev
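As a concrete starting point for those scheduler checks, a rough sketch (not an existing test): run a few DDIM steps on a fixed random latent with a stand-in model output, and diff a slice of the result across versions.

```python
import torch
from diffusers import DDIMScheduler

scheduler = DDIMScheduler(
    beta_start=0.00085, beta_end=0.012,
    beta_schedule="scaled_linear", num_train_timesteps=1000,
)
scheduler.set_timesteps(10)

torch.manual_seed(0)
sample = torch.randn(1, 4, 64, 64)
for t in scheduler.timesteps:
    model_output = 0.1 * sample                      # stand-in for the UNet prediction
    sample = scheduler.step(model_output, t, sample).prev_sample

print(sample.flatten()[:5])                          # record this slice and diff it across versions
```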
Just a bit curious:
> I checked and I cannot reproduce a difference with this code:

What kind of difference are you checking for here, @patrickvonplaten?
Well, if you mean there is no visual difference, there would still be a numerical difference, as I found in the analysis. I think it would still be a good idea to record when such differences occur across commits (or on a daily basis), so we can track them easily. But it's just a suggestion.
Yes, I meant visually. We simply cannot guarantee exact numerical equivalence across devices and all kinds of versions, IMO.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Once the pipeline tests are fully updated, we should also write a doc explaining the problem with reproducibility in general with diffusion models. cc @anton-l
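Until that doc exists, a minimal sketch of how randomness can be pinned today: pass an explicit torch.Generator to the pipeline instead of seeding globally. Results are still only reproducible on the same device and library versions, per the PyTorch issue linked above.

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4").to("cuda")

# Reproducible on the same GPU + software stack, but not across devices.
generator = torch.Generator(device="cuda").manual_seed(1000)
image = pipe("Labrador in the style of Vermeer", generator=generator).images[0]
```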