
Potential regression in deterministic outputs

Open pcuenca opened this issue 1 year ago • 19 comments

Describe the bug

I've started noticing different outputs ~~in the latest version of diffusers~~ starting from diffusers 0.4.0 when compared against 0.3.0. This is my test code (extracted from a notebook):

import diffusers
from diffusers import StableDiffusionPipeline, DDIMScheduler
import torch
from IPython.display import display

def run_tests(pipe):
    torch.manual_seed(1000)
    display(pipe("A photo of Barack Obama smiling with a big grin").images[0])
    torch.manual_seed(1000)
    display(pipe("Labrador in the style of Vermeer").images[0])

pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")
pipe = pipe.to("cuda")
run_tests(pipe)
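
(Aside: the seed can also be pinned by passing an explicit torch.Generator to the pipeline instead of seeding the global RNG. This is just a sketch; depending on the diffusers version, the initial latents may be drawn on a different device, so it is not guaranteed to reproduce the exact images below.)

import torch
from diffusers import StableDiffusionPipeline
from IPython.display import display

def run_tests_with_generator(pipe, device="cuda"):
    for prompt in [
        "A photo of Barack Obama smiling with a big grin",
        "Labrador in the style of Vermeer",
    ]:
        # One freshly seeded generator per prompt, mirroring the manual_seed calls above.
        generator = torch.Generator(device=device).manual_seed(1000)
        display(pipe(prompt, generator=generator).images[0])

# Reuse the `pipe` created above.
run_tests_with_generator(pipe)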

The first prompt produces identical results. The second one, however, results in different outputs:

0.3.0 labrador_0 3

main@a3efa433eac5feba842350c38a1db29244963fb5 labrador_0 6

Using DDIM, both prompts generate different images.

scheduler = DDIMScheduler(beta_start=0.00085, beta_end=0.012, beta_schedule="scaled_linear", num_train_timesteps=1000)
pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4", scheduler=scheduler)
pipe = pipe.to("cuda")
run_tests(pipe)

DDIM 0.3.0 obama_ddim_0 3

DDIM main obama_ddim_0 6

DDIM 0.3.0 labrador_ddim_0 3

DDIM main labrador_ddim_0 6

In addition, there's this post from a forum user with very different results in the img2img pipeline: https://discuss.huggingface.co/t/notable-differences-between-other-implementations-of-stable-diffusion-particularly-in-the-img2img-pipeline/24635/5. They recently opened another issue, #901; I'm cross-referencing it here as it may or may not be related to this one.

Reproduction

As explained above.

Logs

No response

System Info

diffusers: main @ a3efa433eac5feba842350c38a1db29244963fb5 vs v0.3.0

pcuenca avatar Oct 19 '22 09:10 pcuenca

Update regarding the img2img pipeline: there are small differences using DDIM, but I didn't notice any with PNDM or k-LMS. All tests were done in full precision for now.

DDIM v0.3.0 i2i_ddim_0 3

DDIM main i2i_ddim_0 6
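
(If anyone wants to reproduce an img2img comparison, here is a rough sketch; the prompt and init image are placeholders, not necessarily the ones I used, and note that the init image argument is called init_image in the versions discussed here but was later renamed to image.)

import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline, DDIMScheduler

scheduler = DDIMScheduler(beta_start=0.00085, beta_end=0.012, beta_schedule="scaled_linear", num_train_timesteps=1000)
pipe = StableDiffusionImg2ImgPipeline.from_pretrained("CompVis/stable-diffusion-v1-4", scheduler=scheduler)
pipe = pipe.to("cuda")

# Any fixed 512x512 RGB image works as the starting point (placeholder path).
init_image = Image.open("init.png").convert("RGB").resize((512, 512))

torch.manual_seed(1000)
# `init_image=` in the versions tested here; newer releases renamed it to `image=`.
result = pipe("A fantasy landscape in the style of Vermeer", init_image=init_image, strength=0.75).images[0]
result.save("i2i_ddim.png")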

pcuenca avatar Oct 19 '22 10:10 pcuenca

Thanks a lot @pcuenca !

Do you think we could try to narrow down even more when the differences started happening? E.g. I guess it's between 0.3.0 and 0.4.0, but then it'd also be nice to find out exactly which PR is the origin :-)

Otherwise happy to look into it myself in ~2 days!

patrickvonplaten avatar Oct 19 '22 10:10 patrickvonplaten

Yes, I'll investigate a bit :)

pcuenca avatar Oct 20 '22 07:10 pcuenca

Update: 0.4.0 seems to suffer from the same behavior as 0.6.0.

pcuenca avatar Oct 21 '22 10:10 pcuenca

I tried it on 0.3.0 and some 0.5.0 fork I have, and in both versions the dog looks like the second one. Can you reproduce it with Docker?

cccntu avatar Oct 21 '22 15:10 cccntu

I tried it on 0.3.0 and some 0.5.0 fork I have, and in both versions the dog looks like the second one. Can you reproduce it with Docker?

That's interesting. I haven't tried Docker, but all the tests were done on the same system, just updating the version of diffusers.

pcuenca avatar Oct 21 '22 18:10 pcuenca

@pcuenca Could you provide the system info? GPU type, python/torch/torchvision versions, CUDA/CUDNN versions?

ydshieh avatar Oct 22 '22 11:10 ydshieh

I can't reproduce this with a T4/V100 GPU on GCP for the default scheduler. Both commits give the second dog image. Google Colab also gives the same image.

ydshieh avatar Oct 22 '22 13:10 ydshieh

For the dog prompt with DDIM: running from commit 3f55d13 to 9bca402, there are 4 commits where an image difference occurs:

074-a9fdb3d => 075-6bd005e: 0.09812037150065105
099-3b747de => 100-bd8df2d: 0.0812060038248698
116-a7058f4 => 117-9ebaea5: 0.22810236612955728
132-1070e1a => 133-f1484b8: 0.17296854654947916

The last number is the ratio of equal values over the total number of values in the NumPy arrays. I uploaded the images to this repo.

For the default scheduler, these same 4 commits also give some difference in the NumPy arrays, but the ratio is > 0.998, so visually we can't see any difference.
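
(Roughly, the ratio can be computed like so; img_a.png and img_b.png are placeholder filenames for the two outputs being compared.)

import numpy as np
from PIL import Image

# Load both generated images as uint8 arrays of identical shape.
a = np.array(Image.open("img_a.png"))
b = np.array(Image.open("img_b.png"))

# Fraction of values that are exactly equal between the two runs.
equal_ratio = (a == b).mean()
print(f"ratio of equal values: {equal_ratio:.6f}")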

ydshieh avatar Oct 23 '22 08:10 ydshieh

dog

ydshieh avatar Oct 23 '22 09:10 ydshieh

Hi @ydshieh, thanks a lot for looking into this! I've done a few more experiments and these are my results:

My workstation
- GPU: 3090, CUDA 11.4
- Platform: Linux-5.4.0-91-generic-x86_64-with-glibc2.31
- Python version: 3.9.12
- PyTorch version (GPU?): 1.12.1+cu113 (True)
- Huggingface_hub version: 0.10.0
- Transformers version: 4.22.2

0.3.0 fp32 0 3-fp32-3090

0.6.0 fp16 (Note different ear) 0 6-fp16-3090

0.6.0 fp32 (Blue hat) 0 6-fp32-3090

Colab T4
- GPU: Tesla T4, CUDA 11.2
- Platform: Linux-5.10.133+-x86_64-with-Ubuntu-18.04-bionic
- Python version: 3.7.15
- PyTorch version (GPU?): 1.12.1+cu113 (True)
- Huggingface_hub version: 0.10.1
- Transformers version: 4.23.1

0.3.0 fp32 0 3-fp32-colab-t4

0.6.0 fp16 0 6-fp16-colab-t4

0.6.0 fp32 0 6-fp32-colab-t4

Colab A100
- GPU: A100 (40 GB) CUDA 11.2
- Platform: Linux-5.10.133+-x86_64-with-Ubuntu-18.04-bionic
- Python version: 3.7.15
- PyTorch version (GPU?): 1.12.1+cu113 (True)
- Huggingface_hub version: 0.10.1
- Transformers version: 4.23.1

0.3.0 fp32 0 3-fp32-colab-a100

0.6.0 fp16 0 6-fp16-colab-a100

0.6.0 fp32 0 6-fp32-colab-a100

As you can see, results differ across versions. Even if we ignore the fp16 tests, results in the A100 Colab are different from those in the T4 environment. The A100 fp32 results are consistent between diffusers 0.3.0 and 0.6.0 (no blue hat), and the T4 fp32 results are also consistent, but different (blue hat). On my computer, 0.3.0 has no blue hat but 0.6.0 does.

Any idea about what might be going on?

pcuenca avatar Oct 24 '22 09:10 pcuenca

@pcuenca No useful insight from my side (at least so far).

In general, I am not sure it really makes sense to try to ensure reproducibility across different GPUs.

Regarding the differences across versions, I think we can take a look at the 4 commits I mentioned above. Note that diffusers has had (a lot of?) optimizations, which are probably the cause of the differences.

IMO, we should try to ensure reproducibility in a fixed environment but across diffusers versions. And if some commit does introduce a difference, it should be clearly recorded, so we can find that information quickly whenever necessary.

ydshieh avatar Oct 24 '22 13:10 ydshieh

Regarding the 3090 GPU: I think at some commit (of the 4), the numerical differences on the 3090 are larger than on the other 2 GPUs, which causes the visual difference.

I believe that on the other 2 GPUs, even though the results look the same between 0.3.0 and 0.6.0, they still have some numerical differences.

It would be nice (but not necessary) if you could get the outputs for all 4 commits I mentioned. I don't mean we would get a clear/definite answer, but it would at least give us some more information.

ydshieh avatar Oct 24 '22 13:10 ydshieh

Thanks a mille @ydshieh for the analysis, that's super insightful!

patrickvonplaten avatar Oct 25 '22 11:10 patrickvonplaten

Small update here:

  • 1.) We now know that we cannot guarantee reproducibility (only loosely "close" reproducibility) because of https://github.com/pytorch/pytorch/issues/87992 => therefore we can never really guarantee that the exact same images are generated across devices
  • 2.) I checked, and I cannot reproduce the difference with this code:
import diffusers
from diffusers import StableDiffusionPipeline, DDIMScheduler
import torch
from IPython.display import display

def run_tests(pipe):
    torch.manual_seed(1000)
    display(pipe("A photo of Barack Obama smiling with a big grin").images[0])
    torch.manual_seed(1000)
    display(pipe("Labrador in the style of Vermeer").images[0])

pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")
pipe = pipe.to("cuda")
run_tests(pipe)

between 0.3.0 and 0.7.0dev using a V100

  • 3.) The aggressive unit tests: https://github.com/huggingface/diffusers/blob/82d56cf192f3a3c52e0708b6c8db4a6d959244dd/tests/models/test_models_unet_2d.py#L414 all pass for 0.3.0. This is good, as it means our UNet is not responsible for the potential regression above.
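
(Not the actual test linked above, just a rough illustration of the idea, assuming a diffusers version where model outputs expose .sample: fixed random inputs, two forward passes, and a check that the UNet itself is deterministic on a given device.)

import torch
from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="unet")
unet = unet.to("cuda").eval()

def unet_forward(seed=0):
    # Deterministic inputs: noise latents and dummy text embeddings from a seeded CPU generator.
    generator = torch.Generator().manual_seed(seed)
    sample = torch.randn(1, 4, 64, 64, generator=generator).to("cuda")
    encoder_hidden_states = torch.randn(1, 77, 768, generator=generator).to("cuda")
    with torch.no_grad():
        return unet(sample, timestep=50, encoder_hidden_states=encoder_hidden_states).sample

out1, out2 = unet_forward(), unet_forward()
print("max abs diff between two identical runs:", (out1 - out2).abs().max().item())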

Overall, this issue now seems much less severe to me than it did originally, and a big part of it is probably simply due to "uncontrollable" randomness.

Next:

  • Add aggressive scheduler tests and check differences between 0.3.0 and 0.7.0dev (rough sketch of the idea below)
  • Add aggressive minimal step pipeline tests and check differences between 0.3.0 and 0.7.0dev
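
For the scheduler tests, something along these lines could work as a version-to-version check (just a rough sketch, not the actual test, and assuming a version where scheduler.step returns an object with .prev_sample): drive the scheduler with a fixed fake model output and record a scalar fingerprint that should stay constant if the scheduler math hasn't changed.

import torch
from diffusers import DDIMScheduler

def scheduler_fingerprint(num_inference_steps=50, seed=0):
    scheduler = DDIMScheduler(beta_start=0.00085, beta_end=0.012, beta_schedule="scaled_linear", num_train_timesteps=1000)
    scheduler.set_timesteps(num_inference_steps)

    generator = torch.Generator().manual_seed(seed)
    sample = torch.randn(1, 4, 64, 64, generator=generator)

    for t in scheduler.timesteps:
        # Stand-in for the UNet: a fixed pseudo "model output" derived from the current sample.
        model_output = 0.1 * sample
        sample = scheduler.step(model_output, t, sample).prev_sample

    # Scalar fingerprint to record and compare across diffusers versions.
    return sample.abs().sum().item()

print(scheduler_fingerprint())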

patrickvonplaten avatar Oct 31 '22 12:10 patrickvonplaten

Just a bit curious:

I checked, and I cannot reproduce the difference with this code:

What kind of difference are you checking/looking at here, @patrickvonplaten?

Well, if you mean there is no visual difference, there would still be numerical differences, as I found in the analysis. I think it would still be a good idea to record when such differences occur across commits (or on a daily basis), so we can track them easily. But it's just a suggestion.

ydshieh avatar Oct 31 '22 13:10 ydshieh

Yes, I meant visually - we simply cannot guarantee exact numerical equivalence across devices and all kinds of versions, IMO.

patrickvonplaten avatar Nov 02 '22 10:11 patrickvonplaten

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Nov 26 '22 15:11 github-actions[bot]

Once the pipeline tests are fully updated, we should also write a doc explaining the problem with reproducibility in diffusion models in general. cc @anton-l

patrickvonplaten avatar Nov 30 '22 12:11 patrickvonplaten

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Dec 24 '22 15:12 github-actions[bot]