Is the use of `torch.manual_seed` in example training code correct?
Describe the bug
https://github.com/huggingface/diffusers/blob/92b6dbba1a25ed27b0bae38b089715132c7e6bcc/examples/train_unconditional.py#L168
This is a minor thing, but I think this should be `torch.Generator().manual_seed(0)`. In my understanding, calling `torch.manual_seed` sets the seed globally and could cause unexpected side effects. I think it's better not to change the global seed in the training loop.
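For illustration, here is a minimal sketch of the difference in plain PyTorch (not the training script's code):

```python
import torch

torch.manual_seed(123)                     # pretend this is the run's global seed

gen = torch.Generator().manual_seed(0)     # independent RNG; the global state is untouched
print(torch.randint(0, 6, (3,), generator=gen))

torch.manual_seed(0)                       # reseeds the *global* RNG as a side effect
print(torch.randint(0, 6, (3,)))           # every later draw from the global RNG is affected
```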
Reproduction
No response
Logs
No response
System Info
diffusers==0.1.3 (current `main` branch `92b6dbba1a` too)
Hey @hysts,
In my experience it's fine to do `generator = torch.manual_seed(0)`.
You are correct that this sets the seed globally, but the current generator state is kept in `generator`. In diffusers we always make sure to allow the user to pass `generator` into any random function, which is then advanced when used - I think this behavior is fine. Could you give an example where it would be unintuitive / bad?
Hi, @patrickvonplaten Thanks for the reply.
In the training loop, noise and timesteps are sampled using the global seed.
https://github.com/huggingface/diffusers/blob/92b6dbba1a25ed27b0bae38b089715132c7e6bcc/examples/train_unconditional.py#L125-L130
So every time this condition
https://github.com/huggingface/diffusers/blob/92b6dbba1a25ed27b0bae38b089715132c7e6bcc/examples/train_unconditional.py#L162
is met, the global seed in the main process is reset to 0 while the global seeds in the other processes are not.
I don't think this really affects much in this training code, but in my opinion it's weird, unintuitive behavior and doesn't feel right to me.
This is a bit of an extreme case, but, for example, in an image classification task, if the seed were reset every epoch, the training images would be fed to the model in the same order every epoch, and each image would also be augmented in the same way as in the previous epoch. I think that should be avoided.
Also, I think the whole point of using a generator is to avoid such side effects in the first place, but if so, using the "global" generator here doesn't make sense.
Once `generator = torch.manual_seed(0)` is called, the result is the same whether you pass the generator or not. `generator = torch.manual_seed(0)` changes the global seed, but `generator = torch.Generator().manual_seed(0)` doesn't. So, I thought passing `generator = torch.Generator().manual_seed(0)` was the original intention here.
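A quick way to check this in plain PyTorch (assuming a recent version, where `torch.manual_seed` returns the default generator; the identity test is the key point):

```python
import torch

g1 = torch.manual_seed(0)
print(g1 is torch.default_generator)             # True: g1 *is* the global RNG
print(torch.randint(0, 6, (3,), generator=g1))
print(torch.randint(0, 6, (3,)))                 # continues the very same stream

g2 = torch.Generator().manual_seed(0)
print(g2 is torch.default_generator)             # False: an independent RNG
```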
Am I misunderstanding something?
Let me add a concrete toy example.
- In the case without val:
```python
import torch

torch.manual_seed(123)

# train loop
for epoch in range(3):
    print(f'--- {epoch=} ---')
    # mini batches
    for i in range(2):
        print(f'train {i}', torch.randint(0, 6, size=(3, )))
    # no val
```
Output:
```
--- epoch=0 ---
train 0 tensor([4, 1, 0])
train 1 tensor([0, 4, 2])
--- epoch=1 ---
train 0 tensor([0, 5, 5])
train 1 tensor([4, 5, 1])
--- epoch=2 ---
train 0 tensor([0, 1, 4])
train 1 tensor([3, 0, 3])
```
- When using `torch.Generator().manual_seed` in val:
```python
torch.manual_seed(123)

# train loop
for epoch in range(3):
    print(f'--- {epoch=} ---')
    # mini batches
    for i in range(2):
        print(f'train {i}', torch.randint(0, 6, size=(3, )))
    # val using torch.Generator().manual_seed
    generator = torch.Generator().manual_seed(0)
    print('val', torch.randint(0, 6, size=(3, ), generator=generator))
```
Output:
```
--- epoch=0 ---
train 0 tensor([4, 1, 0])
train 1 tensor([0, 4, 2])
val tensor([2, 3, 5])
--- epoch=1 ---
train 0 tensor([0, 5, 5])
train 1 tensor([4, 5, 1])
val tensor([2, 3, 5])
--- epoch=2 ---
train 0 tensor([0, 1, 4])
train 1 tensor([3, 0, 3])
val tensor([2, 3, 5])
```
Here, you can see the val tensors are fixed and the train tensors are the same as in the no-val case.
- When using `torch.manual_seed` in val:
```python
torch.manual_seed(123)

# train loop
for epoch in range(3):
    print(f'--- {epoch=} ---')
    # mini batches
    for i in range(2):
        print(f'train {i}', torch.randint(0, 6, size=(3, )))
    # val using torch.manual_seed
    generator = torch.manual_seed(0)
    print('val', torch.randint(0, 6, size=(3, ), generator=generator))
```
Output:
```
--- epoch=0 ---
train 0 tensor([4, 1, 0])
train 1 tensor([0, 4, 2])
val tensor([2, 3, 5])
--- epoch=1 ---
train 0 tensor([0, 1, 3])
train 1 tensor([1, 1, 1])
val tensor([2, 3, 5])
--- epoch=2 ---
train 0 tensor([0, 1, 3])
train 1 tensor([1, 1, 1])
val tensor([2, 3, 5])
```
Here, the val tensors are fixed, but the train tensors differ from the no-val case. Also, you can see that the train tensors in epochs 1 and 2 are the same. (This also makes the initial global seed 123 meaningless except for the first epoch.)
I think the second case is the intended purpose of using generator, but the current implementation is the third one. Does my point make sense to you?
Hey @hysts,
Thanks a lot for the great write-up! This makes sense to me and I think we should change it - @anton-l could you take a look here?
Additionally, we can also set the device explicitly using:
```python
generator = torch.Generator(device=torch_device).manual_seed(0)
```
Let's maybe fix this everywhere.
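For completeness, a device-matched generator could be used roughly like this (a sketch; the device selection and tensor shape are only illustrative):

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# The generator's device has to match the device the sampling op targets.
generator = torch.Generator(device=device).manual_seed(0)
noise = torch.randn((4, 3, 64, 64), generator=generator, device=device)
print(noise.device, noise.shape)
```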
Related to this issue: pipelines do not pass the generator to the step function (see: https://github.com/huggingface/diffusers/blob/bfe37f31592a8fa4780833bf4e7fbe18fa9f866c/src/diffusers/pipelines/ddpm/pipeline_ddpm.py#L61), resulting in different evaluation outputs when the default generator is not reset. While the initial noise will use the passed-in generator, the noise subsequently added in the step function will not.
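To make the point concrete, here is a toy stand-in written in plain PyTorch (not the diffusers scheduler API) showing why the noise drawn inside the step function also needs the generator:

```python
import torch

def toy_step(sample, generator=None):
    # Stand-in for a scheduler step: the variance noise drawn here is the
    # randomness in question. If `generator` is not threaded through, this
    # draw falls back to the global RNG and the output is no longer
    # reproducible from the passed-in generator alone.
    noise = torch.randn(sample.shape, generator=generator)
    return 0.9 * sample + 0.1 * noise

generator = torch.Generator().manual_seed(0)
sample = torch.randn((1, 3, 8, 8), generator=generator)  # initial noise uses the generator
for _ in range(5):
    sample = toy_step(sample, generator=generator)        # reproducible only if passed here too
print(sample.flatten()[:4])
```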
Great point @aweinmann - should be fixed in this PR: https://github.com/huggingface/diffusers/pull/289/files#r959478073 :-)
PR was merged, so closing this one :-)
Hello @patrickvonplaten! It seems that `torch.manual_seed(0)` is still called to set the generator in the unconditional training script.
I noticed this when tweaking the script to resume a training run from a checkpoint: my shuffled DataLoader produced the same mini-batches in the same order from one run to the next.
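For reference, one way to avoid that symptom (a sketch with illustrative names, independent of the training script) is to give the DataLoader its own generator and checkpoint its state alongside the model:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.arange(10).float().unsqueeze(1))

# A dedicated generator drives the shuffling; the global RNG stays untouched.
data_generator = torch.Generator().manual_seed(42)
loader = DataLoader(dataset, batch_size=2, shuffle=True, generator=data_generator)

for epoch in range(2):
    # The generator advances across epochs, so the order differs per epoch.
    print(epoch, [batch[0].flatten().tolist() for batch in loader])

# At checkpoint time, save the generator state next to the model weights;
# restoring it lets a resumed run continue the shuffle stream instead of
# replaying it from the original seed.
rng_state = data_generator.get_state()
# ... later, after loading the checkpoint ...
data_generator.set_state(rng_state)
```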