Is the use of `torch.manual_seed` in example training code correct?
Describe the bug
https://github.com/huggingface/diffusers/blob/92b6dbba1a25ed27b0bae38b089715132c7e6bcc/examples/train_unconditional.py#L168
This is a minor thing, but I think this should be `torch.Generator().manual_seed(0)`. In my understanding, calling `torch.manual_seed` sets the seed globally and could cause unexpected side effects. I think it's better not to change the global seed in the training loop.
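For illustration, here is a minimal sketch of the difference in plain PyTorch (not the training script's code):

```python
import torch

torch.manual_seed(123)                     # pretend this is the run's global seed

gen = torch.Generator().manual_seed(0)     # independent RNG; the global state is untouched
print(torch.randint(0, 6, (3,), generator=gen))

torch.manual_seed(0)                       # reseeds the *global* RNG as a side effect
print(torch.randint(0, 6, (3,)))           # every later draw from the global RNG is affected
```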
Reproduction
No response
Logs
No response
System Info
diffusers==0.1.3 (current `main` branch `92b6dbba1a` too)
Hey @hysts,
In my experience it's fine to do `generator = torch.manual_seed(0)`.
You are correct that this sets the seed globally, but the current generator state is kept in `generator`. In diffusers we always make sure to allow the user to pass `generator` into any random function, which is then advanced when used - I think this behavior is fine. Could you give an example where it would be unintuitive / bad?
Hi, @patrickvonplaten Thanks for the reply.
In the training loop, noise and timesteps are sampled using the global seed.
https://github.com/huggingface/diffusers/blob/92b6dbba1a25ed27b0bae38b089715132c7e6bcc/examples/train_unconditional.py#L125-L130
So every time this condition
https://github.com/huggingface/diffusers/blob/92b6dbba1a25ed27b0bae38b089715132c7e6bcc/examples/train_unconditional.py#L162
is met, the global seed in the main process is reset to 0 while the global seeds in the other processes are not.
I don't think this really affects much in this training code, but in my opinion it's weird, unintuitive behavior and doesn't feel right to me.
This is a bit of an extreme case, but, for example, in an image classification task, if the seed were reset every epoch, the training images would be fed to the model in the same order every epoch, and each image would also be augmented in the same way as in the previous epoch. I think that should be avoided.
Also, I think the whole point of using a generator is to avoid such side effects in the first place, but if so, using the "global" generator here doesn't make sense.
Once `generator = torch.manual_seed(0)` is called, the result is the same whether you pass the generator or not. `generator = torch.manual_seed(0)` changes the global seed, but `generator = torch.Generator().manual_seed(0)` doesn't. So, I thought passing `generator = torch.Generator().manual_seed(0)` was the original intention here.
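A quick way to check this in plain PyTorch (assuming a recent version, where `torch.manual_seed` returns the default generator; the identity test is the key point):

```python
import torch

g1 = torch.manual_seed(0)
print(g1 is torch.default_generator)             # True: g1 *is* the global RNG
print(torch.randint(0, 6, (3,), generator=g1))
print(torch.randint(0, 6, (3,)))                 # continues the very same stream

g2 = torch.Generator().manual_seed(0)
print(g2 is torch.default_generator)             # False: an independent RNG
```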
Am I misunderstanding something?
Let me add a concrete toy example.
- In the case without val:
```python
import torch

torch.manual_seed(123)

# train loop
for epoch in range(3):
    print(f'--- {epoch=} ---')
    # mini batches
    for i in range(2):
        print(f'train {i}', torch.randint(0, 6, size=(3, )))
    # no val
```
Output:
```
--- epoch=0 ---
train 0 tensor([4, 1, 0])
train 1 tensor([0, 4, 2])
--- epoch=1 ---
train 0 tensor([0, 5, 5])
train 1 tensor([4, 5, 1])
--- epoch=2 ---
train 0 tensor([0, 1, 4])
train 1 tensor([3, 0, 3])
```
- When using `torch.Generator().manual_seed` in val:
```python
torch.manual_seed(123)

# train loop
for epoch in range(3):
    print(f'--- {epoch=} ---')
    # mini batches
    for i in range(2):
        print(f'train {i}', torch.randint(0, 6, size=(3, )))
    # val using torch.Generator().manual_seed
    generator = torch.Generator().manual_seed(0)
    print('val', torch.randint(0, 6, size=(3, ), generator=generator))
```
Output:
```
--- epoch=0 ---
train 0 tensor([4, 1, 0])
train 1 tensor([0, 4, 2])
val tensor([2, 3, 5])
--- epoch=1 ---
train 0 tensor([0, 5, 5])
train 1 tensor([4, 5, 1])
val tensor([2, 3, 5])
--- epoch=2 ---
train 0 tensor([0, 1, 4])
train 1 tensor([3, 0, 3])
val tensor([2, 3, 5])
```
Here, you can see the val tensors are fixed and the train tensors are the same as in the no-val case.
- When using `torch.manual_seed` in val:
```python
torch.manual_seed(123)

# train loop
for epoch in range(3):
    print(f'--- {epoch=} ---')
    # mini batches
    for i in range(2):
        print(f'train {i}', torch.randint(0, 6, size=(3, )))
    # val using torch.manual_seed
    generator = torch.manual_seed(0)
    print('val', torch.randint(0, 6, size=(3, ), generator=generator))
```
Output:
```
--- epoch=0 ---
train 0 tensor([4, 1, 0])
train 1 tensor([0, 4, 2])
val tensor([2, 3, 5])
--- epoch=1 ---
train 0 tensor([0, 1, 3])
train 1 tensor([1, 1, 1])
val tensor([2, 3, 5])
--- epoch=2 ---
train 0 tensor([0, 1, 3])
train 1 tensor([1, 1, 1])
val tensor([2, 3, 5])
```
Here, the val tensors are fixed, but the train tensors differ from the no-val case. Also, you can see that the train tensors in epochs 1 and 2 are the same. (This also makes the initial global seed 123 meaningless except for the first epoch.)
I think the second case is the intended purpose of using generator, but the current implementation is the third one. Does my point make sense to you?
Hey @hysts,
Thanks a lot for the great write-up! This makes sense to me and I think we should change it - @anton-l could you take a look here?
Additionally, we can also set the device explicitly using:
```python
generator = torch.Generator(device=torch_device).manual_seed(0)
```
Let's maybe fix this everywhere.
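For completeness, a device-matched generator could be used roughly like this (a sketch; the device selection and tensor shape are only illustrative):

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# The generator's device has to match the device the sampling op targets.
generator = torch.Generator(device=device).manual_seed(0)
noise = torch.randn((4, 3, 64, 64), generator=generator, device=device)
print(noise.device, noise.shape)
```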
Related to this issue: pipelines do not pass the generator to the step function (see: https://github.com/huggingface/diffusers/blob/bfe37f31592a8fa4780833bf4e7fbe18fa9f866c/src/diffusers/pipelines/ddpm/pipeline_ddpm.py#L61), resulting in different evaluation outputs when the default generator is not reset. While the initial noise will use the passed-in generator, the noise subsequently added in the step function will not.
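To make the point concrete, here is a toy stand-in written in plain PyTorch (not the diffusers scheduler API) showing why the noise drawn inside the step function also needs the generator:

```python
import torch

def toy_step(sample, generator=None):
    # Stand-in for a scheduler step: the variance noise drawn here is the
    # randomness in question. If `generator` is not threaded through, this
    # draw falls back to the global RNG and the output is no longer
    # reproducible from the passed-in generator alone.
    noise = torch.randn(sample.shape, generator=generator)
    return 0.9 * sample + 0.1 * noise

generator = torch.Generator().manual_seed(0)
sample = torch.randn((1, 3, 8, 8), generator=generator)  # initial noise uses the generator
for _ in range(5):
    sample = toy_step(sample, generator=generator)        # reproducible only if passed here too
print(sample.flatten()[:4])
```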
Great point @aweinmann - should be fixed in this PR: https://github.com/huggingface/diffusers/pull/289/files#r959478073 :-)
PR was merged, so closing this one :-)
Hello @patrickvonplaten! It seems that `torch.manual_seed(0)` is still called to set the generator in the unconditional training script.
I noticed this when tweaking the script to resume a training run from a checkpoint: my shuffled DataLoader produced the same mini-batches in the same order from one run to the next.
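For reference, one way to avoid that symptom (a sketch with illustrative names, independent of the training script) is to give the DataLoader its own generator and checkpoint its state alongside the model:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.arange(10).float().unsqueeze(1))

# A dedicated generator drives the shuffling; the global RNG stays untouched.
data_generator = torch.Generator().manual_seed(42)
loader = DataLoader(dataset, batch_size=2, shuffle=True, generator=data_generator)

for epoch in range(2):
    # The generator advances across epochs, so the order differs per epoch.
    print(epoch, [batch[0].flatten().tolist() for batch in loader])

# At checkpoint time, save the generator state next to the model weights;
# restoring it lets a resumed run continue the shuffle stream instead of
# replaying it from the original seed.
rng_state = data_generator.get_state()
# ... later, after loading the checkpoint ...
data_generator.set_state(rng_state)
```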