Diffusion_models_from_scratch
Probable Error in Computing Positional Encodings
Hi, thank you for your great repository. I think there is an issue in lines 27-28 of src/blocks/PositionalEncoding.py:
# Sin/Cos transformation for even, odd indices
embeddings[::2] = embeddings[::2].sin()
embeddings[1::2] = embeddings[1::2].cos()
I think it should be:
# Sin/Cos transformation for even, odd indices
embeddings[:, ::2] = embeddings[:, ::2].sin()
embeddings[:, 1::2] = embeddings[:, 1::2].cos()
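For reference, here is a minimal, self-contained sketch of the standard transformer-style sinusoidal encoding with the corrected indexing. This is not the repo's exact code; the function name and the frequency convention are assumptions on my part, but it shows the key point that sin/cos alternate along the embedding dimension (dim 1), not the batch dimension (dim 0):

```python
import torch

def sinusoidal_encoding(t: torch.Tensor, dim: int) -> torch.Tensor:
    """Map a batch of timesteps t (shape (B,)) to embeddings of shape (B, dim)."""
    half = dim // 2
    # Transformer-style frequencies: 1 / 10000^(i/half) for i in [0, half)
    freqs = torch.exp(-torch.log(torch.tensor(10000.0)) * torch.arange(half) / half)
    args = t.float().unsqueeze(1) * freqs  # (B, half)
    emb = torch.zeros(t.shape[0], dim)
    emb[:, 0::2] = args.sin()   # even embedding columns
    emb[:, 1::2] = args.cos()   # odd embedding columns
    return emb

# Each row depends only on its timestep, not on its position in the batch.
print(sinusoidal_encoding(torch.tensor([0, 10, 500]), dim=8))
```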
Nice catch! Looks like I got the indexing wrong here, encoding the batch dimension as the "time" dimension. I suppose that means the position in the diffusion process becomes useless during training. It's interesting to me how the model can learn without the time information. Retraining the models would be too costly at the moment, so I'll add some documentation informing others of this issue.
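To make the off-by-one-dimension concrete, here is a toy illustration (hypothetical tensor, not the repo's code) of what the two slicings actually do to a (batch, dim) tensor:

```python
import torch

# Toy (batch, dim) tensor, purely to illustrate the slicing difference
B, D = 4, 6
embeddings = torch.randn(B, D)

wrong = embeddings.clone()
wrong[::2] = wrong[::2].sin()     # sin over entire rows 0 and 2 (batch elements)
wrong[1::2] = wrong[1::2].cos()   # cos over entire rows 1 and 3

right = embeddings.clone()
right[:, ::2] = right[:, ::2].sin()    # sin over even embedding columns
right[:, 1::2] = right[:, 1::2].cos()  # cos over odd embedding columns

# In `wrong`, whether a value passes through sin or cos depends on the
# sample's index in the batch; the interleaved sin/cos pattern across
# embedding dimensions is lost entirely.
```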
Any progress on this? I also find it hard to convince myself of the function and importance of the timestep embedding in diffusion models, and I am running some tests on this too.
I added a few warnings to mention the issue when running the code: https://github.com/gmongaras/Diffusion_models_from_scratch/blob/f2e76317d70eb565a953f4959f5781a010177318/src/blocks/PositionalEncoding.py#L19
I also changed the code to correctly index the batch and time dimensions: https://github.com/gmongaras/Diffusion_models_from_scratch/blob/f2e76317d70eb565a953f4959f5781a010177318/src/blocks/PositionalEncoding.py#L34
However, since the pretrained models were trained with the incorrect indexing, they still need to use the "batch" positional encodings instead of the correct ones. This just means the positional embeddings are likely useless to the model and treated as noise, since there is no correlation between an image and its time embedding. The model would have had to find a correlation between a sample's position in the batch and the time embedding, which doesn't make much sense 😸
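One way to support both the old checkpoints and the corrected behavior is to gate the buggy path behind a switch. This is just a hypothetical sketch; the `legacy_batch_indexing` flag and the function signature are made up, not the repo's actual API:

```python
import torch

def apply_sin_cos(args: torch.Tensor, legacy_batch_indexing: bool = False) -> torch.Tensor:
    """Apply sin/cos to a precomputed (batch, dim) tensor of timestep*frequency values."""
    emb = args.clone()
    if legacy_batch_indexing:
        # Reproduce the original bug so checkpoints trained with it still work
        emb[::2] = emb[::2].sin()
        emb[1::2] = emb[1::2].cos()
    else:
        # Corrected: alternate sin/cos along the embedding dimension
        emb[:, ::2] = emb[:, ::2].sin()
        emb[:, 1::2] = emb[:, 1::2].cos()
    return emb
```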