Diffusion_models_from_scratch
Probable Error in Computing Positional Encodings
Hi, thank you for your great repository. I think there is an issue in lines 27-28 of src/blocks/PositionalEncoding.py:
# Sin/Cos transformation for even, odd indices
embeddings[::2] = embeddings[::2].sin()
embeddings[1::2] = embeddings[1::2].cos()
I think it should be:
# Sin/Cos transformation for even, odd indices
embeddings[:, ::2] = embeddings[:, ::2].sin()
embeddings[:, 1::2] = embeddings[:, 1::2].cos()
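For reference, here is a minimal, self-contained sketch of the standard transformer-style sinusoidal encoding with the corrected indexing. This is not the repo's exact code; the function name and the frequency convention are assumptions on my part, but it shows the key point that sin/cos alternate along the embedding dimension (dim 1), not the batch dimension (dim 0):

```python
import torch

def sinusoidal_encoding(t: torch.Tensor, dim: int) -> torch.Tensor:
    """Map a batch of timesteps t (shape (B,)) to embeddings of shape (B, dim)."""
    half = dim // 2
    # Transformer-style frequencies: 1 / 10000^(i/half) for i in [0, half)
    freqs = torch.exp(-torch.log(torch.tensor(10000.0)) * torch.arange(half) / half)
    args = t.float().unsqueeze(1) * freqs  # (B, half)
    emb = torch.zeros(t.shape[0], dim)
    emb[:, 0::2] = args.sin()   # even embedding columns
    emb[:, 1::2] = args.cos()   # odd embedding columns
    return emb

# Each row depends only on its timestep, not on its position in the batch.
print(sinusoidal_encoding(torch.tensor([0, 10, 500]), dim=8))
```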
Nice catch! Looks like I got the indexing wrong here, encoding the batch dimension as the "time" dimension. I suppose that means the position in the diffusion process becomes useless during training. It's interesting to me how the model can learn without the time information. Retraining the models would be too costly at the moment, so I'll add some documentation informing others of this issue.
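To make the off-by-one-dimension concrete, here is a toy illustration (hypothetical tensor, not the repo's code) of what the two slicings actually do to a (batch, dim) tensor:

```python
import torch

# Toy (batch, dim) tensor, purely to illustrate the slicing difference
B, D = 4, 6
embeddings = torch.randn(B, D)

wrong = embeddings.clone()
wrong[::2] = wrong[::2].sin()     # sin over entire rows 0 and 2 (batch elements)
wrong[1::2] = wrong[1::2].cos()   # cos over entire rows 1 and 3

right = embeddings.clone()
right[:, ::2] = right[:, ::2].sin()    # sin over even embedding columns
right[:, 1::2] = right[:, 1::2].cos()  # cos over odd embedding columns

# In `wrong`, whether a value passes through sin or cos depends on the
# sample's index in the batch; the interleaved sin/cos pattern across
# embedding dimensions is lost entirely.
```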
Any progress on this? I also find it hard to convince myself of the function and importance of the timestep embedding in diffusion models, and I am running some tests on this too.
I added a few warnings to mention the issue when running the code: https://github.com/gmongaras/Diffusion_models_from_scratch/blob/f2e76317d70eb565a953f4959f5781a010177318/src/blocks/PositionalEncoding.py#L19
I also changed the code to correctly index the batch and time dimensions: https://github.com/gmongaras/Diffusion_models_from_scratch/blob/f2e76317d70eb565a953f4959f5781a010177318/src/blocks/PositionalEncoding.py#L34
However, since the pretrained models were trained with the incorrect indexing, they still need to use the "batch" positional encodings instead of the correct ones. This just means the positional embeddings are likely useless to the model and treated as noise, since there is no correlation between an image and its time embedding. The model would have had to find a correlation between a sample's position in the batch and the time embedding, which doesn't make much sense 😸
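One way to support both the old checkpoints and the corrected behavior is to gate the buggy path behind a switch. This is just a hypothetical sketch; the `legacy_batch_indexing` flag and the function signature are made up, not the repo's actual API:

```python
import torch

def apply_sin_cos(args: torch.Tensor, legacy_batch_indexing: bool = False) -> torch.Tensor:
    """Apply sin/cos to a precomputed (batch, dim) tensor of timestep*frequency values."""
    emb = args.clone()
    if legacy_batch_indexing:
        # Reproduce the original bug so checkpoints trained with it still work
        emb[::2] = emb[::2].sin()
        emb[1::2] = emb[1::2].cos()
    else:
        # Corrected: alternate sin/cos along the embedding dimension
        emb[:, ::2] = emb[:, ::2].sin()
        emb[:, 1::2] = emb[:, 1::2].cos()
    return emb
```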