About CLIP
Thank you for your excellent work. I would like to know where the part about Text Conditioning, that is, CLIP, is reflected in the code.
@7ywx, I want to clarify 😓: while my blog explains how CLIP is important for text conditioning in a full Stable Diffusion model, I didn't actually implement CLIP in this simplified example code.
Here's why: I designed this code to be a smaller-scale demonstration of Stable Diffusion principles using the MNIST dataset. Instead of text prompts, I used the digit labels (0-9) as numerical conditions to guide the image generation.
If you look at the code, you'll see I used an embedding layer (self.cond_embed in the UNet_Transformer class). This layer takes the digit label (like a 3 or a 7) and converts it into a vector. As I think you already understand, this is not the same as how CLIP encodes text: CLIP uses a much more sophisticated process to understand the meaning of words and phrases.
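For reference, here's a minimal sketch of what that label-conditioning looks like in PyTorch. The layer name cond_embed comes from the code; the class name and the dimensions here are just placeholders, not the actual values from the repo:

```python
import torch
import torch.nn as nn

class LabelConditioner(nn.Module):
    """Maps a digit label (0-9) to a conditioning vector, like self.cond_embed."""
    def __init__(self, n_classes=10, embed_dim=64):  # dims are illustrative
        super().__init__()
        self.cond_embed = nn.Embedding(n_classes, embed_dim)

    def forward(self, labels):
        # labels: (batch,) integer tensor of digits, e.g. tensor([3, 7])
        return self.cond_embed(labels)  # -> (batch, embed_dim)

labels = torch.tensor([3, 7])
cond = LabelConditioner()(labels)
print(cond.shape)  # torch.Size([2, 64])
```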
Then, I used these digit embeddings in my attention layers (attn3 and attn4). These layers use the digit embedding as context, which helps the model generate images that match the chosen digit.
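Conceptually, the image features act as queries and the digit embedding acts as the key/value context. The sketch below shows the idea; it is not the exact attn3/attn4 implementation, and the dimensions are illustrative:

```python
import torch
import torch.nn as nn

class CrossAttentionBlock(nn.Module):
    """Image features attend to the digit embedding (the 'context')."""
    def __init__(self, feat_dim=64, embed_dim=64, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(feat_dim, n_heads, kdim=embed_dim,
                                          vdim=embed_dim, batch_first=True)

    def forward(self, x, cond):
        # x:    (batch, seq_len, feat_dim)  flattened spatial features
        # cond: (batch, embed_dim)          digit embedding from cond_embed
        context = cond.unsqueeze(1)         # (batch, 1, embed_dim)
        out, _ = self.attn(query=x, key=context, value=context)
        return x + out                      # residual connection

x = torch.randn(2, 49, 64)                   # e.g. a 7x7 feature map, flattened
cond = torch.randn(2, 64)
print(CrossAttentionBlock()(x, cond).shape)  # torch.Size([2, 49, 64])
```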
So, in short, I demonstrated conditional image generation, but I skipped the full text-to-image part involving CLIP because this example focuses on a simpler, numerically conditioned version of Stable Diffusion. Full text conditioning with CLIP requires much more computational power and a more complex setup, which was out of my reach 😖
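If you wanted to extend this toward real text conditioning, the single digit-embedding vector would be replaced by a sequence of per-token CLIP text embeddings fed into the same cross-attention layers. A rough sketch using Hugging Face's pretrained CLIP text encoder (just to show the idea, not something this repo implements):

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")

prompts = ["a handwritten digit seven"]
tokens = tokenizer(prompts, padding=True, return_tensors="pt")
with torch.no_grad():
    # (batch, n_tokens, 512) -- this token sequence would replace the single
    # digit embedding as the context for the cross-attention layers
    text_context = text_encoder(**tokens).last_hidden_state
print(text_context.shape)
```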