transformers
Implement LlamaGen for Image Generation
Feature request
Add support for LlamaGen, an autoregressive image generation model, to the Transformers library. LlamaGen applies the next-token prediction paradigm of large language models to visual generation.
Paper: https://arxiv.org/abs/2406.06525
Code: https://github.com/FoundationVision/LlamaGen
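To make the paradigm concrete, here is a minimal sketch of next-token prediction over a flattened grid of discrete image tokens. The `next_token_logits_fn` callable is a stand-in for the Llama-style transformer (this is an illustration of the sampling loop, not LlamaGen's actual implementation):

```python
import numpy as np

def generate_image_tokens(next_token_logits_fn, seq_len, vocab_size, rng):
    # Autoregressively sample `seq_len` discrete image tokens, one at a
    # time, exactly as an LLM samples text tokens. The tokens would later
    # be reshaped into a grid and decoded to pixels by the image tokenizer.
    tokens = []
    for _ in range(seq_len):
        logits = next_token_logits_fn(tokens)   # condition on prefix
        probs = np.exp(logits - logits.max())   # numerically stable softmax
        probs /= probs.sum()
        tokens.append(int(rng.choice(vocab_size, p=probs)))
    return tokens
```

A real implementation would run this loop with a KV cache inside `generate()`, but the control flow is the same.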
Key components to implement:
- Image tokenizer
- Autoregressive image generation model (based on Llama architecture)
- Class-conditional and text-conditional image generation
- Classifier-free guidance for sampling
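On the last component, classifier-free guidance at sampling time combines conditional and unconditional logits. A minimal sketch of the standard CFG formula (parameter names are illustrative, not the proposed API):

```python
import numpy as np

def cfg_logits(cond_logits, uncond_logits, scale):
    # Classifier-free guidance: extrapolate from the unconditional logits
    # toward the conditional ones; scale > 1 strengthens conditioning,
    # scale == 1 recovers plain conditional sampling.
    return uncond_logits + scale * (cond_logits - uncond_logits)
```

In practice this means running the model on two batches per step (with and without the class/text condition), which the implementation would need to support efficiently.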
Motivation
LlamaGen demonstrates that vanilla autoregressive models without vision-specific inductive biases can achieve state-of-the-art image generation performance. Implementing it in Transformers would enable easier experimentation and integration with existing language models.
Your contribution
I can help contribute this model, and can provide examples and detailed explanations of the model architecture and training process if needed.