
Implement LlamaGen for Image Generation

Open ighoshsubho opened this issue 1 year ago • 11 comments

Feature request

Add support for LlamaGen, an autoregressive image generation model, to the Transformers library. LlamaGen applies the next-token prediction paradigm of large language models to visual generation.

Paper: https://arxiv.org/abs/2406.06525

Code: https://github.com/FoundationVision/LlamaGen

Key components to implement:

  1. Image tokenizer
  2. Autoregressive image generation model (based on Llama architecture)
  3. Class-conditional and text-conditional image generation
  4. Classifier-free guidance for sampling
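To make item 4 concrete, here is a minimal sketch of classifier-free guidance inside an autoregressive sampling loop. It assumes the commonly used formulation `uncond + scale * (cond - uncond)`; it is not taken from the LlamaGen code or from any Transformers API. The names `logits_fn`, `toy_logits_fn`, `null_class_id`, the codebook size of 8, and the default scale are all hypothetical placeholders for illustration.

```python
import math
import random


def apply_cfg(cond_logits, uncond_logits, cfg_scale):
    """Blend logits as uncond + scale * (cond - uncond) (assumed standard CFG form)."""
    return [u + cfg_scale * (c - u) for c, u in zip(cond_logits, uncond_logits)]


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_image_tokens(logits_fn, class_id, null_class_id, num_tokens,
                        cfg_scale=4.0, seed=0):
    """Autoregressively sample `num_tokens` codebook indices with CFG.

    `logits_fn(condition, tokens)` is a hypothetical stand-in for a model
    forward pass that returns next-token logits over the image codebook.
    """
    rng = random.Random(seed)
    tokens = []
    for _ in range(num_tokens):
        cond = logits_fn(class_id, tokens)         # pass with the real condition
        uncond = logits_fn(null_class_id, tokens)  # pass with the "null" condition
        probs = softmax(apply_cfg(cond, uncond, cfg_scale))
        next_token = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        tokens.append(next_token)
    return tokens


def toy_logits_fn(condition, tokens):
    """Toy model (illustration only): favors the codebook index equal to the
    condition; any condition outside the codebook yields flat logits."""
    codebook_size = 8
    logits = [0.0] * codebook_size
    if 0 <= condition < codebook_size:
        logits[condition] = 2.0
    return logits


if __name__ == "__main__":
    # Class 3 as the condition, index 8 (outside the codebook) as the null class.
    print(sample_image_tokens(toy_logits_fn, class_id=3, null_class_id=8,
                              num_tokens=16))
```

In a real integration the two forward passes would typically be batched together, and the sampled indices would be passed to the image tokenizer's decoder to produce pixels; both steps are omitted here.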

Motivation

LlamaGen demonstrates that vanilla autoregressive models without vision-specific inductive biases can achieve state-of-the-art image generation performance. Implementing it in Transformers would enable easier experimentation and integration with existing language models.

Your contribution

I can help implement this model, and I can provide examples and detailed explanations of the model architecture and training process if needed.

ighoshsubho, Oct 03 '24 05:10