What training setup did you use?
This looks great!
Could you share some information on what setup you used for the training of the transformer model?
- how many GPUs / for how long
- how many steps
- what batch size
It would be helpful to have this information to better understand the cost of training DALL-E models.
Here is mild commentary on this in https://github.com/kakaobrain/minDALL-E/issues/6
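While waiting for the actual numbers, a rough estimate is possible from the model size alone, using the common approximation that training an autoregressive transformer costs roughly C ≈ 6·N·D FLOPs (N = parameter count, D = training tokens). The sketch below is purely illustrative: the tokens-per-sample count, GPU throughput, and utilization are all assumptions, not figures from this repo.

```python
# Back-of-envelope training-cost estimate via C ~= 6 * N * D.
# All concrete numbers below are illustrative assumptions, NOT the
# authors' actual training setup.

def train_flops(params: float, tokens: float) -> float:
    """Approximate total training FLOPs with C ~= 6 * N * D."""
    return 6.0 * params * tokens

def gpu_hours(total_flops: float, peak_flops_per_gpu: float,
              utilization: float = 0.3) -> float:
    """Convert total FLOPs to GPU-hours at an assumed sustained utilization."""
    return total_flops / (peak_flops_per_gpu * utilization) / 3600.0

# Hypothetical inputs: 1.3B params, 14M text-image pairs at ~320 tokens
# each (text + image tokens), A100-class GPUs at 312 TFLOP/s peak,
# 30% sustained utilization, a single epoch over the data.
N = 1.3e9
D = 14e6 * 320
hours = gpu_hours(train_flops(N, D), 312e12)
print(f"~{hours:,.0f} GPU-hours per epoch")
```

Multiplying by the (unknown) number of epochs and dividing by the GPU count would give wall-clock time; the real figures from the authors would of course replace all of this guesswork.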
Hello @SeungyounShin, thanks for testing zero-shot image-to-image translation.
As you mention, an autoregressive text-to-image generation model can perform unseen tasks in a zero-shot manner, even though the training dataset does not include exactly the same types of text-image pairs! However, zero-shot capability improves as the model size and dataset size increase together. Please note that the released minDALL-E is still a much smaller-scale model (1.3B params, 14M text-image pairs) than the original OpenAI implementation (12B params, 250M text-image pairs).
This problem should be alleviated once a larger-scale model is trained on a larger number of training samples, and we also plan to release larger-scale models.