LLM-groundedDiffusion
How to get an image of "a man rides a horse"?
I tried this project. It's amazing and interesting. But now I have a question: it's hard for me to get a good image from the text "a man rides a horse". Can you give me some advice? Thank you!
Some initial attempts (you can improve them by trying more options and seeds):
You may wonder why the man's face looks weird. This is a known artifact of Stable Diffusion on small objects, and fixing it is out of scope for this project. Generating the man with a larger face-to-image proportion may help.
Thank you for your reply!
When two objects do not interact, it is easy to get a perfect image from a layout. But when two objects interact, it may be hard to get a good image from a layout alone. How do you express the action between objects? For example, "a man and a horse" may be easy, "a man rides a horse" may be difficult, and "a man is chasing a horse" may be even more difficult.
Good question! This is why the space allows specifying a prompt for the overall generation. Without it, a default prompt is used and you don't get object interaction (SD will try to guess the interaction, so it could just as well guess a man standing close to a horse at the specified location). With it, you get the object interaction: e.g., with "a man riding a horse" as the overall prompt, SD knows the man is supposed to ride the horse, as shown in the generation above.
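To make the idea concrete, here is a minimal sketch of what such a layout-grounded request could look like. The field names (`overall_prompt`, `objects`, `phrase`, `box`) and the `(x, y, width, height)` box format are assumptions for illustration, not the project's actual API. One practical point it demonstrates: interacting objects like a rider and a horse usually need overlapping boxes.

```python
# Hypothetical layout structure; field names and the (x, y, w, h)
# pixel-box convention are illustrative assumptions, not the real API.
layout = {
    # The overall prompt encodes the interaction between objects;
    # without it, the model only knows identities and locations.
    "overall_prompt": "A man riding a horse",
    "objects": [
        {"phrase": "a man", "box": (180, 60, 150, 220)},
        {"phrase": "a horse", "box": (100, 180, 320, 260)},
    ],
}

def boxes_overlap(a, b):
    """Return True if two (x, y, w, h) boxes intersect. Interacting
    objects (e.g., rider and horse) typically need overlapping boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah

man_box, horse_box = (obj["box"] for obj in layout["objects"])
print(boxes_overlap(man_box, horse_box))  # True: the boxes intersect
```

If the boxes were disjoint, the model would have little chance of rendering the riding pose no matter what the overall prompt says, so checking overlap is a cheap sanity test before generation.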
However, adding more fine-grained control over object interactions is a very useful future direction. This paper proposes the idea of "text -> intermediate representation -> image", and you are encouraged to extend it to richer representations (e.g., a scene graph, or an LLM-generated SVG that captures more information).
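As one hedged sketch of what a richer intermediate representation might look like, the layout could be extended to a small scene graph whose edges carry the relation between objects. Everything below (the `nodes`/`edges` schema and the `to_overall_prompt` helper) is hypothetical, just to illustrate the "text -> intermediate representation -> image" idea:

```python
# Hypothetical scene-graph intermediate representation: nodes are
# objects with layout boxes, edges carry relations ("riding", "chasing").
scene_graph = {
    "nodes": {
        "man": {"phrase": "a man", "box": (180, 60, 150, 220)},
        "horse": {"phrase": "a horse", "box": (100, 180, 320, 260)},
    },
    "edges": [("man", "riding", "horse")],
}

def to_overall_prompt(graph):
    """Flatten the relation edges back into a text prompt that a
    diffusion model could consume alongside the per-object layout."""
    parts = [
        f"{graph['nodes'][subj]['phrase']} {rel} {graph['nodes'][obj]['phrase']}"
        for subj, rel, obj in graph["edges"]
    ]
    return "; ".join(parts)

print(to_overall_prompt(scene_graph))  # a man riding a horse
```

The appeal of a graph over a flat prompt is that relations stay machine-checkable: you can verify that every edge's endpoints exist and that related objects' boxes are placed plausibly before spending a diffusion run on them.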
Examples:
Same config, overall prompt: "A man standing nearby a horse" (I didn't play with the hyperparameters)
Same config, overall prompt: "A man riding a horse"