OneTrainer
[Feat]: Caption/tags enhancement with multimodal LLMs
Describe your use-case.
There are multiple simple models used in this repository: BLIP, CLIP and the WD taggers. However, when it comes to detailed descriptions, they are all dwarfed by modern multimodal LLMs such as LLaVA-like models, CogVLM or InternLM-XComposer2. The latter currently has the coolest capabilities, as it accepts images at up to 4K resolution and can caption extremely fine details.
On top of that, unlike the models already in the repo, these can take text input alongside the images, so it is possible to enhance pre-existing captions or tags.
As shown by the PixArt series of models, especially PixArt-Sigma, well-captioned images noticeably improve training results. However, this mainly applies to models built on LLM embeddings (using T5 or other LLMs, with context length > 300), since models such as CLIP have very limited context length, resolution, embedding layer size and pretraining data to benefit from detailed captions (so not much impact for SD1.5 or SDXL).
What would you like to see as a solution?
- Add an OpenAI-API/Ollama-compatible calling mechanism to the (batch) captioning section (a minimal sketch follows this list).
- Add a fully customizable prompt template, with the ability to insert the pre-existing captions or Danbooru tags into the prompt and to control where the image tokens go (second sketch below).
- Add the ability to insert a jailbreak prefill at the start of the answer, such as `Sure! Here is the description: "`, to game aligned models (local ones, but which picked the alignment up from datasets such as ShareGPT-4V).
- Add parsing of the generated descriptions, maybe with regex (both covered in the third sketch below).
- AlignProp-style RL dataset generation using the MLLM's preference among multiple suggested images (last sketch below).
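
To make the first item concrete, here is a minimal sketch of calling an OpenAI-compatible endpoint (for example a local Ollama or vLLM server) with one image. The base URL, model name and prompt are assumptions, not existing OneTrainer code:

```python
import base64
from openai import OpenAI

# Assumed local endpoint; Ollama exposes an OpenAI-compatible API under /v1.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="none")

def caption_image(path: str, prompt: str, model: str = "llava") -> str:
    # Encode the image as a base64 data URL, the format vision-capable
    # OpenAI-compatible endpoints accept in an "image_url" content part.
    with open(path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model=model,  # assumed model name
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```

Batch captioning would just loop this over the dataset folder, ideally with a concurrency limit so a local server is not overloaded.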
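
For the prompt template, a simple placeholder scheme would already be enough; the placeholder names below ({image}, {caption}, {tags}) are only illustrative:

```python
# User-editable template; this could be exposed as a plain text field.
TEMPLATE = (
    "{image}\n"
    'The image currently has this caption: "{caption}"\n'
    "and these Danbooru tags: {tags}\n"
    "Rewrite the caption as one detailed paragraph, keeping every tag accurate."
)

def build_prompt(caption: str, tags: list[str], image_token: str = "<image>") -> str:
    # Where {image} sits in the template decides where the image tokens go.
    return TEMPLATE.format(image=image_token, caption=caption, tags=", ".join(tags))
```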
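
Prefilling and parsing could look like the sketch below. It assumes the request ends with an assistant message containing the prefill and that the backend continues from it (several local servers do, but it is not guaranteed everywhere), so the completion has to be re-joined with the prefill before extracting the quoted caption:

```python
import re

PREFILL = 'Sure! Here is the description: "'

def extract_caption(completion: str) -> str:
    # The model continues the prefill, so rebuild the full answer first.
    full = PREFILL + completion
    # Grab everything between the opening quote of the prefill and the final
    # closing quote; fall back to the raw completion if no quote closes.
    match = re.search(r'"(.*)"\s*$', full, re.DOTALL)
    if match:
        return match.group(1).strip()
    return completion.strip().rstrip('"').strip()
```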
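
For the AlignProp idea, the MLLM could be asked to rank candidate images generated from the same prompt. The ranking prompt and the numeric answer format below are assumptions, and `client` is the OpenAI-compatible client from the first sketch:

```python
import re

def pick_best(client, model: str, prompt: str, image_b64s: list[str]) -> int:
    # One text part asking for a choice, followed by all candidate images.
    content = [{"type": "text",
                "text": f'Which image best matches "{prompt}"? Answer with its number only.'}]
    for b64 in image_b64s:
        content.append({"type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{b64}"}})
    reply = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": content}]
    ).choices[0].message.content
    digits = re.findall(r"\d+", reply)
    # Return a 0-based index into image_b64s; default to the first image.
    return int(digits[0]) - 1 if digits else 0
```

The winning/losing pairs from such comparisons could then feed an AlignProp-style preference dataset.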
Have you considered alternatives? List them here.
No response