OneTrainer icon indicating copy to clipboard operation
OneTrainer copied to clipboard

[Feat]: Caption/tags enhancement with multimodal LLMs

Open kabachuha opened this issue 9 months ago • 5 comments

Describe your use-case.

There are multiple simple models used in this repository: Blip, Clip and WD-taggers. However, when it comes to detailed description, they are all dwarfed by modern multimodal LLMs such as LLaVA-like, CogVLM or InternLM-XComposer2 models. The latter has the coolest capabilities as of now, as it allows feeding in up to 4K resolution images, captioning extremely fine-details.

On top of that, unlike existing repo models, these ones can receive text input beside the images, so it is possible to enhance the preexisting captions or tags

As shown by PixArt-series models, especially PixArt-Sigma, well captioned images. However, it applies mainly to LLM-embeddings based models (using T5 or other LLMs, with context > 300) as the models such as CLIP have very limited context length, resolution, embedding layer size and pretrain data to make good captions. (so not much good impact for SD1.5 or SDXL)

What would you like to see as a solution?

  • Add openai-api/ollama compatible calling mechanism to (batch) captioning section.
  • Add fully customizable prompt template with ability to insert the pre-available captions or Danbooru tags inside the prompt and where to put the image tokens
  • Add ability to insert jailbreaks in the start of the answer, such as Sure! Here is the description: " to game aligned models (local, but who got it from datasets such as ShareGPT-4V)
  • Add parsing of the generated descriptions, maybe with regex
  • AlignProp RL dataset generation with MLLM's preference out of multiple suggested images

Have you considered alternatives? List them here.

No response

kabachuha avatar May 22 '24 14:05 kabachuha