
Nice.

Open arthurwolf opened this issue 4 months ago • 5 comments

Really cool project.

I'm working on something similar (structurally at least): a manga-to-anime pipeline. It involves many different steps/models, similar to this project:

  • Pre-processing (alignment, upscaling, coloring).
  • Separating pages into panels.
  • Ordering the panels into the correct reading order (this took far more effort than I expected...).
  • Segmentation (using segment-anything).
  • Extracting bubbles, bubble tails/their vectors, faces, bodies, and backgrounds. Most of that required training custom models.
  • Assigning a character identity to each face/body.
  • Making a naive association between faces and bubbles.
  • Reading the text of the bubbles.
  • Feeding all that data to GPT-4V and asking it to "read" each panel: I tell it what happened in previous panels, which bubble is associated with which face, and so on, and ask it to "understand" what is happening in the panel and to "deduce" associations between the items, the tone of voice, etc. I tried "just" asking GPT-4V to read manga pages without all the steps above, and it was terrible at it. With all the provided info (which easily produces 10k-token prompts, just for the text), it gets much better. It's a sort of "pre-chewing" of the work for it.
  • That's where I am now. The next step is generating voice (what I'm working on currently, with bark/whisper/other models) and sound effects, then generating animation and special effects, and finally assembling all of that into video.
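To make one of those steps concrete, here is a minimal, self-contained sketch of the naive bubble-to-face association: each speech bubble is assigned to the face whose bounding-box centroid is closest. The box format and the distance heuristic are assumptions for illustration, not the author's actual code (a real tail/vector-based association would be more accurate).

```python
import math

def centroid(box):
    # box = (x1, y1, x2, y2) in pixel coordinates
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2)

def associate_bubbles(bubbles, faces):
    """Naively pair each bubble with the nearest face by centroid
    distance. Returns a list of (bubble_index, face_index) pairs."""
    pairs = []
    for bi, bubble in enumerate(bubbles):
        bx, by = centroid(bubble)
        best = min(
            range(len(faces)),
            key=lambda fi: math.dist((bx, by), centroid(faces[fi])),
        )
        pairs.append((bi, best))
    return pairs

# Two bubbles at the top of a panel, one face below each:
bubbles = [(0, 0, 10, 10), (90, 0, 100, 10)]
faces = [(0, 20, 10, 30), (90, 20, 100, 30)]
print(associate_bubbles(bubbles, faces))  # [(0, 0), (1, 1)]
```

This breaks down when a speaker is off-panel or two characters overlap, which is presumably why the result is only a "naive" prior handed to GPT-4V rather than a final answer.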

I'll be looking more closely into your project, in particular how it's organized. Thanks a lot for sharing. I'd be curious whether you have any insights on how you'd approach manga reading if you had to.

Cheers!

[attachments: masked, panel, prompt.json, prompt.txt, reading.json, response.txt, result.json, 6253, 6254]

arthurwolf avatar Mar 05 '24 04:03 arthurwolf

[attached images: page-3461-ids, page-3462-ids, page-3463-ids, page-3464-ids, page-3465-ids]

arthurwolf avatar Mar 05 '24 04:03 arthurwolf

https://github.com/noco-ai/spellbook-docker/assets/108821/194a7878-4f30-4744-a9a8-f69f4a4f9591


arthurwolf avatar Mar 05 '24 04:03 arthurwolf

Hello! Your approach looks good to me, and it sounds like your hard work is paying off. If I were working on this particular project, I would experiment with fine-tuning llava once you have a solid dataset, to see if it gives better results than OpenAI's models. I have yet to see anyone share a fine-tune of llava for a specific task, so I'm curious how well it would work. If you are posting your progress on your project anywhere, please share the link, as I'm interested to see it in action once you have it all working.
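For anyone exploring that route: LLaVA's instruction-tuning data uses a simple conversation-format JSON, so panel annotations like the ones described above could be converted into training examples roughly like this. The file path, character name, and answer text are invented for illustration, and the field names should be checked against the specific LLaVA release being fine-tuned.

```python
import json

# One training example in the conversation format used by LLaVA's
# public instruction-tuning data: a list of records, each with an
# "id", an "image" path, and "human"/"gpt" conversation turns,
# where "<image>" marks where the image is injected into the prompt.
example = {
    "id": "page-3461-panel-2",
    "image": "pages/page-3461/panel-2.png",  # hypothetical path
    "conversations": [
        {
            "from": "human",
            "value": "<image>\nBubble 1 is nearest to face A. "
                     "Describe what happens in this panel.",
        },
        {
            "from": "gpt",
            # Target text, e.g. from the GPT-4V "reading" step:
            "value": "The character refuses the shopkeeper's offer.",
        },
    ],
}

with open("manga_finetune.json", "w") as f:
    json.dump([example], f, indent=2)
```

Packing the pre-computed bubble/face associations into the human turn, as sketched here, mirrors the "pre-chewing" strategy that already works with GPT-4V.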

noco-ai avatar Mar 10 '24 19:03 noco-ai

Thanks for the feedback.

I'll soon have about two comic books' worth of data, which I think would be enough to start fine-tuning llava, but I have two issues there: 1. this is all very new, and there are no "easy guides" to fine-tuning most things, llava even less; it's all very cryptic and assumes a high technical level; and 2. my assumption is that even fine-tuning llava would require far more compute than I can afford.

I've tried an alternative: getting my data into the llava training dataset for the next llava version. I've opened GitHub issues and sent some emails, but so far no answer. I hope I can make it happen; I think it would benefit not just me but the model itself.

As for posting progress, I'm considering starting a YouTube channel with updates; I'll post about it here if/when that happens.

Cheers, and thanks again.

arthurwolf avatar Mar 10 '24 19:03 arthurwolf