kumiko
Contact.
Hey.
I found your repo, and I'm working on something very similar (also working on bubbles, panel order/flow, face detection, character identification, associating characters with bubbles, etc.). I wanted to reach out because I'm curious why you were working on this, whether you're still working on it, and whether you'd like to cooperate or discuss possible solutions.
Apologies if I'm disturbing you.
Cheers.
Hi @arthurwolf, thanks for reaching out, it's a pleasure.
Seems like you're working a lot with LLMs. I'm not familiar with those, but I'm sure they can provide good (better?) results than a simple approach like the one I use in Kumiko (image contour detection).
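For anyone curious, contour-based detection boils down to something like this (a rough sketch, not Kumiko's actual code; the threshold and minimum-area cutoff are assumptions):

```python
# A rough sketch of contour-based panel detection (not Kumiko's actual code).
# Threshold the page so white gutters become background, then keep the large
# outer contours as panel candidates.
import cv2

page = cv2.imread("page.png", cv2.IMREAD_GRAYSCALE)
_, binary = cv2.threshold(page, 220, 255, cv2.THRESH_BINARY_INV)
contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL,
                               cv2.CHAIN_APPROX_SIMPLE)

# Keep only contours big enough to plausibly be panels (the 2% of page area
# cutoff is an arbitrary value for this sketch).
page_area = page.shape[0] * page.shape[1]
panels = [cv2.boundingRect(c) for c in contours
          if cv2.contourArea(c) > 0.02 * page_area]
```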
Is there anything public I can check out and try? I've seen the yolov project, which seems to focus on face detection. You also mentioned panel and speech bubble recognition, which are more my concern¹. Any project detecting those?
¹ Back in 2018 I wanted to read comics on my small-screen phone, and was surprised that nobody had come up with a solution to zoom in on panels easily, or even a way to read a comic panel after panel (zoomed in) instead of page after page (zoomed out). This was the original reason I tried out image processing and published the first versions of Kumiko. It's still basically the same reason, although I still prefer to read paper comics and Kumiko has become more of a hobby. :)
I've been working again on it lately, and would love to compare approaches!
So, what I ended up doing is:
- Segment everything in the page with "segment-anything" from Meta research (see the sketch after this list). From that I get a list of many (often overlapping) polygons for "elements" in the page: some are panels, some are bubbles, some are just an arm or a face or a teapot.
- Create a dataset of "good" and "bad" polygons (overlaid on the original image) for a given object type (panel, bubble, face, bubble tail), then train a model to recognize which are good and which are bad. The model I train is super basic, pretty much just the image-classification example from the TensorFlow documentation.
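For concreteness, the segmentation step looks roughly like this (a sketch using the segment-anything package; the checkpoint name and file paths are assumptions):

```python
# Sketch of the segmentation step with Meta's segment-anything (SAM).
# Checkpoint name and file paths are assumptions, not the real pipeline.
import cv2
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
mask_generator = SamAutomaticMaskGenerator(sam)

page = cv2.cvtColor(cv2.imread("page.png"), cv2.COLOR_BGR2RGB)
masks = mask_generator.generate(page)  # many, often overlapping, candidates

polygons = []
for m in masks:
    # m["segmentation"] is a boolean HxW mask; turn it into contour polygons
    contours, _ = cv2.findContours(m["segmentation"].astype("uint8"),
                                   cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    polygons.extend(contours)
```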
So that's training. Then during inference, I first segment everything, then detect which polygons score best as panels (and I do the same for the other types).
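A minimal version of that classifier, in the spirit of the TensorFlow image-classification tutorial (the layer sizes, image size, and directory layout are assumptions):

```python
# Minimal "is this polygon a real panel?" classifier, in the spirit of the
# TensorFlow image-classification tutorial. Sizes and paths are assumptions.
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Rescaling(1.0 / 255, input_shape=(256, 256, 3)),
    layers.Conv2D(16, 3, activation="relu"), layers.MaxPooling2D(),
    layers.Conv2D(32, 3, activation="relu"), layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"), layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(1, activation="sigmoid"),  # P(candidate is a good panel)
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])

# Training data: pages with one candidate polygon overlaid each, sorted into
# good/ and bad/ subdirectories of panel_overlays/ (assumed layout).
train_ds = tf.keras.utils.image_dataset_from_directory(
    "panel_overlays", image_size=(256, 256), batch_size=32,
    label_mode="binary")
model.fit(train_ds, epochs=10)

# Inference: render each SAM candidate as an overlay, score it, keep the best.
def score(overlay_rgb):
    x = tf.image.resize(overlay_rgb, (256, 256))[tf.newaxis, ...]
    return float(model(x, training=False)[0, 0])
```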
For panels specifically (not bubbles etc.), I also "enrich" the technique a bit: I use https://github.com/hanish3464/WORD-pytorch (the best panel segmenter I've found so far) to also segment the page into polygons (those polygons are panel candidates only), and I add those candidate polygons to the ones from segment-anything. I find doing this improves the final success rate a little.
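Merging the two candidate pools can be as simple as deduplicating by polygon IoU (a sketch; the shapely dependency and the 0.9 threshold are assumptions, not the actual pipeline):

```python
# Sketch: add WORD-pytorch panel candidates to the SAM pool, skipping any
# that near-duplicate an existing polygon.
from shapely.geometry import Polygon

def iou(a: Polygon, b: Polygon) -> float:
    union = a.union(b).area
    return a.intersection(b).area / union if union else 0.0

def merge_candidates(sam_polys, word_polys, iou_thresh=0.9):
    merged = list(sam_polys)
    for cand in word_polys:
        if all(iou(cand, p) < iou_thresh for p in merged):
            merged.append(cand)  # genuinely new candidate, keep it
    return merged
```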
My dataset is currently made of around 300 manga pages, and the system is working pretty well. I expect it will work even better as the dataset grows. Unfortunately the manga I'm working with right now are copyrighted, so I can't share anything, but I plan on working with Creative Commons data at some point and releasing some weights.
Another thing I had a lot of trouble with is the "reading order" of the panels. I ended up writing my own technique, which is better than the best I found on GitHub but still not perfect. I know some ways I can improve it; it's just a lot of work.
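For comparison, the usual baseline (not my technique, just the obvious starting point) clusters panel boxes into rows by vertical overlap, then reads rows top-to-bottom and each row right-to-left for manga:

```python
# A common baseline for manga reading order (not the custom technique above):
# cluster panel boxes into rows by vertical overlap, sort rows top-to-bottom,
# and sort within each row right-to-left.
def reading_order(boxes):
    """boxes: list of (x, y, w, h) panel bounding boxes."""
    rows = []
    for box in sorted(boxes, key=lambda b: b[1]):        # top-to-bottom
        for row in rows:
            ry0 = min(b[1] for b in row)
            ry1 = max(b[1] + b[3] for b in row)
            y0, y1 = box[1], box[1] + box[3]
            overlap = min(ry1, y1) - max(ry0, y0)
            if overlap > 0.5 * min(ry1 - ry0, y1 - y0):  # same row
                row.append(box)
                break
        else:
            rows.append([box])                           # start a new row
    ordered = []
    for row in rows:
        # right-to-left within a row, using each box's right edge
        ordered.extend(sorted(row, key=lambda b: b[0] + b[2], reverse=True))
    return ordered
```

This breaks on irregular layouts (overlapping or diagonal panels), which is presumably where a custom technique earns its keep.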
All in all, I'm making good progress, and for the first time I'm starting to generate short animations based on manga panels. The final goal is to generate "pseudo" anime... I'll probably never get to a good result, but it's very fun learning a lot of things along the way.
Thanks again for the amazing project.
Cheers.