
Feature Request: New Text Detection Model Support

Open RoadToNowhereX opened this issue 8 months ago • 25 comments

Version Info

Python version: 3.10.11 (tags/v3.10.11:7d4cc5a, Apr 5 2023, 00:38:17) [MSC v.1929 64 bit (AMD64)]
Python executable: D:\GalTranslation\BallonsTranslator_dev_src_with_gitpython_20241124\ballontrans_pylibs_win\python.exe
Version: 1.4.0
Branch: dev
Commit hash: a515a420e3334680c17ca7c6f800e9df2bf513c4

Type of Request

New Feature

Description

Model: https://huggingface.co/ogkalu/comic-text-and-bubble-detector — a Transformers model fine-tuned from RT-DETR-v2 r50vd, with CPU/GPU inference. It performs much better at text detection in complex scenes than the ctd and ysgyolo models integrated into BallonsTranslator.

Detection examples below, all using default settings:

comic-text-and-bubble-detector: Image

comic-text-detector: Image

Pictures

No response

Additional Information

No response

RoadToNowhereX avatar Apr 22 '25 09:04 RoadToNowhereX

The point is that ctd usually mixes sentences from different bubbles, while ogkalu/comic-text-and-bubble-detector hardly ever does.

RoadToNowhereX avatar Apr 22 '25 09:04 RoadToNowhereX

Sample code from PekingU/rtdetr_r50vd:

import torch
import requests

from PIL import Image
from transformers import RTDetrForObjectDetection, RTDetrImageProcessor

# Any test image works; this is the COCO sample from the model card.
url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)

image_processor = RTDetrImageProcessor.from_pretrained("ogkalu/comic-text-and-bubble-detector")
model = RTDetrForObjectDetection.from_pretrained("ogkalu/comic-text-and-bubble-detector")

inputs = image_processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# target_sizes expects (height, width); PIL's image.size is (width, height).
results = image_processor.post_process_object_detection(
    outputs, target_sizes=torch.tensor([image.size[::-1]]), threshold=0.3
)

for result in results:
    for score, label_id, box in zip(result["scores"], result["labels"], result["boxes"]):
        score, label = score.item(), label_id.item()
        box = [round(i, 2) for i in box.tolist()]
        print(f"{model.config.id2label[label]}: {score:.2f} {box}")

The result should be a list containing one dict, for example:

[
   {
      'scores': torch.tensor([score1, score2, ...]),
      'labels': torch.tensor([int1, int2, int3, ...]),  # 0 is 'bubble', 1 is 'text_bubble', 2 is 'text_free'
      'boxes': torch.tensor([[xmin1, ymin1, xmax1, ymax1],
                             [xmin2, ymin2, xmax2, ymax2],
                             ...])
   }
]

For example, an actual output:

[
    {
        'scores': torch.tensor([0.9821, 0.9801, 0.9799, 0.9738, 0.9730, 0.9723, 0.9722, 
                                0.9721, 0.9708, 0.9704, 0.9699, 0.9682, 0.9673, 0.9599, 
                                0.9558, 0.9532, 0.9528, 0.9512, 0.9483, 0.9421, 0.8599]),
        'labels': torch.tensor([0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 2]),
        'boxes': torch.tensor([[-4.5021e-01,  6.9262e+02,  3.7822e+02,  9.2742e+02],
                               [ 4.7005e+02,  3.7906e+01,  7.5990e+02,  1.8624e+02],
                               [ 9.0669e+02,  5.0702e+02,  1.3555e+03,  7.2286e+02],
                               [ 1.0446e+03,  8.3269e+01,  1.3330e+03,  2.1791e+02],
                               [ 9.3973e+02,  3.5836e+02,  1.3344e+03,  5.1491e+02],
                               [ 3.3274e+02,  5.0518e+02,  6.1163e+02,  6.6103e+02],
                               [ 9.4728e+02,  5.3617e+02,  1.3300e+03,  7.0094e+02], # label 1
                               [ 9.9570e+01,  3.5952e+02,  4.9691e+02,  5.5726e+02],
                               [ 1.1040e+02,  3.7685e+02,  4.7161e+02,  5.1155e+02], # label 1
                               [ 9.7417e+02,  3.7178e+02,  1.3234e+03,  4.8909e+02], # label 1
                               [ 6.7811e+02,  9.7702e+02,  9.3171e+02,  1.1245e+03],
                               [ 6.8589e+02,  1.5547e+02,  9.1349e+02,  2.8003e+02],
                               [ 5.3744e+01,  7.1619e+02,  3.3000e+02,  9.0388e+02], # label 1
                               [ 7.4587e+02,  1.1171e+03,  9.5668e+02,  1.2425e+03],
                               [ 1.0777e+03,  1.0008e+02,  1.3193e+03,  1.8794e+02], # label 1
                               [ 4.9696e+02,  6.4532e+01,  7.2389e+02,  1.5935e+02], # label 1
                               [ 7.2174e+02,  9.8582e+02,  8.8026e+02,  1.1155e+03], # label 1
                               [ 7.2020e+02,  1.7782e+02,  8.8041e+02,  2.6326e+02], # label 1
                               [ 3.8203e+02,  5.4301e+02,  5.6116e+02,  6.4346e+02], # label 1
                               [ 7.7961e+02,  1.1316e+03,  9.3794e+02,  1.2208e+03], # label 1
                               [ 5.6984e+02,  4.3259e+02,  8.8979e+02,  5.8168e+02]])# label 2
    }
]
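Downstream code would typically split these detections by class. A minimal plain-Python sketch of that step (the tensors above convert with `.tolist()`; `group_by_label` is an illustrative helper, not part of any library):

```python
# Sketch: group post-processed detections by class, assuming the id2label
# mapping described above: 0 -> 'bubble', 1 -> 'text_bubble', 2 -> 'text_free'.
ID2LABEL = {0: "bubble", 1: "text_bubble", 2: "text_free"}

def group_by_label(scores, labels, boxes):
    """Return {label_name: [(score, box), ...]} from flat detection lists."""
    grouped = {name: [] for name in ID2LABEL.values()}
    for score, label_id, box in zip(scores, labels, boxes):
        grouped[ID2LABEL[label_id]].append((score, box))
    return grouped

# Toy example with three detections, one per class:
dets = group_by_label(
    scores=[0.98, 0.97, 0.86],
    labels=[0, 1, 2],
    boxes=[[0, 0, 100, 50], [10, 5, 90, 45], [120, 0, 200, 40]],
)
print(len(dets["bubble"]), len(dets["text_bubble"]), len(dets["text_free"]))  # → 1 1 1
```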

Inference notebook: https://github.com/qubvel/transformers-notebooks/blob/main/notebooks/RT_DETR_v2_inference.ipynb
Fine-tuning notebook: https://github.com/qubvel/transformers-notebooks/blob/main/notebooks/RT_DETR_v2_finetune_on_a_custom_dataset.ipynb

RoadToNowhereX avatar Apr 22 '25 10:04 RoadToNowhereX

I can embed it as a text detector.

bropines avatar Apr 22 '25 12:04 bropines

https://github.com/dmMaze/BallonsTranslator/issues/863#issuecomment-2817111839

https://github.com/user-attachments/assets/128faac1-566a-4e4e-8433-3d9852a6a9a2

"Try the latest model provided here. It was trained on 220,000 images and should still be able to handle the image you sent. If the recognition isn't good, feel free to send it to me, and we can optimize the model later. I've been tired lately, so I haven't continued collecting data. The next training might be when I reach 250,000 or 300,000 images. But for now, it's still sufficient."

lhj5426 avatar Apr 22 '25 13:04 lhj5426

https://github.com/user-attachments/assets/27b002f2-0651-4d2c-895f-f3f4c6392e34

"I found the original comic by using image search. I think the margin of error is within an acceptable range."

lhj5426 avatar Apr 22 '25 13:04 lhj5426

I tried the link you provided. It is indeed much stronger than the original ctd, but missed text and mixed-up text boxes still happen from time to time.

RoadToNowhereX avatar Apr 22 '25 13:04 RoadToNowhereX

Bro, if you still have the energy, you could try training PekingU/rtdetr_v2_r50vd, or Paddle's PP-OCRv4_server_det. I haven't tried the RT-DETR pretrained model, but Paddle's pretrained models are very strong and usable out of the box; their training and inference frameworks are just a bit more cumbersome.

RoadToNowhereX avatar Apr 22 '25 13:04 RoadToNowhereX

Image Haha, I can't reach 100%, but it's already far stronger than CTD. Processing a 1000-page manga is now incomparably easier than with CTD. I'll revisit it when I have time; I've been grinding for 9 months and it's too exhausting and tedious. The data can be converted directly for those two models, but I can't be bothered to research them — I'd have to learn deployment from scratch.

Besides, there isn't much horizontal-text data to begin with. Out of 220,000 images, maybe 10,000-20,000 at most? It's mostly bubble data and vertical-strip data.

lhj5426 avatar Apr 22 '25 13:04 lhj5426

Looking forward to your good news, haha. First let's wait for bropines to add your model.

lhj5426 avatar Apr 22 '25 13:04 lhj5426

Image It 404s.

lhj5426 avatar Apr 22 '25 14:04 lhj5426

I accidentally typed an extra character; this one should be correct: PekingU/rtdetr_v2_r50vd

RoadToNowhereX avatar Apr 22 '25 14:04 RoadToNowhereX

It's not covered here; for deployment you still have to check Bilibili.

lhj5426 avatar Apr 22 '25 14:04 lhj5426

~~I'm just thinking ... Does it make sense? I just haven't seen examples of INLINE text detection. So, yes, it will detect text bubbles, and INLINE text will be skipped?~~

UPD. I checked it out. It doesn't seem bad. I will try to add it within two days. I'm stuck with my studies, I'm sorry.

bropines avatar Apr 22 '25 16:04 bropines

By the way, your latest detector has a random abnormal bug where it detects the same block of text twice. I suppose this can be corrected via the parameters, but which ones... I haven't figured out yet.

bropines avatar Apr 22 '25 16:04 bropines

I only know how to train models; the rest I don't understand. I learned by following tutorials on video websites. I have no technical skills, just a passion for manga and the drive to take action. I know that by adjusting the NMS threshold (0.75) and confidence threshold (0.65), the recognition results can vary slightly. I often encounter the issue where one text box is detected as two or three rectangles. From the perspective of reading manga, it's not really an issue unless you look closely. However, from an object-detection standpoint, this is a big problem. But since my dataset is too small to compete with the massive datasets used by large companies, I have to make do with it.

lhj5426 avatar Apr 22 '25 17:04 lhj5426
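To illustrate what the NMS threshold mentioned above controls, here is a generic greedy non-maximum-suppression sketch in plain Python (an illustration of the technique, not the model's actual post-processing; `iou` and `nms` are hypothetical helper names):

```python
# Sketch of greedy non-maximum suppression: keep the highest-scoring box,
# drop any remaining box whose overlap (IoU) with a kept box exceeds the
# threshold. This is what removes duplicate rectangles on one text block.

def iou(a, b):
    """Intersection-over-union of two [xmin, ymin, xmax, ymax] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def nms(boxes, scores, iou_thresh=0.75):
    """Return indices of boxes kept after greedy suppression."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < iou_thresh for j in keep):
            keep.append(i)
    return keep

# Two near-identical boxes on the same text block plus one distinct box:
boxes = [[10, 10, 110, 60], [12, 11, 111, 61], [200, 10, 300, 60]]
scores = [0.95, 0.90, 0.85]
print(nms(boxes, scores))  # → [0, 2]: the duplicate of box 0 is suppressed
```

Lowering `iou_thresh` suppresses duplicates more aggressively but risks merging genuinely adjacent boxes; raising the confidence threshold instead drops low-score duplicates before NMS runs.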

https://github.com/dmMaze/BallonsTranslator/issues/866 Can you address this issue? Is it possible to set an option to subjectively control whether the rectangles should be merged?

lhj5426 avatar Apr 22 '25 17:04 lhj5426

Image

Listen. And it's not a bad model at all. Of course, I didn't test what you described above. But this one is really good.

bropines avatar Apr 23 '25 09:04 bropines

Also, I plan to write a separate discussion about the parameters of the detectors. I basically read up on YOLO and understood what the parameters are responsible for.

bropines avatar Apr 23 '25 10:04 bropines

Image

"I see you've loaded this model. Can you push an update? I can't wait to try it out."

lhj5426 avatar Apr 23 '25 10:04 lhj5426

I'll make a couple of edits. Plus, I'll check the loading of the model files. Right now it is pulled automatically from the HF server.

bropines avatar Apr 23 '25 14:04 bropines

Image

I’m training an RTDETR with over 27,000 of my own images, and I’m not using the V2 version. https://docs.ultralytics.com/zh/models/rtdetr

lhj5426 avatar Apr 23 '25 14:04 lhj5426

For now, I decided to add what was requested. Then we'll think about how much your version is better than ogkalu's.

bropines avatar Apr 23 '25 14:04 bropines

https://gist.github.com/bropines/d9fd69bec63793220cbb59bc39fbedd0

For now, test it like this. Put the models in the data/models/ctbd folder. In theory, if you trained yours the same way, your model should load automatically, but I'm not sure. I'm a little tired....

bropines avatar Apr 23 '25 17:04 bropines

Well, after the tests.

I can use it, but I probably won't add it. Why? Because this detector detects bubbles and the text in the bubbles, and it does that very well, but unfortunately it does not build a mask of text characters like ctd does. That is, for training it is important for us to create not only a layer of bubbles but also a layer for the mask map, which is then passed to inpainting... in short, there is a lot of work for which I do not have the experience or time. The test module is above.

bropines avatar Apr 24 '25 07:04 bropines
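To make the box-versus-mask distinction concrete, here is a toy sketch that rasterizes detection boxes into a rectangular 0/1 mask (illustrative only; `boxes_to_mask` is a hypothetical helper). It shows why box output is weaker for inpainting: the rectangles cover whole regions rather than individual character strokes, so the inpainter would erase more artwork than a ctd-style mask.

```python
# Sketch: rasterize [xmin, ymin, xmax, ymax] boxes into a coarse 0/1 mask.
# Every pixel inside any box is marked 1 — including background between
# characters, which a stroke-level text mask like ctd's would leave as 0.

def boxes_to_mask(boxes, width, height):
    """Return a height x width grid of 0/1, with 1 inside any box."""
    mask = [[0] * width for _ in range(height)]
    for xmin, ymin, xmax, ymax in boxes:
        for y in range(max(0, ymin), min(height, ymax)):
            for x in range(max(0, xmin), min(width, xmax)):
                mask[y][x] = 1
    return mask

# One 3x2-pixel box on a 6x4 canvas marks 6 pixels:
mask = boxes_to_mask([[1, 1, 4, 3]], width=6, height=4)
print(sum(sum(row) for row in mask))  # → 6
```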

Maybe that's why mit48px can't work with it, while other OCR engines like one-ocr and llm-ocr work well with it.

RoadToNowhereX avatar Apr 24 '25 08:04 RoadToNowhereX